Classification workflows monad
Introduction
In this document we describe the design and implementation of a (software programming) monad for the specification and execution of classification workflows. The design and implementation are done with Mathematica / Wolfram Language (WL).
The goal of the monad design is to make the specification of classification workflows relatively easy and straightforward by following a certain main scenario and specifying variations over that scenario.
The monad is named ClCon and it is based on a State monad package, a classifier ensembles package, and a package for the calculation and plotting of ROC functions.
The data for this document is read from WL’s repository using the resource function ExampleDataset.
The monadic programming design is used as a software design pattern. The ClCon monad can also be seen as a Domain Specific Language (DSL) for the specification and programming of machine learning classification workflows.
Here is an example of using the ClCon monad over the Titanic data:
ClConUnit[dsTitanic]⟹                                              (* lift the data to the monad *)
 ClConSplitData[0.75]⟹                                             (* split the data *)
 ClConMakeClassifier["LogisticRegression"]⟹                        (* create a classifier *)
 ClConClassifierMeasurements[{"Accuracy", "Precision", "Recall"}]⟹ (* compute classifier measurements *)
 ClConEchoValue                                                     (* display the current pipeline value *)
»
value: <|"Accuracy" -> 0.75841, "Precision" -> <|"died" -> 0.818653, "survived" -> 0.671642|>, "Recall" -> <|"died" -> 0.782178, "survived" -> 0.72|>|>
Remark: The table above is produced with a dedicated paclet function, and some of the explanations below also utilize that paclet.
As mentioned above, the monad ClCon can be seen as a DSL. Because of this, monad pipelines made with ClCon are sometimes called “specifications”.
Paclet load
Load the paclet
In[12]:=
Needs["AntonAntonov`MonadicContextualClassification`"]
Data load
In this section we load the data that is used in the rest of the document. The “quick” data is created in order to allow quick, illustrative computations.
Remark: In all datasets the classification labels are in the last column.
The summarization of the data is done through ClCon, which in turn uses the function RecordsSummary from another paclet.
WL resources data
In[81]:=
dsTitanic = ResourceFunction["ExampleDataset"][{"MachineLearning", "Titanic"}];
dsMushroom = ResourceFunction["ExampleDataset"][{"MachineLearning", "Mushroom"}];
dsWineQuality = ResourceFunction["ExampleDataset"][{"MachineLearning", "WineQuality"}];
dsWineQuality = (Join[KeyDrop[#, {"wine quality (score between 1-10)"}], Association["wineQuality" -> #["wine quality (score between 1-10)"]]] &) /@ dsWineQuality;
Here are the dimensions of the datasets:
In[85]:=
Dataset[
 Dataset[
   Map[Prepend[Dimensions[ToExpression[#]], #] &, {"dsTitanic", "dsMushroom", "dsWineQuality"}]
  ][All, AssociationThread[{"name", "rows", "columns"}, #] &]
]
Out[85]= (a table with the columns "name", "rows", and "columns" for the three datasets)
Here is the summary of dsTitanic:
In[86]:=
ClConUnit[dsTitanic]⟹
 ClConSummarizeData;
»
summaries: (column-wise summaries of "passenger class", "passenger age", "passenger sex", and "passenger survival")
Here is the summary of dsMushroom in long form:
In[87]:=
ClConUnit[dsMushroom]⟹
 ClConSummarizeDataLongForm;
»
summaries: (long-form summaries over the columns "RowID", "Variable", and "Value")
Here is the summary of dsWineQuality in long form:
In[88]:=
ClConUnit[dsWineQuality]⟹
 ClConSummarizeDataLongForm;
»
summaries: (long-form summaries over the columns "RowID", "Variable", and "Value")
“Quick” data
In this subsection we make up some data that is used for illustrative purposes.
In[20]:=
SeedRandom[212];
dsData = RandomInteger[{0, 1000}, {100}];
dsData = Dataset[Transpose[{dsData, Mod[dsData, 3], Last@*IntegerDigits /@ dsData, ToString[Mod[#, 3]] & /@ dsData}]];
dsData = Dataset[dsData[All, AssociationThread[{"number", "feature1", "feature2", "label"}, #] &]];
Dimensions[dsData]
Out[24]=
{100,4}
Here is a sample of the data:
In[25]:=
RandomSample[dsData,6]
Out[25]= (a random sample of 6 rows of dsData)
Here is a summary of the data:
In[26]:=
ClConUnit[dsData]⟹
 ClConSummarizeData;
»
summaries: (column-wise summaries of "number", "feature1", "feature2", and "label")
Here we convert the data into a list of record-label rules (and show the summary):
In[27]:=
mlrData = ClConToNormalClassifierData[dsData];
ClConUnit[mlrData]⟹
 ClConSummarizeData;
»
summaries: (column-wise summaries of the three record columns and of the class labels)
Finally, we make the array version of the dataset:
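A minimal sketch of such a conversion (the variable name arrData is an assumption; Values together with Normal drops the column keys):

```mathematica
(* Make a full array version of the "quick" dataset; the last column holds the labels *)
arrData = Normal[dsData[All, Values]];
Dimensions[arrData]
```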
Design considerations
The steps of the main classification workflow addressed in this document follow.
1. Retrieving data from a data repository.
2. Optionally, transforming the data.
3. Splitting the data into training and test parts.
 3.1. Optionally, splitting the training data into training and validation parts.
4. Making a classifier with the training data.
5. Testing the classifier over the test data.
 5.1. Computing different measures, including ROC.
The following diagram shows the steps:
Very often the workflow above is too simple for real situations. When making “real world” classifiers we often have to experiment with different transformations, different classifier algorithms, and different parameters for both the transformations and the classifiers. Examine the following mind-map that outlines the activities in making competition classifiers.
In view of the mind-map above we can come up with the following flow-chart that is an elaboration on the main, simple workflow flow-chart.
In order to address:
◼ the introduction of new elements in classification workflows,
◼ the variability of workflow elements, and
◼ the iterative changing and refining of workflows,
we use a software monad. To quote a common description of monadic programming:
[...] The monad represents computations with a sequential structure: a monad defines what it means to chain operations together. This enables the programmer to build pipelines that process data in a series of steps (i.e. a series of actions applied to the data), in which each action is decorated with the additional processing rules provided by the monad. [...]
Monads allow a programming style where programs are written by putting together highly composable parts, combining in flexible ways the possible actions that can work on a particular type of data. [...]
Monad design
The monad we consider is designed to speed-up the programming of classification workflows outlined in the previous section. The monad is named ClCon for “Classification with Context”.
We want to be able to construct monad pipelines of the general form:
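Such pipelines can be sketched schematically as follows (the operation names are placeholders, not actual functions):

```mathematica
ClConUnit[data]⟹
 ClConSomeOperation1[args1]⟹  (* changes the pipeline value and/or the context *)
 ClConSomeOperation2[args2]⟹
 ClConTakeValue                (* takes the result out of the monad *)
```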
This means that some monad operations will not just change the pipeline value but they will also change the pipeline context.
In the monad pipelines of ClCon we store different objects in the contexts for at least one of the following two reasons.
1. the object will be needed later on in the pipeline, or
2. the object is hard to compute.
Such objects are training data, ROC data, and classifiers.
Let us list the desired properties of the monad.
◼ Rapid specification of non-trivial classification workflows.
◼ The monad works with different data types: Dataset, lists of machine learning rules, full arrays.
◼ The pipeline values can be of different types. Most monad functions modify the pipeline value; some modify the context; some just echo results.
◼ The monad works with single classifier objects and with classifier ensembles.
 ◼ This means support of different classifier measures and ROC plots for both single classifiers and classifier ensembles.
◼ The monad allows cursory examination and summarization of the data.
 ◼ For insight and in order to verify assumptions.
◼ The monad has operations to compute the importance of variables.
◼ We can easily obtain the pipeline value, the context, and different context objects for manipulation outside of the monad.
◼ We can calculate classification measures using a specified ROC parameter and a class label.
◼ We can easily plot different combinations of ROC functions.
The ClCon components and their interaction are given in the following diagram. (The components correspond to the main workflow given in the previous section.)
In the diagram above the operations are given in rectangles. Data objects are given in round corner rectangles and classifier objects are given in round corner squares.
The main ClCon operations implicitly put in the context or utilize from the context the following objects:
◼ training data,
◼ test data,
◼ validation data,
◼ classifier (a classifier function or an association of classifier functions),
◼ ROC data,
◼ variable names list.
Note that the set of types of ClCon pipeline values is fairly heterogeneous, and a certain awareness of “the current pipeline value” is assumed when composing ClCon pipelines.
ClCon overview
When using a monad we lift certain data into the “monad space”; using the monad’s operations we navigate computations in that space, and at some point we take results from it.
With the approach taken in this document the “lifting” into the ClCon monad is done with the function ClConUnit. Results from the monad can be obtained with the functions ClConTakeValue, ClConTakeContext, or with the other ClCon functions with the prefix “ClConTake” (see below).
Here is a corresponding diagram of a generic computation with the ClCon monad:
Remark: It is a good idea to compare the diagram with formulas (1) and (2).
Let us examine a concrete ClCon pipeline that corresponds to the diagram above. In the following table each pipeline operation is combined together with a short explanation and the context keys after its execution.
Here is the output of the pipeline:
In the specified pipeline computation the last column of the dataset is assumed to be the one with the class labels.
The ClCon functions are separated into four groups:
◼ operations,
◼ setters,
◼ takers,
◼ State Monad generic functions.
An overview of those functions is given in the tables in the next two sub-sections. The next section, “Monad elements”, gives details and examples for the usage of the ClCon operations.
Monad elements
In this section we show that ClCon has all of the properties listed in the previous section.
The monad head
The monad head is ClCon. Anything wrapped in ClCon can serve as the monad’s pipeline value. It is better, though, to use the constructor ClConUnit. (Which adheres to the definition in [Wk1].)
Lifting data to the monad
The function lifting the data into the monad ClCon is ClConUnit.
The lifting to the monad marks the beginning of the monadic pipeline. It can be done with data or without data. Examples follow.
(See the sub-section “Setters and takers” for more details of setting and taking values in ClCon contexts.)
Currently the monad can deal with data in the following forms:
◼ datasets,
◼ matrices,
◼ lists of example → label rules.
The ClCon monad also has the non-monadic function ClConToNormalClassifierData, which can be used to convert datasets and matrices to lists of example → label rules. Here is an example:
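For instance, applying that function to the “quick” dataset from above gives a list of rules (the exact values depend on the generated random data):

```mathematica
(* Convert the dataset into a list of example -> label rules *)
mlrData = ClConToNormalClassifierData[dsData];
(* Inspect the first three rules *)
mlrData[[1 ;; 3]]
```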
When the data lifted to the monad is a dataset or a matrix, it is assumed that the last column has the class labels. WL makes it easy to rearrange the columns of a dataset or a matrix so that any chosen column becomes the last one.
Data splitting
The splitting is done with ClConSplitData, which takes up to two arguments and options. The first argument specifies the fraction of training data. The second argument, if given, specifies the fraction of the validation part of the training data. If the value of the option Method is "LabelsProportional", then the splitting is done in proportion to the class label tallies. ("LabelsProportional" is the default value.) Data splitting demonstration examples follow.
Here are the dimensions of the dataset dsData:
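A minimal splitting sketch, assuming that after ClConSplitData the pipeline value is an association of the obtained parts:

```mathematica
res =
  ClConUnit[dsData]⟹
   ClConSplitData[0.7]⟹   (* 70% of the data goes to the training part *)
   ClConTakeValue;
(* Dimensions of each part of the split *)
Map[Dimensions, res]
```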
Note that if Method is not “LabelsProportional” we get slightly different results.
Classifier training
Single classifier training
With the following pipeline we take the Titanic data, split it into 75/25 % parts, train a Logistic Regression classifier, and finally take that classifier from the monad.
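A sketch of that pipeline (the taker name ClConTakeClassifier is assumed from the “ClConTake” prefix convention described earlier):

```mathematica
cf =
  ClConUnit[dsTitanic]⟹
   ClConSplitData[0.75]⟹                       (* 75/25 % split *)
   ClConMakeClassifier["LogisticRegression"]⟹  (* train a single classifier *)
   ClConTakeClassifier;                          (* take the classifier out of the monad *)
```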
Here is information about the obtained classifier:
Classifier ensemble training
The classifier ensemble is simply an association with keys that are automatically assigned names and corresponding values that are classifiers.
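A sketch of ensemble training, assuming that giving ClConMakeClassifier a list of method names produces an ensemble:

```mathematica
ensemble =
  ClConUnit[dsTitanic]⟹
   ClConSplitData[0.75]⟹
   ClConMakeClassifier[{"LogisticRegression", "RandomForest"}]⟹ (* assumed ensemble spec *)
   ClConTakeClassifier;                                           (* taker name assumed *)
(* The ensemble is an association of named classifier functions *)
Keys[ensemble]
```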
Here are the training times of the classifiers in the obtained ensemble:
A more precise specification can be given using associations.
Here is a pipeline specification equivalent to the pipeline specification above:
Classifier testing
Classifier testing is done with the testing data in the context.
Here is a pipeline that takes the Titanic data, splits it, and trains a classifier:
Here is how we compute selected classifier measures:
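This mirrors the introductory example; here is a self-contained sketch:

```mathematica
ClConUnit[dsTitanic]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConClassifierMeasurements[{"Accuracy", "Precision", "Recall"}]⟹ (* selected measures *)
 ClConEchoValue;
```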
Here we show the confusion matrix plot:
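A sketch, assuming that ClConClassifierMeasurements passes through the ClassifierMeasurements property "ConfusionMatrixPlot":

```mathematica
ClConUnit[dsTitanic]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConClassifierMeasurements["ConfusionMatrixPlot"]⟹ (* assumed pass-through property *)
 ClConEchoValue;
```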
Here is how we plot ROC curves by specifying the ROC parameter range and the image size:
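A sketch only; the plotting function arguments and the option name "ROCRange" are assumptions:

```mathematica
ClConUnit[dsTitanic]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConROCPlot["FPR", "TPR", "ROCRange" -> Range[0, 1, 0.05], ImageSize -> 300];
```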
Note that the “ClConROC*Plot” functions automatically echo the plots. The plots are also made to be the pipeline value. Using the option specification "Echo" → False the automatic echoing of plots can be suppressed. With the option "ClassLabels" we can focus on specific class labels.
Variable importance finding
Using the option “ClassLabels” we can focus on specific class labels:
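A sketch, assuming the shuffling-based importance function ClConAccuracyByVariableShuffling:

```mathematica
ClConUnit[dsTitanic]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConAccuracyByVariableShuffling["ClassLabels" -> {"survived"}]⟹ (* focus on one label *)
 ClConEchoValue;
```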
Setters and takers
The values from the monad context can be set or obtained with the corresponding “setter” and “taker” functions as summarized in a previous section.
For example:
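A sketch using the generic function ClConTakeContext to pull the context out of a partially executed pipeline:

```mathematica
p = ClConUnit[dsTitanic]⟹ClConSplitData[0.75];
(* Take the whole context and inspect which objects were stored by the split *)
context = p⟹ClConTakeContext;
Keys[context]
```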
If other values are put in the context they can be obtained through the (generic) function ClConTakeContext, [AAp1]:
Another generic function from [AAp1] is ClConTakeValue (used many times above.)
Example use cases
Classification with MNIST data
Here we plot the ROC curve for a specified digit:
Conditional continuation
In this sub-section we show how the computations in a ClCon pipeline can be stopped or continued based on a certain condition.
The pipeline below makes a simple classifier (“LogisticRegression”) for the WineQuality data, and if the recall for the important label (“high”) is not large enough makes a more complicated classifier (“RandomForest”). The pipeline marks intermediate steps by echoing outcomes and messages.
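A minimal sketch of such branching; the branching function ClConIfElse (applying its test function to the current pipeline value), the echoing function ClConEcho, and the recall threshold 0.7 are assumptions:

```mathematica
ClConUnit[dsWineQuality]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConClassifierMeasurements[{"Recall"}]⟹
 ClConIfElse[
  #["Recall"]["high"] < 0.7 &,          (* is the recall for "high" too low? *)
  ClConMakeClassifier["RandomForest"],  (* yes: make a more complicated classifier *)
  ClConEcho["Recall is large enough."]];
```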
We can see that the recall with the more complicated classifier is higher. Also, the ROC plots of the second classifier are visibly closer to the ideal one. Still, the recall is not good enough; we have to find a threshold that is better than the default one. (See the next sub-section.)
Classification with custom thresholds
(In this sub-section we use the monad from the previous sub-section.)
We can see that the recall for “high” is fairly large and the rest of the measures have satisfactory values. (The accuracy did not drop that much, and the false positive rate is not that large.)
Here we compute suggestions for the best thresholds:
Here is a way to use threshold suggestions within the monad pipeline:
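A sketch, assuming a hypothetical measurement function that accepts a per-label threshold specification (the name ClConClassifierMeasurementsByThreshold and the threshold value 0.3 are assumptions):

```mathematica
ClConUnit[dsWineQuality]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConClassifierMeasurementsByThreshold[{"Accuracy", "Precision", "Recall"}, "high" -> 0.3]⟹
 ClConEchoValue;
```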