Classification workflows monad
Introduction
In this document we describe the design and implementation of a (software programming) monad for the specification and execution of classification workflows. The design and implementation are done with Mathematica / Wolfram Language (WL).
The goal of the monad design is to make the specification of classification workflows relatively easy and straightforward by following a certain main scenario and specifying variations over that scenario.
The monad is named ClCon and it is based on a State monad package, a classifier ensembles package, and a package for the calculation and plotting of ROC functions.
The data for this document is read from WL’s repository using the resource function ExampleDataset.
The monadic programming design is used as a software design pattern. The ClCon monad can also be seen as a Domain Specific Language (DSL) for the specification and programming of machine learning classification workflows.
Here is an example of using the ClCon monad over the Titanic data:
ClConUnit[dsTitanic]⟹                                              (* lift the data to the monad *)
 ClConSplitData[0.75]⟹                                             (* split the data *)
 ClConMakeClassifier["LogisticRegression"]⟹                        (* create a classifier *)
 ClConClassifierMeasurements[{"Accuracy", "Precision", "Recall"}]⟹ (* compute classifier measurements *)
 ClConEchoValue                                                     (* display the current pipeline value *)
»
value: <|"Accuracy" -> 0.75841, "Precision" -> <|"died" -> 0.818653, "survived" -> 0.671642|>, "Recall" -> <|"died" -> 0.782178, "survived" -> 0.72|>|>
Remark: The table above is produced with a dedicated paclet function, and some of the explanations below also utilize that paclet.
As mentioned above, the monad ClCon can be seen as a DSL. Because of this, monad pipelines made with ClCon are sometimes called “specifications”.
Paclet load
Load the paclet
In[12]:=
Needs["AntonAntonov`MonadicContextualClassification`"]
Data load
In this section we load the data that is used in the rest of the document. The “quick” data is created in order to allow quick, illustrative computations.
Remark: In all datasets the classification labels are in the last column.
The summarization of the data is done through ClCon, which in turn uses the function RecordsSummary from another paclet.
WL resources data
In[81]:=
dsTitanic = ResourceFunction["ExampleDataset"][{"MachineLearning", "Titanic"}];
dsMushroom = ResourceFunction["ExampleDataset"][{"MachineLearning", "Mushroom"}];
dsWineQuality = ResourceFunction["ExampleDataset"][{"MachineLearning", "WineQuality"}];
dsWineQuality = (Join[KeyDrop[#, {"wine quality (score between 1-10)"}], Association["wineQuality" -> #["wine quality (score between 1-10)"]]] &) /@ dsWineQuality;
Here are the dimensions of the datasets:
In[85]:=
Dataset[
 Dataset[
   Map[Prepend[Dimensions[ToExpression[#]], #] &, {"dsTitanic", "dsMushroom", "dsWineQuality"}]
  ][All, AssociationThread[{"name", "rows", "columns"}, #] &]
]
Out[85]= (a table with the columns "name", "rows", and "columns" for the three datasets)
Here is the summary of dsTitanic:
In[86]:=
ClConUnit[dsTitanic]⟹
 ClConSummarizeData;
»
summaries: (column-wise summaries of "passenger class", "passenger age", "passenger sex", and "passenger survival")
Here is the summary of dsMushroom in long form:
In[87]:=
ClConUnit[dsMushroom]⟹
 ClConSummarizeDataLongForm;
»
summaries: (long-form summaries over the columns "RowID", "Variable", and "Value")
Here is the summary of dsWineQuality in long form:
In[88]:=
ClConUnit[dsWineQuality]⟹
 ClConSummarizeDataLongForm;
»
summaries: (long-form summaries over the columns "RowID", "Variable", and "Value")
“Quick” data
In this subsection we make up some data that is used for illustrative purposes.
In[20]:=
SeedRandom[212];
dsData = RandomInteger[{0, 1000}, {100}];
dsData = Dataset[Transpose[{dsData, Mod[dsData, 3], Last@*IntegerDigits /@ dsData, ToString[Mod[#, 3]] & /@ dsData}]];
dsData = Dataset[dsData[All, AssociationThread[{"number", "feature1", "feature2", "label"}, #] &]];
Dimensions[dsData]
Out[24]=
{100,4}
Here is a sample of the data:
In[25]:=
RandomSample[dsData,6]
Out[25]= (a random sample of 6 rows of dsData)
Here is a summary of the data:
In[26]:=
ClConUnit[dsData]⟹
 ClConSummarizeData;
»
summaries: (column-wise summaries of "number", "feature1", "feature2", and "label")
Here we convert the data into a list of record-label rules (and show the summary):
In[27]:=
mlrData = ClConToNormalClassifierData[dsData];
ClConUnit[mlrData]⟹
 ClConSummarizeData;
»
summaries: (column-wise summaries of the three record columns and of the class labels)
Finally, we make the array version of the dataset:
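A minimal sketch of such a conversion (the variable name arrData is an assumption; Values together with Normal drops the column keys):

```mathematica
(* Make a full array version of the "quick" dataset; the last column holds the labels *)
arrData = Normal[dsData[All, Values]];
Dimensions[arrData]
```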
Design considerations
The steps of the main classification workflow addressed in this document follow.
1. Retrieving data from a data repository.
2. Optionally, transforming the data.
3. Splitting the data into training and test parts.
 3.1. Optionally, splitting the training data into training and validation parts.
4. Making a classifier with the training data.
5. Testing the classifier over the test data.
 5.1. Computing different measures, including ROC.
The following diagram shows the steps:
Very often the workflow above is too simple for real situations. When making “real world” classifiers we often have to experiment with different transformations, different classifier algorithms, and different parameters for both the transformations and the classifiers. Examine the following mind-map that outlines the activities in making competition classifiers.
In view of the mind-map above we can come up with the following flow-chart that is an elaboration on the main, simple workflow flow-chart.
In order to address:
◼ the introduction of new elements in classification workflows,
◼ the variability of workflow elements, and
◼ the iterative changing and refining of workflows,
we use a software monad. To quote a common description of monadic programming:
[...] The monad represents computations with a sequential structure: a monad defines what it means to chain operations together. This enables the programmer to build pipelines that process data in a series of steps (i.e. a series of actions applied to the data), in which each action is decorated with the additional processing rules provided by the monad. [...]
Monads allow a programming style where programs are written by putting together highly composable parts, combining in flexible ways the possible actions that can work on a particular type of data. [...]
Monad design
The monad we consider is designed to speed-up the programming of classification workflows outlined in the previous section. The monad is named ClCon for “Classification with Context”.
We want to be able to construct monad pipelines of the general form:
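Such pipelines can be sketched schematically as follows (the operation names are placeholders, not actual functions):

```mathematica
ClConUnit[data]⟹
 ClConSomeOperation1[args1]⟹  (* changes the pipeline value and/or the context *)
 ClConSomeOperation2[args2]⟹
 ClConTakeValue                (* takes the result out of the monad *)
```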
This means that some monad operations will not just change the pipeline value but they will also change the pipeline context.
In the monad pipelines of ClCon we store different objects in the contexts for at least one of the following two reasons.
1. the object will be needed later on in the pipeline, or
2. the object is hard to compute.
Such objects are training data, ROC data, and classifiers.
Let us list the desired properties of the monad.
◼ Rapid specification of non-trivial classification workflows.
◼ The monad works with different data types: Dataset, lists of machine learning rules, full arrays.
◼ The pipeline values can be of different types. Most monad functions modify the pipeline value; some modify the context; some just echo results.
◼ The monad works with single classifier objects and with classifier ensembles.
 ◼ This means support of different classifier measures and ROC plots for both single classifiers and classifier ensembles.
◼ The monad allows cursory examination and summarization of the data.
 ◼ For insight and in order to verify assumptions.
◼ The monad has operations to compute the importance of variables.
◼ We can easily obtain the pipeline value, the context, and different context objects for manipulation outside of the monad.
◼ We can calculate classification measures using a specified ROC parameter and a class label.
◼ We can easily plot different combinations of ROC functions.
The ClCon components and their interaction are given in the following diagram. (The components correspond to the main workflow given in the previous section.)
In the diagram above the operations are given in rectangles. Data objects are given in round corner rectangles and classifier objects are given in round corner squares.
The main ClCon operations implicitly put in the context or utilize from the context the following objects:
◼ training data,
◼ test data,
◼ validation data,
◼ classifier (a classifier function or an association of classifier functions),
◼ ROC data,
◼ variable names list.
Note that the set of types of ClCon pipeline values is fairly heterogeneous, and a certain awareness of “the current pipeline value” is assumed when composing ClCon pipelines.
ClCon overview
When using a monad we lift certain data into the “monad space”; using the monad’s operations we navigate computations in that space, and at some point we take results from it.
With the approach taken in this document the “lifting” into the ClCon monad is done with the function ClConUnit. Results from the monad can be obtained with the functions ClConTakeValue, ClConTakeContext, or with the other ClCon functions with the prefix “ClConTake” (see below).
Here is a corresponding diagram of a generic computation with the ClCon monad:
Remark: It is a good idea to compare the diagram with formulas (1) and (2).
Let us examine a concrete ClCon pipeline that corresponds to the diagram above. In the following table each pipeline operation is combined together with a short explanation and the context keys after its execution.
Here is the output of the pipeline:
In the specified pipeline computation the last column of the dataset is assumed to be the one with the class labels.
The ClCon functions are separated into four groups:
◼ operations,
◼ setters,
◼ takers,
◼ State Monad generic functions.
An overview of those functions is given in the tables in the next two sub-sections. The next section, “Monad elements”, gives details and examples for the usage of the ClCon operations.
Monad elements
In this section we show that ClCon has all of the properties listed in the previous section.
The monad head
The monad head is ClCon. Anything wrapped in ClCon can serve as the monad’s pipeline value. It is better, though, to use the constructor ClConUnit. (Which adheres to the definition in [Wk1].)
Lifting data to the monad
The function lifting the data into the monad ClCon is ClConUnit.
The lifting to the monad marks the beginning of the monadic pipeline. It can be done with data or without data. Examples follow.
(See the sub-section “Setters and takers” for more details of setting and taking values in ClCon contexts.)
Currently the monad can deal with data in the following forms:
◼ datasets,
◼ matrices,
◼ lists of example → label rules.
The ClCon monad also has the non-monadic function ClConToNormalClassifierData, which can be used to convert datasets and matrices to lists of example → label rules. Here is an example:
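For instance, applying that function to the “quick” dataset from above gives a list of rules (the exact values depend on the generated random data):

```mathematica
(* Convert the dataset into a list of example -> label rules *)
mlrData = ClConToNormalClassifierData[dsData];
(* Inspect the first three rules *)
mlrData[[1 ;; 3]]
```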
When the data lifted to the monad is a dataset or a matrix, it is assumed that the last column has the class labels. WL makes it easy to rearrange the columns of a dataset or a matrix so that any chosen column becomes the last one.
Data splitting
The splitting is done with ClConSplitData, which takes up to two arguments and options. The first argument specifies the fraction of training data. The second argument, if given, specifies the fraction of the validation part of the training data. If the value of the option Method is "LabelsProportional", then the splitting is done in proportion to the class label tallies. ("LabelsProportional" is the default value.) Data splitting demonstration examples follow.
Here are the dimensions of the dataset dsData:
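A minimal splitting sketch, assuming that after ClConSplitData the pipeline value is an association of the obtained parts:

```mathematica
res =
  ClConUnit[dsData]⟹
   ClConSplitData[0.7]⟹   (* 70% of the data goes to the training part *)
   ClConTakeValue;
(* Dimensions of each part of the split *)
Map[Dimensions, res]
```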
Note that if Method is not “LabelsProportional” we get slightly different results.
Classifier training
Single classifier training
With the following pipeline we take the Titanic data, split it into 75/25 % parts, train a Logistic Regression classifier, and finally take that classifier from the monad.
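A sketch of that pipeline (the taker name ClConTakeClassifier is assumed from the “ClConTake” prefix convention described earlier):

```mathematica
cf =
  ClConUnit[dsTitanic]⟹
   ClConSplitData[0.75]⟹                       (* 75/25 % split *)
   ClConMakeClassifier["LogisticRegression"]⟹  (* train a single classifier *)
   ClConTakeClassifier;                          (* take the classifier out of the monad *)
```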
Here is information about the obtained classifier:
Classifier ensemble training
The classifier ensemble is simply an association with keys that are automatically assigned names and corresponding values that are classifiers.
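A sketch of ensemble training, assuming that giving ClConMakeClassifier a list of method names produces an ensemble:

```mathematica
ensemble =
  ClConUnit[dsTitanic]⟹
   ClConSplitData[0.75]⟹
   ClConMakeClassifier[{"LogisticRegression", "RandomForest"}]⟹ (* assumed ensemble spec *)
   ClConTakeClassifier;                                           (* taker name assumed *)
(* The ensemble is an association of named classifier functions *)
Keys[ensemble]
```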
Here are the training times of the classifiers in the obtained ensemble:
A more precise specification can be given using associations.
Here is a pipeline specification equivalent to the pipeline specification above:
Classifier testing
Classifier testing is done with the testing data in the context.
Here is a pipeline that takes the Titanic data, splits it, and trains a classifier:
Here is how we compute selected classifier measures:
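This mirrors the introductory example; here is a self-contained sketch:

```mathematica
ClConUnit[dsTitanic]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConClassifierMeasurements[{"Accuracy", "Precision", "Recall"}]⟹ (* selected measures *)
 ClConEchoValue;
```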
Here we show the confusion matrix plot:
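A sketch, assuming that ClConClassifierMeasurements passes through the ClassifierMeasurements property "ConfusionMatrixPlot":

```mathematica
ClConUnit[dsTitanic]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConClassifierMeasurements["ConfusionMatrixPlot"]⟹ (* assumed pass-through property *)
 ClConEchoValue;
```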
Here is how we plot ROC curves by specifying the ROC parameter range and the image size:
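A sketch only; the plotting function arguments and the option name "ROCRange" are assumptions:

```mathematica
ClConUnit[dsTitanic]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConROCPlot["FPR", "TPR", "ROCRange" -> Range[0, 1, 0.05], ImageSize -> 300];
```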
Note that the “ClConROC*Plot” functions automatically echo the plots. The plots are also made to be the pipeline value. Using the option specification "Echo" → False the automatic echoing of plots can be suppressed. With the option "ClassLabels" we can focus on specific class labels.
Variable importance finding
Using the option “ClassLabels” we can focus on specific class labels:
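A sketch, assuming the shuffling-based importance function ClConAccuracyByVariableShuffling:

```mathematica
ClConUnit[dsTitanic]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConAccuracyByVariableShuffling["ClassLabels" -> {"survived"}]⟹ (* focus on one label *)
 ClConEchoValue;
```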
Setters and takers
The values from the monad context can be set or obtained with the corresponding “setter” and “taker” functions as summarized in a previous section.
For example:
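A sketch using the generic function ClConTakeContext to pull the context out of a partially executed pipeline:

```mathematica
p = ClConUnit[dsTitanic]⟹ClConSplitData[0.75];
(* Take the whole context and inspect which objects were stored by the split *)
context = p⟹ClConTakeContext;
Keys[context]
```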
If other values are put in the context they can be obtained through the (generic) function ClConTakeContext, [AAp1]:
Another generic function from [AAp1] is ClConTakeValue (used many times above.)
Example use cases
Classification with MNIST data
Here we plot the ROC curve for a specified digit:
Conditional continuation
In this sub-section we show how the computations in a ClCon pipeline can be stopped or continued based on a certain condition.
The pipeline below makes a simple classifier (“LogisticRegression”) for the WineQuality data, and if the recall for the important label (“high”) is not large enough makes a more complicated classifier (“RandomForest”). The pipeline marks intermediate steps by echoing outcomes and messages.
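A minimal sketch of such branching; the branching function ClConIfElse (applying its test function to the current pipeline value), the echoing function ClConEcho, and the recall threshold 0.7 are assumptions:

```mathematica
ClConUnit[dsWineQuality]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConClassifierMeasurements[{"Recall"}]⟹
 ClConIfElse[
  #["Recall"]["high"] < 0.7 &,          (* is the recall for "high" too low? *)
  ClConMakeClassifier["RandomForest"],  (* yes: make a more complicated classifier *)
  ClConEcho["Recall is large enough."]];
```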
We can see that the recall with the more complicated classifier is higher. Also, the ROC plots of the second classifier are visibly closer to the ideal one. Still, the recall is not good enough; we have to find a threshold that is better than the default one. (See the next sub-section.)
Classification with custom thresholds
(In this sub-section we use the monad from the previous sub-section.)
We can see that the recall for “high” is fairly large and the rest of the measures have satisfactory values. (The accuracy did not drop that much, and the false positive rate is not that large.)
Here we compute suggestions for the best thresholds:
Here is a way to use threshold suggestions within the monad pipeline:
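A sketch, assuming a hypothetical measurement function that accepts a per-label threshold specification (the name ClConClassifierMeasurementsByThreshold and the threshold value 0.3 are assumptions):

```mathematica
ClConUnit[dsWineQuality]⟹
 ClConSplitData[0.75]⟹
 ClConMakeClassifier["LogisticRegression"]⟹
 ClConClassifierMeasurementsByThreshold[{"Accuracy", "Precision", "Recall"}, "high" -> 0.3]⟹
 ClConEchoValue;
```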