This notebook implements a machine learning classification algorithm for functional group identification in vibrational spectroscopy. In chemistry, functional groups are particular arrangements of atoms and bonds that appear in many different molecules. This analysis focuses on the carbonyl functional group, which contains a carbon atom double bonded to an oxygen atom. The carbonyl functional group itself appears in a number of other functional groups, including ketones and carboxylic acids. Functional groups have distinct signatures in vibrational spectroscopy due to their characteristic arrangement of atoms and bonds. The problem of functional group identification entails predicting which functional groups are present in a particular molecule given its spectrum. This notebook walks through the steps to train a multiclass machine learning classification model to predict whether molecules contain ketones, carboxylic acids, or other carbonyl groups based on their infrared (IR) absorption spectra. The model performance is then evaluated using a test data set. This activity was developed at Fordham University for a physical chemistry laboratory course taken by junior and senior chemistry majors.
Objectives
◼
Gain proficiency in reading and modifying Mathematica code in the notebook environment.
◼
Build machine learning multiclass classification models to predict whether a molecule contains a ketone, a carboxylic acid, or another type of carbonyl functional group using IR spectroscopy data.
Initialization
The Synthetic Minority Oversampling Technique (SMOTE) function is provided as an initialization cell and will be loaded when you first evaluate anything in this notebook. It will be used during the data preprocessing step.
DefineSMOTEfunction
Get Data
We will need a training dataset and a test dataset. These data are computational IR absorption spectra obtained from the Alexandria Library (https://www.nature.com/articles/sdata201862).
◼
In this case, all molecules in both datasets contain a carbonyl group, so the three classes are: ketone, carboxylic acid, or other carbonyl group.
◼
The training data is saved in three separate files: one with a label indicating whether or not a ketone is present, one with a label indicating whether or not a carboxylic acid is present, and one with a label indicating whether or not another carbonyl is present (i.e., whether neither ketone nor carboxylic acid are present).
◼
The test data set is stored in a single file with three columns corresponding to the labels for ketone, carboxylic acid, and other.
◼
In all cases, the presence of a particular group is indicated by the label 1 and the absence by the label 0.
Run the following code cell to retrieve the training and test data for this exercise:
Let’s display the contents of a few of these variables to familiarize ourselves with their structure. First, the first 10 rows of the ketone training data:
In[]:=
ketone[[1;;10]]
Out[]=
Then the test data:
In[]:=
testData[[1;;10]]
Out[]=
Data Preprocessing
We will need to preprocess the data before carrying out the machine learning analysis. We will perform four preprocessing steps:
◼
Split the attribute and label
◼
Normalization
◼
Thresholding
◼
Data balancing
Split Attribute and Label
Notice that the training and test data contain the molecule name and label in addition to the spectral data. Our first task is to separate the information about whether or not the molecule contains a carbonyl (the attribute we want to predict, let’s call it “Y”) from the spectral intensities (the input features that we will use to make a decision, let’s call it “X”). We will also convert this into a form that is more amenable to machine learning analysis, where our goal is to find a function that describes Y = f(X). We begin by defining a function that splits the attributes and labels:
◼
The first argument in the square brackets following the function name represents the data to be split.
◼
The second argument, startX, represents the column index where the frequency data starts. If you don't provide this argument, the function uses a default value of 5.
◼
The third argument, endX, represents the column index where the frequency data end. If you don’t provide this argument, the function uses a default value of -1 (the last element).
◼
The fourth argument, startY, represents the first column containing the label data. If you don’t provide this argument, the function uses a default value of 4.
◼
The fifth argument, endY, represents the last column containing the label data. If you don’t provide this argument, the function uses a default value of 4.
◼
As an example, if the frequency data in train starts in column 5 and the label data is in column 4, you would write: splitXY[train, 5, -1, 4].
This function returns a list of entries, where each item is of the form x → y, with x the list of inputs and y the output value. This is one of the many formats that Mathematica can use for machine learning tasks.
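The actual definition lives in a hidden code cell, but based on the argument descriptions above, a minimal sketch of splitXY might look like this (the exact implementation in the notebook may differ):

```wolfram
(* Hypothetical sketch of splitXY: split each row of the raw data into
   an inputs -> label entry, using the default column positions
   described above (features in columns 5 through -1, label in column 4). *)
splitXY[data_, startX_: 5, endX_: -1, startY_: 4, endY_: 4] :=
  Map[
    #[[startX ;; endX]] ->
      If[startY == endY, #[[startY]], #[[startY ;; endY]]] &,
    data]
```

When startY equals endY the label is returned as a single value rather than a one-element list, which is the form Classify expects for a single output.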
Plotting Spectra
Before continuing, let’s look at the spectra of a few molecules to see what they look like. We will use the splitXY function, but in general we will not look at the lists of data that it returns directly, so it is useful to suppress the output by putting a semicolon at the end of the line. You can choose which spectra to plot by changing the index values contained in the exampleIndices list below:
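A plotting cell along these lines could produce the spectra (this is a sketch; the exampleIndices values and plot options in the actual notebook may differ):

```wolfram
(* Sketch: split the ketone data, suppress the long output with a
   semicolon, and plot a few example spectra by index. *)
ketoneXY = splitXY[ketone];
exampleIndices = {1, 2, 3};
ListLinePlot[
 ketoneXY[[exampleIndices, 1]],  (* the inputs (x) of each chosen entry *)
 PlotLegends -> exampleIndices,
 AxesLabel -> {"frequency index", "intensity"}]
```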
In practice, different IR spectra may be recorded at different molecular concentrations, so the absolute intensities may not be directly comparable. Therefore we will normalize the data before carrying out the analysis.
We will apply a type of normalization called min-max normalization to each "instance" (i.e., molecule) and update the data.
◼
For each molecule, the spectral intensities will be scaled to range from 0 to 1.
◼
We will use the Rescale function to perform this process.
We will define a function called normalize to carry out this normalization process. Specifically, normalize will act on a single entry, and we will then give it the Listable attribute, meaning that it will automatically apply to each entry in a list.
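A minimal sketch of this definition, assuming each entry has the x → y form produced by splitXY:

```wolfram
(* Sketch: min-max normalization per molecule. Rescale maps the
   intensities in each x onto the range [0, 1]; the label y is unchanged. *)
normalize[x_ -> y_] := Rescale[x] -> y
SetAttributes[normalize, Listable]
```

Because normalize is Listable, calling normalize on a whole dataset threads it over every entry without an explicit Map.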
We expect that intensities near 0 won’t provide much useful information for the classification. Therefore we will choose a threshold intensity and set all intensity values below the threshold equal to 0.
We will define a function called applyThreshold to apply the threshold chosen above to the training and test data; conveniently we can make use of the built-in Threshold function; the only trick is to apply it only to our “inputs” and keep our “outputs” the same:
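A sketch of applyThreshold along these lines (the cutoff value 0.1 here is an illustrative assumption, not the value chosen in the notebook):

```wolfram
(* Sketch: zero out intensities below a cutoff, applied only to the
   inputs of each x -> y entry; the label is passed through unchanged. *)
applyThreshold[data_, cutoff_: 0.1] :=
  Map[Threshold[First[#], {"Hard", cutoff}] -> Last[#] &, data]
```

Hard thresholding sets every value with magnitude at or below the cutoff to 0 and leaves larger values unchanged, which is exactly the behavior described above.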
The three classes (ketone, carboxylic acid, and other carbonyl group) are imbalanced in the training dataset. Thus we need to balance the data before training a machine learning model. We will use the synthetic minority oversampling technique (SMOTE) for this data balancing step. Because there are three sets of training data, we will need to use SMOTE on each one separately.
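Conceptually, this step applies the SMOTE function from the initialization cell to each training set in turn. A sketch, assuming the hidden cell defines a function named SMOTE that takes and returns data in the x → y format (its actual name and argument form may differ):

```wolfram
(* Sketch: balance each of the three training sets separately with the
   SMOTE function defined in the initialization cell. The variable
   names for the acid and other-carbonyl sets are assumptions. *)
balancedKetone = SMOTE[trainKetone];
balancedAcid   = SMOTE[trainAcid];
balancedOther  = SMOTE[trainOther];
```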
We will use the Random Forest machine learning algorithm for this multiclass classification task. The first step is model training using the training data. Instead of training a single model, we will train three different models: one for the ketone training data, one for the carboxylic acid training data, and one for the other carbonyl training data. We will train three additional models using the corresponding balanced training data. We will store the trained models in an Association:
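The training step might look like the following sketch (the key names and the balanced-data variable names are assumptions; only trainKetone and the key "ketone" appear explicitly later in the notebook):

```wolfram
(* Sketch: train six Random Forest classifiers, one per training set,
   and store them in an Association keyed by a descriptive name. *)
models = <|
   "ketone" -> Classify[trainKetone, Method -> "RandomForest"],
   "acid"   -> Classify[trainAcid, Method -> "RandomForest"],
   "other"  -> Classify[trainOther, Method -> "RandomForest"],
   "ketone balanced" -> Classify[balancedKetone, Method -> "RandomForest"],
   "acid balanced"   -> Classify[balancedAcid, Method -> "RandomForest"],
   "other balanced"  -> Classify[balancedOther, Method -> "RandomForest"]|>;
```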
Now we will use the trained machine learning models for label prediction on the test dataset. We will use each of the six trained models on the same test input data. We will store the predicted labels in the Association predictions, with the keys describing each of the models.
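Because Map over an Association acts on the values while keeping the keys, this step can be written very compactly. A sketch, assuming testX holds the input features extracted from the test data:

```wolfram
(* Sketch: apply every trained classifier to the same test inputs;
   the keys of predictions match the keys of models. *)
predictions = Map[#[testX] &, models];
```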
We can also obtain the probability of a given label (i.e., how confident the model is in the prediction). Let’s take a look at a quick example. Suppose that we wanted to apply our classifier to a few examples from our ketone training set, trainKetone. As we saw previously we can apply the classifier function, stored in the models Association, with the key “ketone”, and use it to generate the predicted label:
We can also provide the classifier function with an optional second argument, which will report the probabilities for each type of class, in the form of an Association:
Alternatively, if we are just interested in the probability that the label is 1 (is a ketone), then we can use the "Probability" → outcome label form to report only that outcome:
In[]:=
models["ketone"][examples, "Probability" -> 1]
Out[]=
{0.0725198,0.0827249}
Having seen how the pieces work, let’s evaluate the probabilities for each of the individual models, storing the results in an association, probabilities, whose keys describe which model is used to make the prediction:
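Following the pattern of the predictions step, this can be sketched as a single Map over the models Association (testX is again an assumed name for the test inputs):

```wolfram
(* Sketch: probability that the label is 1 for each model, keyed the
   same way as the models Association. *)
probabilities = Map[#[testX, "Probability" -> 1] &, models];
```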
Now that we have the probabilities of each label prediction, we can determine the most probable label as the one with the highest probability. For example, if the probabilities are:
◼
Ketone: 0.20
◼
Carboxylic acid: 0.79
◼
Other carbonyl: 0.15
...then the final prediction would be carboxylic acid. (Note that the probabilities do not have to sum to 1 here, as they are probabilities for each independent yes-no prediction.)
Run the code block below to determine the most probable prediction and display a Dataset comparing the probabilities, the prediction, and the actual label (i.e., the “truth”); each of the categories is assigned an integer code 1=ketone, 2 = carboxylic acid, 3=other:
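One way the hidden code block could compute this (a sketch; it assumes the three unbalanced models are used and that probabilities has the keys "ketone", "acid", and "other"):

```wolfram
(* Sketch: for each test molecule, pick the class whose model reports
   the highest probability; codes follow the text
   (1 = ketone, 2 = carboxylic acid, 3 = other). *)
classProbs = Transpose[
   {probabilities["ketone"], probabilities["acid"], probabilities["other"]}];
predictedClass = Map[First[Ordering[#, -1]] &, classProbs];
```

Ordering[list, -1] returns the position of the largest element, which is exactly the integer class code here.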
Although there is only one way for a prediction to be correct, there are multiple ways for a prediction to be incorrect here because there are multiple labels. Run the code block below to generate a table showing the different combinations of correct and incorrect predictions; the first column indicates the predicted label, the second column shows the true label, and the final column is the number of occurrences:
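A sketch of how such a table could be built with Tally, assuming predictedClass and trueClass hold the integer-coded predictions and truth labels:

```wolfram
(* Sketch: count every (prediction, truth) combination and display the
   counts as a table. *)
counts = SortBy[Tally[Transpose[{predictedClass, trueClass}]], First];
TableForm[Flatten /@ counts,
 TableHeadings -> {None, {"prediction", "truth", "count"}}]
```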
This data can be arranged in the format of a confusion matrix like the ones from the first lab, where the rows are the predictions and the columns are the true values. (But to create more confusion, sometimes these matrices are presented with the rows and columns swapped.):
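The same counts can be arranged into a 3×3 matrix directly. A sketch, again assuming integer-coded predictedClass and trueClass:

```wolfram
(* Sketch: confusion matrix with rows = predictions, columns = truth. *)
confusion = Table[
   Count[Transpose[{predictedClass, trueClass}], {i, j}],
   {i, 3}, {j, 3}];
MatrixForm[confusion]
```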
To determine the overall performance of the model, we can calculate an overall accuracy score, equal to the total number of correct predictions divided by the total number of predictions. Run the code blocks below to determine and display the overall accuracy:
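In sketch form, the overall accuracy is just the fraction of matching prediction/truth pairs (predictedClass and trueClass are assumed integer-coded lists):

```wolfram
(* Sketch: total correct predictions divided by total predictions. *)
accuracy = N[
  Count[MapThread[Equal, {predictedClass, trueClass}], True]/
   Length[trueClass]]
```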
We can also determine accuracy scores for each of the three classes. For example, the ketone class accuracy score would be the total number of correct ketone predictions divided by the total number of true ketones in the test data. We do this by grouping the results by the truth labels and then computing the accuracy on each subset. Run the code blocks below to determine and display the accuracy for each (true) class:
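The grouping step can be sketched with GroupBy, which splits the (prediction, truth) pairs by their true label and applies an accuracy calculation to each group:

```wolfram
(* Sketch: per-class accuracy. The pattern {t_, t_} matches a pair
   whose prediction equals its truth label. *)
pairs = Transpose[{predictedClass, trueClass}];
classAccuracy = GroupBy[pairs, Last,
   N[Count[#, {t_, t_}]/Length[#]] &]
```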
The analysis in this notebook can be modified or extended in a number of different ways:
◼
Changing data preprocessing or machine learning analysis parameters
◼
Using another machine learning classification algorithm (such as k-Nearest Neighbors or Naive Bayes) instead of Random Forest
◼
Extending the analysis to other carbonyl-containing functional groups (such as esters or aldehydes)
◼
Using the same multiclass classification framework to distinguish between the IR spectra of carbonyls and other functional groups (such as alcohols, amines, alkenes, etc.)
◼
Using the trained machine learning model for functional group prediction in experimental or computational IR spectra datasets
Acknowledgments
We thank Mark Hendricks and Kedan He for testing this activity and providing helpful feedback. We also thank the physical chemistry laboratory students at Fordham University and Whitman College for their participation.