Predicting odor from vibrational spectra: a data-driven approach

This section starts with describing the datasets and featurizing them, following that, classification, clustering, and saliency analysis methods are described.DatasetThe characterization of odorants involves the consideration of various qualitative descriptors, including hedonic attributes. Of the various odorant datasets with their odor labels available, we used a combined dataset from the training materials provided by Firmenich during the “Learning to Smell Challenge” with the Leffingwell PMP 200120. This combined dataset has 7,374 molecules with significant structural diversity and 109 distinct odor classes. We also used VS of odorant molecules, as receptors detect vibrational energy reflected in VS by the mechanism of IETS, as mentioned previously.Integrated dataset (IGD)Saini et al. compiled a dataset by combining two distinct sets of expertly labeled odor datasets20. These sets originated from the training materials provided by Firmenich during the “Learning to Smell Challenge” and Leffingwell PMP 2001. The resulting dataset, after merging and preprocessing, comprised 7374 molecules distributed across 109 unique odor classes, each with different sample counts, as illustrated in Fig. 1. The complete list of all 7374 odorant molecules can be found in Supplementary file 1.Subset of integrated dataset (Subset-IGD)The CAS Registry number or International Chemical Identifier (InChI) Code was generated for all IGD molecules by utilizing their SMILE code21,22. The conversion from SMILE to CAS or InChI was done using the NistChemPy Python library23. Subsequently, the Vibrational Spectra (VS) corresponding to the obtained CAS and InChI were acquired from the Chemistry Webbook hosted by the National Institute of Standards and Technology (NIST)24. Vibrational frequencies and intensities of all these molecules were determined with the B3LYP/6-31G* level of theory, a type of density functional theory (DFT), after optimizing the molecular structures for minimum energy. Gaussian 09 suite25 was used for DFT calculations. Although IGD had 7374 molecules, the molecules without VS were eliminated and 3018 molecules were left in Subset-IGD having 109 unique odors. Consequently, Subset-IGD emerged as a representative sub-dataset derived from IGD. Figure 1 shows the percentage of samples associated with odor classes in IGD and Subset-IGD, visually indicating that Subset-IGD is a representative sub-dataset of IGD. This is confirmed by the Kolmogorov-Smirnov (K-S) test26, which yielded a test statistic (D) of 0.0734 and a p-value of 0.93, signifying Subset-IGD to be a representative sub-dataset of IGD. All 3018 odorant molecules are listed in Supplementary file 1. Fig. 1There is a significant imbalance in associated samples of odor labels, with the “fruity” label appearing around 27% of the time, whereas the occurrence of “fennel” is less than 1% in the dataset.Featurising moleculesWe featurized molecules and their VS to get a meaningful numerical representation for training our classification models. Molecules were featurized using traditional daylight fingerprinting as they gave the best result on the used dataset in the previous studies. We experimented with the Morgan fingerprint, too, also known as the extended-connectivity fingerprint (ECFP4) having bit length of 2048 bits and the fragment radius was set to 2 (diameter 4)27, RDKit was used to generate these features but they performed poorly compared to Daylight fingerprints28; classification results of Morgan Fingerprint are detailed in supplementary file 5. Gaussian smoothing was performed on VS, followed by downsampling to get features. Smoothing broadens peaks in VS, and downsampling reduces the input dimension to the model, improving model performance29. Three channel images of VS were generated using GASF, GADF and MTF transformation; this allows the utilisation of advanced pattern recognition techniques, improving classification accuracy30. Hence, we had three types of features with which we carried out further analysis: Daylight Fingerprint Features (DFF), Gaussian Smoothed Dimensionally reduced features (GS_VS), Images presentation of VS and ResNet50 features (VS_IMG).Gaussian smoothed dimensionally reduced features for VS (GS_VS)For each spectrum, a set of wavenumbers (vwns) associated with a specific molecule is projected onto a linear bounded frequency scale (BFS), usually ranging from 1 to 4000 cm−1. This range is chosen to encompass all essential vibrational normal modes for subsequent analysis. Then, Gaussian smoothing is applied to the spectra. The BFS is normalized and sampled at fixed increments of L cm−1 for dimensionality reduction. The resulting spectra are utilized as features for the deep learning model, referred to henceforth as GS_VS29. The equation for a Gaussian kernel is:$$\begin{aligned} K_{\text {gaussian}} = \frac{1}{\sigma \sqrt{2\pi }} \exp \left( \frac{(x – \bar{x})^2}{2\sigma ^2}\right) \end{aligned}$$
(1)
A sigma value of 10 cm−1 and L set to 5 cm−1 were employed, resulting 800 (4000/L) descriptor variables. The smoothing procedure applies a smearing function that broadens the vibrational peaks, which allows for comparing different compounds with slightly different frequencies. Figure 2 shows an overview of this process. Turner et al. introduced EVA descriptor for spectra, which has been validated for use in QSAR studies31. Proposed GS_VS performs equally well as the EVA descriptor, but when 3-channel images are generated for the EVA descriptor, the model’s performance on them is significantly reduced. A possible reason for this poor performance could be that the intensity of peaks is not considered while generating them. Hence, we use GS_VS for our research.Fingerprint-based features (DFF)To generate features for our deep learning model, we used a traditional featurization technique: daylight fingerprinting using the Rdkit. These Daylight fingerprint features (DFF) were generated from the SMILE representation of the molecules. These 1024 binary fingerprints capture the presence or absence of specific chemical substructures within a molecule.Fig. 2Overview of feature generation from VS for molecule 2-methyl-phenol (shown in inset). (a) Projected VS onto a BFS (b) spectra obtained after Gaussian smoothing (c) normalised spectra with BFS sampled at 5 cm−1. Obtained spectra in (c) is further processed to generate image features using GAF and MTF (d1) polar coordinate mapping (d2) Markov transition matrix (e1) Gramian angular summation field (GASF) a 224X224X1 image (e2) Gramian angular difference field (GADF) a 224X224X1 image (e3) Markov transition field (MTF) a 224X224X1 image (f) GASF, GADF and MTF are overlayed to obtain a 224X224X3 image for each VS (g) pre-trained ResNet50 model is used to get flattened feature VS_IMG.Images presentation of VS and ResNet50 features (VS_IMG)The obtained GS_VS is encoded into images using two frameworks, Markov Transition Field (MTF) and Gramian Angular field (GAF)30. These methods aid in transforming the time series into an image format, enhancing the classification process. Three images Gramian angular summation field (GASF) a 224 × 224 × 1 image, Gramian angular difference field (GADF) a 224 × 224 × 1 image, and Markov transition field (MTF) a 224X224X1 image obtained from GS_VS are overlayed to obtain a 224X224X3 image for each VS. We use pre-trained ResNet50 network (on ImageNet) to obtain the flattened features (VS_IMG) for these 3-channel images32. This is achieved with the help of both Python/keras.applications.resnet50 and Python/keras models. Figure 2 shows an overview of the process of converting GS_VS to VS_IMG.Gramian angular field (GAF)To create a 2-D image, the first step is to transform the normalized time series data into polar coordinates. While mapping to the polar coordinate system, the normalized time series value is represented as the angular cosine, and the time stamps are illustrated as the radius in the polar coordinate system. For a time series \(Y = \{y_1, y_2, \ldots , y_n\}\), the angle (\( \theta _i \)) and radius (\( r_i \)) are computed using equations 2 and 3:$$\begin{aligned} \theta _{\text {i}} = \cos ^{-1}(\overline{Y}_{\text {i}}); \quad \text {where},\ 0 \le \theta _i \le \frac{\pi }{2} \end{aligned}$$
(2)
$$\begin{aligned} r_{\text {i}} = \frac{t_{\text {i}}}{N} \end{aligned}$$
(3)
Where \(\overline{Y}_{\text {i}}\) is normalized time series, \( N \) and \( t_i \) are the constant value and time stamp, respectively, that divide the polar coordinate’s time span into equal portions. When transitioning a time series from the Cartesian coordinate system to the polar coordinate system, the GAF transformation exhibits the property of bijectiveness, generating a unique value in the polar system and a distinct inverse value. Unlike the Cartesian coordinate system, the absolute temporal relationship is preserved within the polar coordinate system. Consequently, a Gramian matrix is constructed, where each matrix element corresponds to the trigonometric function of the sum or difference across different time intervals. By employing the following equations, GASF and GADF can be derived:$$\begin{aligned} GASF = \left[ \begin{array}{ccc} \cos (\theta _{1}+\theta _{1}) &{} \cdots &{} \cos (\theta _{1}+\theta _{n}) \\ \vdots &{} \ddots &{} \vdots \\ \cos (\theta _{n}+\theta _{1}) &{} \cdots &{} \cos (\theta _{n}+\theta _{n}) \end{array} \right] \\ GADF = \left[ \begin{array}{ccc} \sin (\theta _{1}-\theta _{1}) &{} \cdots &{} \sin (\theta _{1}-\theta _{n}) \\ \vdots &{} \ddots &{} \vdots \\ \sin (\theta _{n}-\theta _{1}) &{} \cdots &{} \sin (\theta _{n}-\theta _{n}) \end{array} \right] \end{aligned}$$Markov transition field (MTF)Dividing a data series \(Y = \{y_1, y_2, \ldots , y_n\}\) evenly into Q bins, each \(y_i\) is associated with a bin \(q_j\) (\(j \in [1, Q]\)). Hence, by using a first-order Markov chain, a Markov matrix Z of size \(Q \times Q\) can be built. It makes the following expression: \(z_{\text {ij}}=p\{ y_{\text {t}}\in q_{\text {i}}|y_{\text {t-1}} \in q_{\text {j}}\}\)$$\begin{aligned} Z = \left[ \begin{array}{ccc} z_{\text {11}} &{} z_{\text {12}} \cdots &{} z_{\text {1Q}}\\ z_{\text {21}} &{} z_{\text {22}} \cdots &{} z_{\text {2Q}}\\ \vdots &{} \ddots &{} \vdots \\ z_{\text {Q1}} &{} z_{\text {Q2}} \cdots &{} z_{\text {QQ}}\\ \end{array} \right] \\ \end{aligned}$$The likelihood of the point y, presently situated in bin \(q_i\), transitioning to bin \(q_j\) in the subsequent time is represented by \(z_{ij}\). Evidently, \(z_{ij}\) fulfils the relation: \(\sum _{j=1}^{Q} z_{ij} = 1\). Due to the memorylessness property of the Markov chain, the current elements in the matrix Z are independent of the previous element, leading to a notable loss of pertinent data. To address this constraint, the matrix Z is expanded to the Markov transform field N by introducing a time axis. The data point at time stamps i and j is associated with respective bins: \(q_i\) and \(q_j\), following the division of the dataset into Q bins along the time axis. The transition probability from \(q_i\) to \(q_j\) is denoted by \(N_{ij}\) and expressed as follows:$$\begin{aligned} N = \left[ \begin{array}{ccc} z_{\text {ij}} | x_{\text {1}} \in q_{\text {i}},x_{\text {1}} \in q_{\text {j}} &{} \cdots &{} z_{\text {ij}} | x_{\text {1}} \in q_{\text {i}},x_{\text {n}} \in q_{\text {j}}\\ z_{\text {ij}} | x_{\text {2}} \in q_{\text {i}},x_{\text {1}} \in q_{\text {j}} &{} \cdots &{} z_{\text {ij}} | x_{\text {2}} \in q_{\text {i}},x_{\text {n}} \in q_{\text {j}}\\ \vdots &{} \ddots &{} \vdots \\ z_{\text {ij}} | x_{\text {n}} \in q_{\text {i}},x_{\text {1}} \in q_{\text {j}} &{} \cdots &{} z_{\text {ij}} | x_{\text {n}} \in q_{\text {i}},x_{\text {n}} \in q_{\text {j}} \end{array} \right] \end{aligned}$$Data analysisTo perform a systematic analysis of VS both clustering and classification were performed, and also saliency analysis of the model was done to explain model predictions.Classification modelsThe histogram plot in Fig. 1 indicates that fruity is the highest occurring label, and fennel/cedar has the lowest occurrence in our dataset. As apparent, the distribution has a heavy class imbalance and is highly skewed.To ensure a general representation of data in both the training and test datasets, we opted for iterative stratified sampling instead of random sampling for the dataset split. This method was executed using the iterative_train_test_split class available in the scikit-multilearn library33. Choosing random sampling unintentionally led to the omission of minority odor descriptors, worsening the pre-existing label imbalance. Visualization in Fig. 3a and b illustrates the data split into test and training sets using random and iterative sampling, respectively. The line charts present a comparative analysis between random and stratified splits, where the non-overlap of the orange and blue line chart indicates the relative occurrence disparity of a specific label in the training and test sets. The findings highlight that stratified sampling yields a better representative test and training set split.Fig. 3Data split into test and training sets using random and iterative sampling.Given the considerable class imbalance in our dataset, we employed a cost-sensitive multilayer perceptron (CSMLP) for this study. The conventional multilayer perceptron (MLP) trained through the backpropagation of error algorithm assumes equal misclassification costs (false negative and false positive). However, recognizing that a false negative is more detrimental or costly than a false positive34, we introduced the CSMLP to address this issue. In the proposed CSMLP, we adapted the MLP’s loss function to accommodate class imbalance. To enhance the penalty for mistakes on minority class samples, we utilized a combination of weighted binary cross entropy (WBCEL) and focal loss. In WBCEL, assigned weights are proportional to the \(\delta \) power of the inverse of class frequencies35,36,37, thereby imposing a higher penalty for errors on the minority class. Focal loss assigns a weight to each sample based on its difficulty, which is measured in terms of the loss incurred by the CSMLP on that particular sample38. Samples with higher losses are regarded as more challenging. The equations for Focal Loss (FL), Weighted Binary Cross-Entropy Loss (WBCEL), and Combined Loss (CL) are as follows:$$\begin{aligned}{} & {} \text {FL} = -\frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{C}((1 – y_{\text {pred,ij}})^\gamma \cdot y_{\text {true,ij}}\cdot \log (y_{\text {pred,ij}}+\varepsilon )+(y_{\text {pred,ij}})^\gamma \cdot (1 – y_{\text {true,ij}})\cdot \log (1 – y_{\text {pred,ij}}+\varepsilon )) \end{aligned}$$
(4)
$$\begin{aligned}{} & {} \text {WBCEL} = -\frac{1}{N}\sum _{i=1}^{N}\sum _{j=1}^{C}(w_{\text {pos,j}}^\delta \cdot y_{\text {true,ij}}\cdot \log (y_{\text {pred,ij}}+\varepsilon ) + w_{\text {neg,j}}^\delta \cdot (1 – y_{\text {true,ij}})\cdot \log (1 – y_{\text {pred,ij}}+\varepsilon )) \end{aligned}$$
(5)
$$\begin{aligned}{} & {} CL = WBCEL + \lambda *FL \end{aligned}$$
(6)
\(\varvec{\lambda }\) is the hyper-parameter that controls the importance of each loss.\(\varvec{\gamma }\) is focusing paramerter in FL and \(\delta \) is used for smoothing the effect of class weight.C is number of classes and N is number of samples in a batch. \({\textbf {y}}_{{{\textbf {true,ij}}}}\) true label for jth class of ith example.\({\textbf {y}}_{{{\textbf {pred,ij}}}}\) predicted probability for jth class of ith example.\(\varvec{\varepsilon }\) is a small constant to avoid numerical instability.\({\textbf {w}}_{{\textbf {pos,j}}}\) = (Number of -ve labels for class j)/(Total number of samples in dataset)\({\textbf {w}}_{{\textbf {neg,j}}}\) = (Number of +ve labels for class j)/(Total number of samples in dataset)We first trained unimodal baseline models on PCA-reduced VS_IMG, DFF of Subset-IGD and IGD, and GS_VS features alone. We then developed multimodal fusion model that learn jointly from PCA-reduced VS_IMG and DFF. Figure 4 shows an overview of the fusion model.All these models are CSMLP with fully connected layers, for activation function we have used Relu, and at the final layer, we used sigmoid. Models were trained with Adam39(keras-optimizer), and dropout was used as a regularizer to prevent overfitting. Along with CSMLP, we tried random resampling techniques: Multi-Label Random Under-Sampling (ML-RUS) and Multi-Label Random Over-Sampling (ML-ROS). However, we did not observe any significant improvement in results. We leave the detailed architecture of all models in the supplementary file 2 and results of resampling in the supplementary file 5.Fusion modelFusion model learns from PCA-reduced VS_IMG concatenated by PCA-reduced DFF to produce a final prediction. Figure 4 shows an overview of the fusion model40. We generated 6 different sets of PCA-reduced DFF, with percentage variance ranging from 70 to 95%, and when concatenated with PCA-reduced VS_IMG, we saw an increase in the F1 score with increased variance. Figure 5a shows value of F1 score for different values of variance.Fig. 4Diagram of fusion architecture that learns jointly from VS_IMG and DFF. (a) VS_IMG a 224X224X3 image obtained by overlaying GASF-GADF-MTF (b,c) ResNet50 pre-trained model is then used to extract features from VS_IMG (d) these flattened features are then dimensionally reduced to 3036 using PCA (e) DFF for same molecule obtained from RDKit (f) DFF are dimensionally reduced using PCA (g) PCA reduced features from (d), (f) are concatenated and later classified using CSMLP.Dimensionality reduction and clusteringFeatures obtained from GS_VS and VS_IMG are high dimensional; reducing the dimensionality to 2D helps visualise compounds in a 2D space and observe their distribution. PCA, t-SNE, UMAP and MDS were used for dimensionality reduction, and then clustering was performed using AHC and k-mean clustering.Dimensionality reductionFour-dimensionality reduction techniques were applied on GS_VS and VS_IMG: PCA, t-SNE, UMAP and MDS. All were applied using Python 3.7.6 packages. Principal Component Analysis (PCA) is a multivariate analysis technique designed to extract essential information through an orthogonal transformation41. This transformation generates correlated variables known as principal components, which are new linearly independent variables capturing the key features of the original data. Multidimensional Scaling (MDS) functions as a technique for network localization, revealing the relationships, whether similarity or dissimilarity, among pairs of objects within a given dataset. This information is then translated into distances between points within a two-dimensional space42. The t-SNE method, an evolution of Stochastic Neighbor Embedding (SNE), gauges the similarities between pairs of high-dimensional data objects and their subsequent two-dimensional embedding. Utilizing gradient descent, t-SNE minimizes Kullback-Liebler divergence to generate a two-dimensional embedding43. We followed the protocol suggested by Arora et. al. for better t-SNE visualization44. UMAP, a recent dimension reduction technique, accurately captures non-linear structures in large datasets45. Based on algebraic topology and Riemannian geometry, UMAP constructs a manifold. Our visualization process entails a delicate balance between local and global information, a balance contingent on the careful selection of values for two key parameters: the “minimum distance” and “number of neighbors”. The “number of neighbors” parameter plays a pivotal role in balancing local and global structural aspects, while the “minimum distance” parameter governs the grouping of points within the low-dimensional representation. Following extensive testing and evaluation of various values, we settled on fixed values of 15 for the number of neighbors and 0.1 for the minimum distance.ClusteringAgglomerative Hierarchical Clustering (AHC) and K-means were employed on the reduced dimensions derived from the four preceding techniques. The objective was to categorize similar odor molecules into distinct groups. Previous studies have undertaken clustering analyses based on fingerprint features of odor molecules46, yet none have specifically addressed the clustering of large VS datasets. The clustering process was executed using Python packages on the two-dimensional data to tackle challenges associated with high-dimensional clustering47. For Agglomerative Hierarchical Clustering (AHC), the Euclidean distance matrix (2×2) for each molecule was computed, employing the “ward.D2” aggregative criterion. This criterion aims to minimize the inertia within classes while maximizing the inertia between classes. The clustering process involved successively grouping the two closest classes until a complete clustering tree was obtained. While hierarchical clustering is known for its simplicity and versatility in handling various similarity or distance measures, the irreversible merging of clusters restricts the ability to correct erroneous decisions48,49. In the case of k-means clustering, various numbers of centroids corresponding to the desired cluster count were experimented. Points were assigned to their nearest cluster during each iteration, and the algorithm iteratively refined cluster centroids until no further changes occurred48,50. It’s worth noting that the k-means clustering algorithm is sensitive to outliers and may be less effective when dealing with clusters that deviate from hyper-spherical shapes.To ascertain the most suitable number of clusters, we constructed an “elbow” curve illustrating intra-cluster variability against the number of clusters for each dimensional reduction technique. Following the establishment of clusters using either k-means or AHC, we examined the distribution of odor notes within these clusters. Additionally, we explored the chemical groups or functions of molecules present in different clusters.Feature Importance analysisThe explainability of the classification model is very important for developing vibration-based biomimetic sensors. We used two techniques for feature importance analysis: Permutation feature importance (PFI) and GradCam++51,52. Both are model-agnostic methods and do not require model retraining. PFI can capture variable interaction, too. GradCam++ provides visual explanations of CNN model predictions in terms of better object localization than the state-of-the-art.Permutation feature importance (PFI)For the model trained on GS_VS features we performed feature importance analysis using PFI. Let f is trained/tested uni-modal model on GS_VS and its error measure L(y, f). Then feature importance as the extent to which a feature \(X_{i}\) affects L[y, f(X)], on its own and through its interactions with \(X_{i}\). Permutation importance was first introduced by Breiman (Breiman, 2001)53, the present study uses its model-agnostic version later proposed by Fisher, Rudin, and Dominici (2018)51. Let \(e_{orig}=L(y, f(X)))\) be error with original data, and \(e_{perm}=L(f(y, X_{j,perm})))\) be error with jth feature permuted then feature importance \(FI_j\) is given by equation: \(FI_j = e_{\text {orig}} – e_{\text {perm}}\)GradCAM++For the model trained on VS_IMG, we conducted saliency analysis employing GradCAM++. GradCAM++52 computes the second-order derivative of the predicted probability for the smell category concerning the final convolutional layer:$$\begin{aligned} A_{c}(\textbf{x})=\sum _{i}\sum _{j}\alpha _{ij}^{c}\phi _{i}(\textrm{x};w_{i})\phi _{j}(\textbf{x};w_{j}) \end{aligned}$$
(7)
where \(\phi _i\) and \(\phi _j\) represent the activation maps of the ith and jth feature maps, and \(w_i\) and \(w_j\) denote the weights associated with the ith and jth feature maps. Additionally, \(\alpha _{ij}^c\) represents the weight derived from the global average pooling of gradients across the spatial dimensions.Subsequently, the gradient-weighted class activation maps undergo global average pooling, given by equation 8, Z represents the number of spatial locations in the feature map. Finally, the heatmap is generated, as given in equation 9:$$\begin{aligned} \alpha _{c}^{\text {GradCAM}++}(\textbf{x})=\frac{1}{Z} \sum _{i} \sum _{j} L_{c}^{\text {GradCAM}++}(\textbf{x})_{i j} \end{aligned}$$
(8)
$$\begin{aligned} L_{c}^{\text {GradCAM}++}(\textbf{x})=\text {ReLU}\left( \sum _{k} \alpha _{k}^{c}(\textbf{x}) \phi _{k}(\textbf{x}; w_{k})\right) \end{aligned}$$
(9)
The resultant heatmap in GradCAM++ signifies the regions in VS_IMG that held the utmost significance for the smell category prediction. A higher value at a specific location indicates the increased importance of that location in the classification process.

Hot Topics

Related Articles