A deep position-encoding model for predicting olfactory perception from molecular structures and electrostatics

In this section, we first describe the dataset used in this work. We then introduce Coulomb-GCN, which updates GCN by replacing the adjacency matrix with the Coulomb matrix (CM). After verifying the effectiveness of Coulomb-GCN, we obtain Mol-PECO by further replacing the random atom embedding in Coulomb-GCN with a positional encoding, where the spectral information of the CM is employed to produce a structure-aware embedding.

Dataset: a comprehensive human olfactory perception dataset

In this work, we use data in which each molecular structure is paired with multiple odor descriptors (Fig. 1a). The dataset is compiled from ten expert-labeled sources: Arctander’s dataset (n = 3102)24, AromaDb (n = 1194)25, FlavorDb (n = 525)26, FlavorNet (n = 718)27, Goodscents (n = 6158)28, Fragrance Ingredient Glossary (n = 1135)29, Leffingwell’s dataset (n = 3523)30, Sharma’s dataset (n = 4006)31, OlfactionBase (n = 5105)32, and Sigma’s Fragrance and Flavor Catalog (n = 871)33. These datasets were retrieved from the archive at https://github.com/pyrfume/pyrfume-data. The data cleaning procedure includes (1) merging overlapping molecules, (2) filtering conflicting descriptors, and (3) filtering rare descriptors assigned to fewer than 30 molecules. After data cleaning, we obtain a comprehensive dataset of 8503 molecules and 118 odor descriptors.

This comprehensive human olfactory perception dataset is multilabeled, with every molecule labeled with one or several odor descriptors. The distribution of molecules across odor descriptors is imbalanced: 112 of the 118 odor descriptors are each assigned to ≤800 molecules, whereas the other 6 descriptors are linked to >800 molecules (Fig. 2a). The distribution of descriptors across molecules is also skewed, with 8054 molecules possessing ≤5 odor descriptors and 449 molecules possessing >5 odor descriptors (Fig. 2b).
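The rare-descriptor filtering in cleaning step (3) can be sketched as follows. This is an illustrative reconstruction, not the authors' pipeline code; the function name and data layout are our own assumptions.

```python
from collections import Counter

def filter_rare_descriptors(records, min_count=30):
    """Drop odor descriptors assigned to fewer than `min_count` molecules;
    molecules left with no descriptors are removed as well.

    `records` maps a molecule identifier (e.g., a SMILES string) to the
    set of odor descriptors assigned to it.
    """
    counts = Counter(d for descs in records.values() for d in descs)
    kept = {d for d, c in counts.items() if c >= min_count}
    return {mol: descs & kept for mol, descs in records.items() if descs & kept}

# Toy example: "rare" appears on only one molecule and is dropped.
toy = {"CCO": {"sweet", "rare"}, "CCC": {"sweet"}, "CCCC": {"sweet"}}
print(filter_rare_descriptors(toy, min_count=2))
```

On the real dataset this would be run with min_count=30, yielding the 118 retained descriptors.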
Regarding co-occurrence, the descriptors “fruity”, “green”, “sweet”, “floral”, and “woody” co-occur with almost all other descriptors, while “odorless” co-occurs with no other descriptor (Fig. 2c). The data split is built by second-order iterative stratification34, which is designed to split a multilabel dataset while preserving the label ratios in each split through an iterative sampling procedure. The whole dataset is split into train/validation/test sets of 6802/864/837 pairs, respectively.

Fig. 2: The comprehensive human olfactory dataset. a Distribution of molecules across odor descriptors. b Distribution of descriptors across molecules. c Co-occurrence matrix of 118 odor descriptors. The heatmap is shown with a logarithmic transformation, and the descriptors are ordered alphabetically.

Fully connected graph by Coulomb matrix is superior to sparse graph by the adjacency matrix

We calculate the CM, which models atomic energies with the internuclear Coulomb repulsion operator21,35, and use it as our molecular representation (Supplementary Notes 1.1 and 1.2). In the CM, the diagonal entries correspond to a polynomial fit of atomic energies, and the off-diagonal entries represent the Coulomb repulsion between atomic nuclei. Although the adjacency matrix has been widely used in molecular modeling36,37, the CM as an emerging molecular representation offers at least two advantages: (1) the CM alleviates oversquashing by allowing direct paths between distant nodes in the fully connected graph representation (Fig. 3a); (2) the Frobenius-norm distance between the CM and the adjacency matrix is 5–10 times smaller than that between a randomly initialized matrix and the adjacency matrix, indicating that the CM is fully connected while preserving similarity to the adjacency matrix (Fig. 3b).

Fig. 3: The motivation and workflow of modeling the Coulomb matrix. a An example of the Coulomb matrix as a fully connected graph for propargyl alcohol, a clear colorless liquid with a geranium-like odor. The molecule’s 3D structure image can be downloaded from PubChem. b Similarity between the Coulomb matrix and the adjacency matrix, indicated by their distances, calculated with the Frobenius norm. c The workflow of modeling the Coulomb matrix with graph neural networks.

We build a nonlinear map (named Coulomb-GCN) between molecular structures and human olfactory perception (Fig. 3c) by replacing the adjacency matrix in the message passing of GCN with the CM. Specifically, starting from a random atom embedding, the learned atom embedding is obtained by message passing on the fully connected molecular graph, with neighbor weights specified by the entries of the CM. The molecular embedding is extracted by sum-pooling and fed to a multilabel classification module to predict the 118 odor descriptors. Given the gap between the maximal and minimal entries in the CM, normalization of the entries may affect performance; we therefore test min–max and Frobenius normalizations in a matrix-wise manner.

We evaluated and compared the prediction accuracy of GCN with the adjacency matrix and that of Coulomb-GCN with the different normalizations of the CM (Table 1). Compared to the GCN with the adjacency matrix (AUROC of 0.678), gains in AUROC are observed for Coulomb-GCN with Frobenius normalization (AUROC of 0.759) and min–max normalization (AUROC of 0.713). Coulomb-GCN with Frobenius normalization also achieves higher performance in five out of six evaluation metrics (Table 1): AUROC (improved from 0.678 to 0.759), AUPRC (improved from 0.111 to 0.143), specificity (improved from 0.625 to 0.744), precision (improved from 0.079 to 0.089), and accuracy (improved from 0.726 to 0.780).

Table 1 Prediction performances of the Coulomb matrix with min–max and Frobenius normalizations and the adjacency matrix in GCN

Positional encoding by Laplacian eigenfunctions improves prediction accuracy

The graph Laplacian and its spectral information enable us to characterize the global structures and substructures of graphs38,39,40.
Specifically, the graph Laplacian is defined as L = D − A, where D and A denote the degree and adjacency matrices, respectively. L is positive semi-definite, with one trivial zero eigenvalue and the remaining eigenvalues nontrivial. In this work, the Laplacian defined by the CM, L = D − X, where X denotes the weighted matrix (the CM here), acts as an extension of the ordinary graph Laplacian and possesses the same properties (e.g., it is symmetric and positive semi-definite). In particular, the eigenvectors of L provide an optimal solution to the Laplacian quadratic form (\({f}^{T}Lf=1/2{\sum }_{i,j}X(i,j){({f}_{i}-{f}_{j})}^{2}\))38,39,40, encoding the geometric information of graphs.

Given these properties of the graph Laplacian, we use the Laplacian eigenfunctions of the CM to encode the positional information of molecular graphs. Typical results for a cyclic odorant (5-pentyloxolan-2-one, flowing from left to right in λ1) and acyclic odorants (hexyl 3-methylbutanoate, flowing from left to right in λ1, and heptyl pentanoate, flowing from right to left in λ1) demonstrate the information carried by low-frequency eigenfunctions (Fig. 4a). Combining this positional encoding with Coulomb-GCN, we construct the deep learning framework named Mol-PECO (Fig. 4b), with the fully connected molecular representation given by the CM and the positional encoding given by the Laplacian. We introduce a learned positional encoding (LPE) by Transformer17 to build the atom embedding (Supplementary Note 2.1). Specifically, the LPE starts by concatenating the p lowest eigenvalues and the corresponding eigenvectors as the input matrix Λ ∈ Rp×2, and learns the encoding with a Transformer for every atom17. We obtain an AUROC of 0.796 and an AUPRC of 0.153 with the LPE of the raw CM. We further perform experiments with the LPE of the asymmetrically normalized CM and obtain additional gains of 0.017 in AUROC and 0.028 in AUPRC.

Fig. 4: The motivation and architecture of Mol-PECO. a Structural information carried by the Laplace spectrum of the Coulomb matrix: low-frequency eigenvectors, calculated with the graph Laplacian, serve as the input matrix for positional encoding; three examples, including cyclic and acyclic molecules, show the eigenvalues λi and eigenvectors ϕi of molecular graphs (i ∈ {1, 2, 3}). The color indicates the value of each component (node) of the eigenvectors. b The architecture of Mol-PECO. Mol-PECO learns the positional encoding (LPE) with a Transformer on the graph Laplacian, and updates the atom embedding with a GCN on the Coulomb matrix and the LPE. The GCN is implemented with skip connections to relieve oversmoothing, and the Coulomb matrix, as a fully connected graph representation, suppresses oversquashing through direct connections between nodes. From the updated atom embedding, Mol-PECO extracts the molecular embedding by sum-pooling and predicts the 118 odor descriptors with a neural network on the molecular embedding. In this work, p and d are set to 20 and 32, respectively.

We compare Mol-PECO with baseline models (Table 2): the conventional GCN on graph representations, including the adjacency matrix and the CM, and classifiers on fingerprint representations, including Mordred features (mordreds)9, bit-based fingerprints (bfps)8, and count-based fingerprints (cfps)8. The conventional classifiers include k-nearest neighbors (KNN), random forest (RF), and gradient boosting (GB). For the fingerprint methods, we first handle the imbalanced label distribution with the Synthetic Minority Over-sampling Technique (SMOTE)41 and then perform the classification. Mol-PECO outperforms the baselines in three out of six evaluation metrics (Table 2), with AUROC improved from 0.761 (cfps-KNN) to 0.813, AUPRC improved from 0.144 (mordreds-RF) to 0.181, and accuracy improved from 0.780 (Coulomb-GCN) to 0.808.
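The computational pipeline sketched in Fig. 4b (Coulomb matrix → Laplacian positional encoding → message passing with skip connections → sum-pooling) can be illustrated in NumPy. This is a minimal reconstruction under stated assumptions, not the authors' implementation: the real Mol-PECO learns the positional encoding with a Transformer and trains everything end-to-end with p = 20 and d = 32, whereas here the projection and layer weights are random stand-ins and the ReLU skip-connection layer is our own simplification.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Standard Coulomb matrix: 0.5*Z_i^2.4 on the diagonal (polynomial fit
    of atomic energies), Z_i*Z_j/|R_i - R_j| off-diagonal (Bohr units)."""
    Z, R = np.asarray(Z, float), np.asarray(R, float)
    D = np.linalg.norm(R[:, None] - R[None, :], axis=-1)
    np.fill_diagonal(D, 1.0)                   # avoid divide-by-zero
    C = np.outer(Z, Z) / D
    np.fill_diagonal(C, 0.5 * Z ** 2.4)
    return C

def frobenius_normalize(C):
    """Matrix-wise Frobenius normalization."""
    return C / np.linalg.norm(C)

def lpe_input(C, p=2):
    """Per-atom input for the learned positional encoding: the p lowest
    eigenpairs of L = D - C, stacked as an (n_atoms, p, 2) array."""
    L = np.diag(C.sum(axis=1)) - C
    w, V = np.linalg.eigh(L)                   # eigenvalues in ascending order
    return np.stack([np.column_stack([w[:p], V[i, :p]]) for i in range(len(C))])

def gcn_layer(C, H, W):
    """Message passing with Coulomb neighbor weights and a skip
    connection: H' = ReLU(C H W) + H."""
    return np.maximum(C @ H @ W, 0.0) + H

# Toy run on a fictitious 3-atom "molecule" (coordinates are arbitrary).
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [1.8, 0.0, 0.0], [-0.45, 1.75, 0.0]]
C = frobenius_normalize(coulomb_matrix(Z, R))

d = 4                                          # embedding size (paper: 32)
rng = np.random.default_rng(0)
H = lpe_input(C, p=2).reshape(3, -1) @ rng.standard_normal((4, d))
for W in [rng.standard_normal((d, d)) for _ in range(2)]:
    H = gcn_layer(C, H, W)
mol_embedding = H.sum(axis=0)                  # sum-pooling over atoms
print(mol_embedding.shape)                     # (4,)
```

In the full model, the pooled embedding would then feed the multilabel classifier head over the 118 descriptors.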
In particular, Mol-PECO balances AUROC (0.813) and AUPRC (0.181), whereas the ML method with the highest AUROC (cfps-KNN, 0.761) has a low AUPRC of 0.057, and the one with the highest AUPRC (cfps-RF, 0.144) shows a low AUROC of 0.723. Moreover, Mol-PECO also shows superior AUROC and AUPRC compared with a recently published graph convolution model (Supplementary Note 2.4). Thus, Mol-PECO boosts the predictability of QSOR.

Table 2 Prediction performances of conventional classifiers with ML methods, GCN with adjacency and Coulomb matrices, and Mol-PECO

Prediction accuracy and predictability of individual descriptors

We scrutinize the predictions of Mol-PECO by computing the AUROC and AUPRC for each descriptor and contrasting the results with those of the other GCN models (Supplementary Note 2.2). Figure 5a shows the scores of individual descriptors obtained by Mol-PECO. Descriptors exhibiting high performance, demarcated by the yellow region, predominantly include common descriptors such as “odorless”, “fruity”, “floral”, “green”, and “woody”, with the exception of “alcoholic”.

Fig. 5: The scores of individual descriptors obtained by Mol-PECO, adjacency-GCN, and Coulomb-GCN. a The scores of individual descriptors obtained by Mol-PECO. Each point corresponds to the score (AUPRC, AUROC) of one descriptor. The diameter of each point is proportional to the square root of the frequency of the descriptor in the dataset. The yellow region indicates highly predictable descriptors, whereas the red region contains descriptors that are difficult for Mol-PECO to predict from chemical structures. b, c Comparisons of scores obtained by Mol-PECO (blue points) with those obtained by (b) adjacency-GCN (purple points) and (c) Coulomb-GCN. Each arrow links the score of Mol-PECO with that of adjacency-GCN or Coulomb-GCN for each descriptor.
Arrows are colored red if both the AUPRC and AUROC of Mol-PECO are lower than those of adjacency-GCN or Coulomb-GCN.

Mol-PECO’s average performance enhancement is ascribed to its capacity to raise the scores across a broad spectrum of infrequent descriptors, demonstrating its versatility. This is in stark contrast to adjacency-GCN, which is limited to predicting only common descriptors (Fig. 5b). The limitations of adjacency-GCN are somewhat ameliorated in Coulomb-GCN (Fig. 5c), supporting the hypothesis that oversquashing hampers the efficient training of conventional GCNs. Additionally, the observed decrease in Mol-PECO’s performance on common descriptors underscores a trade-off in prediction accuracy between a limited set of frequent descriptors and a diverse array of infrequent ones.

Moreover, Mol-PECO identifies descriptors that are challenging to predict from chemical structures alone, as illustrated by the red region in Fig. 5a. These descriptors fall into three classes: (1) descriptors pertaining to senses other than olfaction, such as “sweet” (taste), “creamy” (touch), and “metallic” (touch, temperature); (2) conceptual and polysemous descriptors, including “bland”, “fresh”, “earthy”, and “aromatic”; and (3) categorically complex descriptors such as “fishy”, “musty”, “herbaceous”, “herbal”, and “cortex”.

The first class of descriptors may be perceived as olfactory properties due to the associative learning of olfaction with other sensory modalities; consequently, predicting them solely from chemical structures is significantly difficult or may even be impossible. The second category encompasses conceptual descriptors such as “bland”, indicative of potentially combinatorial attributes in odorant mixtures. The third category, while more specific than the second, comprises descriptors delineating complex entities such as fish, mold, and skin.
Given the abstract nature of the second and third categories, their accurate prediction may necessitate extensive datasets, akin to the approach employed in training large language models.

External test with the DREAM dataset

To further investigate Mol-PECO’s generalization ability, we assess the trained model’s performance on an external test set, the DREAM dataset. The DREAM project measured 49 volunteers’ perception scores (from 0 to 100) for molecules at 2 dilutions (low and high concentration). After transforming the regression annotations into binary labels (Supplementary Note 2.3), we obtain the final DREAM test set: 13 molecules with 2 odor descriptors in the low-concentration set and 31 molecules with 3 odor descriptors in the high-concentration set.

We assess Mol-PECO on the DREAM dataset at high and low concentrations independently. At low concentration, Mol-PECO achieves perfect performance (AUROC/AUPRC of 1.00/1.00), and all “sweet”/“garlic” molecules are correctly distinguished from the rest (Table 3). At high concentration, the “burnt” molecule is correctly predicted with an AUROC of 1.00, while “fruity” and “sweet” molecules achieve AUROCs of 0.88 and 0.60, respectively (Table 3).

Table 3 Performances of Mol-PECO on the external DREAM dataset

Learned odor space by Mol-PECO

To elucidate the structural organization of multiple odors in relation to descriptors, we conducted dimensionality reduction on the output of Mol-PECO’s penultimate layer to construct a latent odor space, which was then evaluated at both global and local scales. Globally, the analysis assessed the extent to which clusters of odors within this latent space encapsulate descriptor information. Locally, the investigation centered on whether individual molecules are characterized by a set of odor descriptors similar to those of proximal molecules.

For global structure, Fig. 6a, b illustrates the distribution patterns of the frequent descriptors “fruity”, “floral”, “green”, and “woody”, which are among the top six most frequently assigned descriptors in the dataset. Molecules with these descriptors exhibit dispersed distributions across the odor space, yet distinct regions can be identified where some descriptors overlap. On the periphery of the space, more defined clusters are formed by specific classes of descriptors, such as “odorless”, “fatty”, “sulfurous”, “ethereal”, and “musk”, as shown in Fig. 6c, d (see Supplementary Note 2.5 for the descriptors not shown in Fig. 6d). These descriptors are associated with high scores by Mol-PECO.

Fig. 6: The odor space built from Mol-PECO and its global property, obtained by dimensionality reduction with t-SNE. Each gray dot corresponds to an individual molecule. a The global distribution of molecules, highlighting those associated with the representative common descriptors “fruity”, “green”, “floral”, and “woody”. b The distribution of molecules corresponding to each descriptor in (a); molecules to which the descriptor is assigned are marked in red, and those to which it is not assigned are marked in gray. c The global distributions of molecules associated with the descriptors that form distinct clusters. d The distribution of molecules corresponding to a selected descriptor representative of each cluster identified in (c). e The distributions of molecules associated with descriptors with low prediction scores by Mol-PECO: “sweet”, “musty”, “herbaceous”, and “metallic”.

Moreover, in the context of structure-correlated descriptors, molecules linked with “alliaceous”, “garlic”, or “onion” congregate within the “sulfurous” cluster (Fig. 6c). This cluster aligns with prior research on the olfactory characteristics of sulfur-containing compounds42,43.
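The two-dimensional maps in Fig. 6 are t-SNE projections of the penultimate-layer outputs. A minimal scikit-learn sketch follows, using random stand-in embeddings; the paper does not report its t-SNE settings, so the perplexity value here is an arbitrary assumption.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for Mol-PECO's penultimate-layer outputs:
# 60 "molecules" in a 32-dimensional latent space.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((60, 32))

# Project to 2-D for visualization; perplexity must be < n_samples.
xy = TSNE(n_components=2, perplexity=5.0, random_state=0).fit_transform(embeddings)
print(xy.shape)  # (60, 2)
```

Each row of `xy` would then be plotted as one gray dot, colored by the descriptor under inspection.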
In the case of synonymous descriptors, molecules associated with “oily” and “waxy” cluster together with “fatty”, corroborating their semantic similarity (Fig. 6c). These clustering patterns visually substantiate the efficient representation of odorant molecules within the odor space learned by Mol-PECO, as quantified by its enhanced performance metrics in Fig. 5.

Conversely, molecules associated with lower-scored descriptors such as “sweet”, “musty”, “herbaceous”, and “metallic” are distributed across the space without discernible pattern (Fig. 6e). This uniform distribution underscores the difficulty of representing these descriptors within the learned space. Finally, it should be noted that dimensionality reduction of the odor space to two dimensions via principal component analysis (PCA) does not reveal specific patterns or structures. This observation indicates that the learned odor space possesses a high-dimensional, nonlinear structure that accommodates molecules with a wide array of structural characteristics; thus, the two-dimensional projection by t-SNE delineates only limited aspects of the learned odor space.

For local structure, we investigate one odorless molecule (triphenylphosphane) and one odorant molecule (1-(1-sulfanylpropylsulfanyl)propane-1-thiol, with descriptors “alliaceous”, “fruity”, “garlic”, “green”, “onion”, and “sulfurous”) as examples (Fig. 7a, b). We compare the top-5 nearest molecules retrieved by Mol-PECO’s embedding and by bfps, calculated with cosine similarity and Tanimoto similarity, respectively. For the odorless molecule, Mol-PECO retrieves all five neighbors with the odorless descriptor, whereas bfps retrieves none. Notably, the molecules fetched by Mol-PECO possess substructures different from those of the reference (e.g., all contain a C–Cl bond, and the top-2/3/5 contain a carbonyl functional group).
For the odorant molecule, Mol-PECO retrieves all five neighbors with shared descriptors, whereas bfps retrieves four. Moreover, all of the molecules retrieved by bfps are open-chain structures, the same as the reference, whereas Mol-PECO retrieves molecules with quite different structures, four of which are cyclic. Both examples indicate Mol-PECO’s promising potential in decoding molecules with different structures but identical smells. Nevertheless, this capacity to generalize across structural features still has room for improvement within the current framework. The limitation is illustrated by the categorization of molecules with a musk-like aroma, a category encompassing molecules with disparate functional groups yet an identical smell: Mol-PECO tends to assign high scores to macrocyclic musks, whereas it fails to identify nitro-musk compounds as musk-scented (Supplementary Note 2.6). This discrepancy underscores a critical area for improving the algorithm’s generalization across varied structural features while maintaining accurate scent classification.

Fig. 7: Local view of the learned odor space investigated with nearest-neighbor retrieval. a Retrieved molecules for an odorless molecule by Mol-PECO and the best fingerprint method. b Retrieved molecules for an odorant molecule by Mol-PECO and the best fingerprint method.
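The two retrieval schemes behind Fig. 7 can be sketched as below: cosine similarity for learned embeddings and Tanimoto similarity for bit fingerprints, as named in the text. The helper names and toy fingerprints are our own illustrative assumptions.

```python
import numpy as np

def cosine_topk(query, bank, k=5):
    """Top-k neighbors of `query` in `bank` by cosine similarity
    (used here for learned, real-valued embeddings)."""
    q = query / np.linalg.norm(query)
    B = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = B @ q
    return np.argsort(-sims)[:k], np.sort(sims)[::-1][:k]

def tanimoto_topk(query, bank, k=5):
    """Top-k neighbors by Tanimoto similarity |A&B|/|A|B|
    (used here for binary bit fingerprints)."""
    inter = (bank & query).sum(axis=1)
    union = (bank | query).sum(axis=1)
    sims = inter / np.maximum(union, 1)
    return np.argsort(-sims)[:k], np.sort(sims)[::-1][:k]

# Toy data: a reference "molecule" and three others with 8-bit fingerprints.
fps = np.array([[1, 1, 0, 0, 1, 0, 0, 0],
                [1, 1, 0, 0, 1, 0, 0, 1],
                [0, 0, 1, 1, 0, 1, 1, 0],
                [1, 0, 0, 0, 0, 0, 0, 0]], dtype=np.uint8)
idx, sims = tanimoto_topk(fps[0], fps[1:], k=2)
print(idx, sims)  # nearest neighbor is index 0 with similarity 0.75
```

In the paper's setting, `cosine_topk` would be applied to Mol-PECO's molecular embeddings and `tanimoto_topk` to the bfps fingerprints, each returning the five retrieved neighbors shown in Fig. 7.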
