Chemical language modeling with structured state space sequence models

Structured state space sequence models (S4s) are an extension of discrete state space models, which are widely adopted in control engineering41. Discrete state space models map an input sequence u to an output sequence y through the learnable parameters \(\overline{\mathbf{A}}\in{\mathbb{R}}^{N\times N}\), \(\overline{\mathbf{B}}\in{\mathbb{R}}^{N\times 1}\), \(\overline{\mathbf{C}}\in{\mathbb{R}}^{1\times N}\), and \(\overline{\mathbf{D}}\in{\mathbb{R}}^{1\times 1}\), as follows:

$$\begin{aligned} x_{k} &= \overline{\mathbf{A}}\,x_{k-1} + \overline{\mathbf{B}}\,u_{k}\\ y_{k} &= \overline{\mathbf{C}}\,x_{k} + \overline{\mathbf{D}}\,u_{k}. \end{aligned}$$
(1)
In other words, discrete state space models define a "linear recurrence": at any step k, the k-th element of the input sequence, \(u_{k}\), is fed into the model and used to update the hidden state \(x_{k}\) and to generate an output \(y_{k}\). The matrices \(\overline{\mathbf{A}}\), \(\overline{\mathbf{B}}\), \(\overline{\mathbf{C}}\), and \(\overline{\mathbf{D}}\) control how the input and the hidden state are combined to produce the output (Fig. 1b).

Besides their recurrent formulation, discrete state space models can also be expressed as a convolution with the same set of parameters. By "unrolling" the linear recurrence (Eq. (1)), it can be shown that the output sequence y is obtained via a learnable convolution over the input sequence u:

$$y = u * \overline{\mathbf{K}},$$

(2)

where \(\overline{\mathbf{K}}\) is the convolution filter, parameterized via \(\overline{\mathbf{A}}\), \(\overline{\mathbf{B}}\), and \(\overline{\mathbf{C}}\) (see Supplementary Eqs. (1)–(4) for a detailed derivation). This convolutional representation reveals a key aspect of state space models: they learn explicitly from the entire sequence (via a global convolution) while preserving recurrent generation capabilities (Fig. 1b).
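To make this duality concrete, the following minimal NumPy sketch (illustrative, not the paper's implementation) runs the same randomly initialized state space model once as the recurrence of Eq. (1) and once as the convolution of Eq. (2), and checks that the two outputs coincide; the state size N and sequence length L are arbitrary.

```python
# Duality of Eqs. (1)-(2): one set of parameters, two equivalent computations.
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16
A = rng.normal(size=(N, N)) * 0.3   # state transition (A-bar), scaled for stability
B = rng.normal(size=(N, 1))         # input matrix (B-bar)
C = rng.normal(size=(1, N))         # output matrix (C-bar)
D = rng.normal(size=(1, 1))         # skip connection (D-bar)
u = rng.normal(size=L)              # input sequence

# Recurrent formulation, Eq. (1): x_k = A x_{k-1} + B u_k;  y_k = C x_k + D u_k.
x = np.zeros((N, 1))
y_rec = np.empty(L)
for k in range(L):
    x = A @ x + B * u[k]
    y_rec[k] = (C @ x + D * u[k]).item()

# Convolutional formulation, Eq. (2): y = u * K-bar, with the filter obtained
# by unrolling the recurrence: K-bar = (C B, C A B, C A^2 B, ...).
K = np.array([(C @ np.linalg.matrix_power(A, i) @ B).item() for i in range(L)])
y_conv = np.array([
    sum(K[i] * u[k - i] for i in range(k + 1)) + (D * u[k]).item()
    for k in range(L)
])

assert np.allclose(y_rec, y_conv)   # the two formulations give the same output
```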
Learning the optimal parameters of a discrete state space system, however, introduces vanishing gradients and numerical instabilities in the recurrent and convolutional formulations, respectively. Structured state space sequence models (S4s)35 tackle these issues by introducing additional structure into the model parameters (via so-called high-order polynomial projection operators33) and by reducing the unstable computations to the numerically stable Cauchy kernel42 (see ref. 35 for more detail). Ablation studies35 have shown that the added structure is key to achieving computational feasibility and strong performance on long sequences. This reformulation resolves the numerical instabilities encountered during model training and has made S4 state-of-the-art in several generative tasks that require learning long-distance relationships33,34,35. Motivated by its performance in other domains and by the potential benefits of its dual structure, here we introduce S4 to the molecular sciences for the first time.

We evaluated S4 for its ability to learn from, and generate, drug-like molecules and natural products across an array of tasks and in terms of multiple molecular properties. LSTMs and generative pretrained transformers (GPTs) were used as benchmarks, since they are the de facto approaches in chemical language modeling for de novo design2,7,8,25. Moreover, LSTM (recurrent training and generation) and GPT (holistic training and generation) constitute ideal benchmarks for S4, whose dual formulation (convolution during training, recurrence during generation) allows inspecting the effect of each of these aspects on the overall performance. Finally, the prospective de novo design of putative mitogen-activated protein kinase 1 (MAPK1) inhibitors, corroborated by molecular dynamics simulations, was performed to test the potential of S4 in real-world drug discovery scenarios.

Designing drug-like molecules

S4 was analyzed for its ability to design drug-like small molecules (SMILES length lower than 100 tokens) extracted from the ChEMBL database43, focusing on its ability to (1) learn the chemical syntax, (2) capture structural features relevant for bioactivity, and (3) design structurally diverse molecules.

Learning the SMILES syntax

All investigated CLMs were trained on 1.9M canonical SMILES strings extracted from ChEMBL v3143. The generated strings were evaluated according to their (1) validity, i.e., the number (and frequency) of SMILES strings corresponding to chemically valid molecules; (2) uniqueness, i.e., the number (and frequency) of structurally unique molecules among the designs; and (3) novelty, i.e., the number (and frequency) of unique and valid designs not included in the training set. A high number of chemically valid designs suggests that the model has learned how to generate plausible molecules, while high uniqueness and novelty values indicate little redundancy among the designs and with the training set, respectively. Although these metrics are vulnerable to trivial baselines44, they provide insights into a model's capacity to learn the SMILES "syntax".
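As a reference, the three metrics can be computed along the following lines with RDKit; this sketch uses one common convention (uniqueness measured among valid designs, novelty among unique ones), which may differ in detail from the paper's evaluation pipeline.

```python
# A minimal sketch of validity, uniqueness, and novelty, assuming RDKit.
from rdkit import Chem

def syntax_metrics(designs, training_set):
    """Validity, uniqueness, and novelty of generated SMILES strings."""
    # Validity: strings that RDKit can parse into a molecule.
    valid = [Chem.MolToSmiles(m) for m in
             (Chem.MolFromSmiles(s) for s in designs) if m is not None]
    # Uniqueness: structurally unique molecules among the valid designs,
    # compared on canonical SMILES.
    unique = set(valid)
    # Novelty: unique valid designs absent from the (canonicalized) training set.
    train_canon = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_set}
    novel = unique - train_canon
    n = len(designs)
    return {"validity": len(valid) / n,
            "uniqueness": len(unique) / max(len(valid), 1),
            "novelty": len(novel) / max(len(unique), 1)}
```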
All CLMs generated more than 91% valid, 91% unique, and 81% novel molecules (Table 1). Moreover, their designs approximated the training and test sets in terms of selected molecular properties (octanol-water partition coefficient45, quantitative estimate of drug-likeness46, Bertz complexity47, and synthetic accessibility48,49), with no notable differences among architectures (Supplementary Fig. 1 and Supplementary Table 1). These results agree with the literature on CLMs (e.g., refs. 2,50) and demonstrate the robustness of the model training procedure. S4 designed the most valid, unique, and novel molecules, generating approximately 4000 to 12,000 more novel molecules than the benchmarks, and displays a good ability to learn the chemical "syntax" of SMILES strings. The potential of S4 in comparison with existing de novo design approaches was further corroborated on the MOSES benchmark51, where S4 consistently scored among the top-performing deep learning approaches (Supplementary Table 2).

Table 1 Designing drug-like molecules de novo with S4

To shed additional light on the strengths and limitations of S4 in comparison with the benchmarks, we analyzed the sources of invalid molecule generation for all methods, in terms of branching and ring errors, erroneous bond assignment, and other (miscellaneous) syntax issues (Fig. 2). Interestingly, each method shows different types of errors leading to SMILES invalidity. LSTM struggles the most with branching and performs best on bond assignment, while GPT struggles the most with rings and bond assignment, and has intermediate performance otherwise. S4 struggles more than LSTM with bond assignment, but makes remarkably fewer branching and ring errors than both benchmarks. Our hypothesis is that correct bond assignment indicates good learning of "short-range" dependencies, while branching and ring opening and closure require capturing "long-range" relationships. This suggests that S4 captures relatively long-distance relationships well, in agreement with existing evidence from other domains33,34,35.

Fig. 2: SMILES design errors, grouped by category and CLM architecture. Each CLM was trained on ChEMBL and used to design 102,400 SMILES strings. The invalid designs were categorized per error type, and the reported values indicate the number of errors in each category.

Capturing bioactivity

We evaluated S4 for its ability to learn elements of bioactivity. With CLMs, this is often achieved via transfer learning52, which transfers knowledge acquired on one task to another task with fewer available data. After pre-training a CLM on a large corpus of SMILES strings, the model can be "fine-tuned" on a smaller, task-focused set (e.g., bioactive molecules) by additional training22.
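A minimal PyTorch sketch of this transfer-learning step is shown below; the model interface, the tokenized inputs, and the hyperparameters are illustrative placeholders rather than the paper's setup.

```python
# A fine-tuning sketch: continue next-token training of a pretrained CLM on a
# small, task-focused set (e.g., known actives of one target), typically at a
# reduced learning rate. All names and defaults here are assumptions.
import torch
import torch.nn as nn

def fine_tune(model: nn.Module, smiles_batches, epochs: int = 5,
              lr: float = 1e-4, pad_idx: int = 0):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_idx)
    model.train()
    for _ in range(epochs):
        for tokens in smiles_batches:          # tokens: (batch, seq_len) long tensor
            inputs, targets = tokens[:, :-1], tokens[:, 1:]
            logits = model(inputs)             # (batch, seq_len - 1, vocab)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```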
Here, we performed five fine-tuning campaigns, focusing on distinct macromolecular targets from the LIT-PCBA53 dataset: (1) pyruvate kinase muscle isoform 2 (PKM2), (2) mitogen-activated protein kinase 1 (MAPK1), (3) glucocerebrosidase (GBA), (4) mechanistic target of rapamycin (mTORC1), and (5) cellular tumor antigen p53 (TP53).

Evaluating the bioactivity of de novo designs (short of synthesis and wet-lab testing) is non-trivial, since this property cannot be fully captured by traditional molecular descriptors and might not be accurately predicted by quantitative structure-activity relationship models54,55. Hence, we used experimentally tested molecules to evaluate retrospectively the capacity of a CLM to learn elements of bioactivity. Several studies have shown that the likelihoods learned by a CLM during fine-tuning can be used to prioritize designs with high chances of being bioactive6,56,57. Based on the same principle, here we used the likelihoods learned by the CLMs to rank existing molecules and evaluated their capacity to prioritize bioactive compounds over inactive ones.

For each of the selected targets, bioactive molecules (Supplementary Table 3) were used for fine-tuning, with ten random training-validation-test splits. After fine-tuning the CLMs on each target, for each training-test split, we proceeded as follows (a scoring sketch follows the list):

(1)

With each fine-tuned model, and for each target, we predicted the likelihoods (Eq. (4)) of the SMILES strings in the respective test set. The test sets resemble a real-world scenario in terms of hit rate: they comprise 11 (mTORC1) to 56 (PKM2) active molecules and 10,240 inactive molecules (except for TP53, whose test set contains 3301 inactive molecules; Supplementary Table 3);

(2)

We ranked the molecules of the test set according to the predicted likelihoods (Eq. (5));

(3)

For each target and each test set, we computed the fraction of actives ranked among the top 10, top 50, and top 100 molecules. The higher the number of active molecules ranked in early portions of the test set by a CLM, the better the model has learned what is relevant for bioactivity on the investigated target after fine-tuning.

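The following sketch illustrates steps (1)-(3) for a generic autoregressive CLM; `sequence_loglik` stands in for the likelihood-based score of Eqs. (4)-(5), and the enrichment is reported as the fraction of all known actives retrieved among the top-k ranked test molecules, as described above.

```python
# A minimal retrospective-enrichment sketch in PyTorch; model interface and
# data layout are assumptions for illustration.
import torch
import torch.nn.functional as F

def sequence_loglik(model, tokens: torch.Tensor) -> float:
    """Sum of next-token log-probabilities for one tokenized SMILES (1D tensor)."""
    logits = model(tokens[:-1].unsqueeze(0)).squeeze(0)    # (seq_len - 1, vocab)
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(1, tokens[1:].unsqueeze(1)).sum().item()

def enrichment(model, test_set, top_k=(10, 50, 100)):
    """test_set: list of (tokens, is_active) pairs. Rank by likelihood, then
    report the fraction of actives found among the top-k molecules."""
    ranked = sorted(test_set, key=lambda pair: sequence_loglik(model, pair[0]),
                    reverse=True)                          # highest score first
    n_actives = sum(active for _, active in test_set)
    return {k: sum(active for _, active in ranked[:k]) / n_actives
            for k in top_k}
```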
Our results show variable performance depending on the target (Fig. 3). The most challenging target is TP53, for which no model could consistently retrieve actives among the top 10 scoring molecules. Notably, this target has the most challenging test set, whose inactive molecules are similar to the actives of both the training and the test sets (Supplementary Fig. 2), potentially indicating the presence of activity cliffs58. MAPK1 and mTORC1 also challenge the CLMs; here, S4 retrieved more active molecules than the benchmarks, especially in the early portions of the test set. PKM2 and GBA are the easiest datasets; here, all CLMs identified bioactive molecules in their top 10, with S4 achieving the highest median across the board. A Wilcoxon signed-rank test59 on the pooled scores across datasets supports the superior performance of S4 compared to the benchmarks (p [top 10] = 8.41e−6, p [top 50] = 2.93e−7, p [top 100] = 1.45e−7 compared to LSTM; p [top 10] = 2.33e−3, p [top 50] = 3.72e−3, p [top 100] = 2.61e−2 compared to GPT), and of GPT compared to LSTM (p [top 10] = 5.22e−3, p [top 50] = 3.75e−5, p [top 100] = 2.02e−6).

Fig. 3: Retrospective enrichment analysis for all models across five selected macromolecular targets. The fine-tuned models were used to rank the held-out actives and inactives of the respective protein targets. The percentage of known actives ranked within the considered number of test set molecules (10, 50, 100) was computed across ten runs. Bar heights report the median across runs and error bars report the first and third quartiles (n = 10). Source data are provided as a Source Data file.

Under the constraints of the study design, these results indicate that processing the input SMILES "holistically" (as GPT and S4 do) better captures complex properties like bioactivity, with the best performance obtained by S4.

Chemical space exploration

We analyzed the ability of S4 to explore the chemical space, in terms of generating structurally diverse and bioactive molecules. To this end, we employed a commonly used strategy with CLMs: varying the sampling temperature (T) to control chemical diversity60. T affects which elements of a string are generated by weighted random sampling (Eq. (3)). When T → 0, the most likely element (according to the CLM prediction) is selected as the next element of the sequence; the higher the T, the more random the selection. T = 1 corresponds to using the CLM predictions as the sampling probability of each element at each generation step.

We experimented with increasing sampling temperatures (from T = 1.0 to T = 2.0, with a step of 0.25). Each T value was used to generate 10,240 SMILES strings per model, across the five chosen targets and all training-test splits.
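A minimal sketch of temperature sampling (cf. Eq. (3)) is shown below; the function assumes a vector of next-token logits from any CLM.

```python
# Temperature-controlled sampling: logits are divided by T before the softmax,
# so T -> 0 approaches greedy decoding and larger T flattens the distribution
# toward uniform random choices.
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Weighted random draw of the next SMILES token from a CLM's logits."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```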

Then, we evaluated the designs based on three metrics (Fig. 4):

Validity: the validity of the generated strings captures how robust the model is, under increasing degrees of randomness, in preserving a correct syntax. The higher the validity, the better.

Rediscovery rate: de novo design models are often evaluated for their capacity to reproduce existing molecules with experimentally verified biological activities50,55. For this purpose, we used the held-out actives previously described for each target. Moreover, to "relax" the rediscovery criterion, we counted as rediscovered any held-out active with a substructure similarity higher than 60% to a de novo design (computed via Tanimoto similarity on extended connectivity fingerprints61). Higher rediscovery rates at increased temperature values indicate that the model can explore regions related to bioactivity despite increased randomness.

Scaffold diversity: designing molecules with novel scaffolds is relevant for lead identification62 and can be used as a proxy to evaluate CLMs51. To better assess what constitutes a novel scaffold, the novel designs were grouped into clusters based on their scaffold similarity. This was achieved via hierarchical clustering of Bemis-Murcko scaffolds63, grouping designs whose Tanimoto similarity on the corresponding extended connectivity fingerprints61 exceeded 60%. Only novel and unique scaffolds were considered. We then counted the number of scaffold clusters obtained; the higher, the better (a clustering sketch follows this list).

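A minimal sketch of the scaffold-clustering metric, assuming RDKit and SciPy, is shown below; the 60% similarity cut-off follows the description above, while the single linkage and fingerprint parameters (radius 2, 2048 bits) are illustrative choices, not necessarily the paper's.

```python
# Scaffold-diversity count: Bemis-Murcko scaffolds, Morgan (ECFP-like)
# fingerprints, Tanimoto distances, hierarchical clustering cut at 60%
# similarity (i.e., distance 0.4).
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import MurckoScaffold
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def scaffold_cluster_count(smiles_list, sim_threshold=0.6):
    """Number of clusters of similar Bemis-Murcko scaffolds among the designs."""
    mols = [m for m in (Chem.MolFromSmiles(s) for s in smiles_list) if m]
    scaffolds = {Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(m))
                 for m in mols}                          # unique scaffolds only
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
           for s in scaffolds]
    n = len(fps)
    if n < 2:
        return n
    dist = np.zeros((n, n))                              # Tanimoto distance matrix
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])
            dist[i, j] = dist[j, i] = d
    labels = fcluster(linkage(squareform(dist), method="single"),
                      t=1.0 - sim_threshold, criterion="distance")
    return int(labels.max())
```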
Fig. 4: Model performance when varying the temperature value. Each model was analyzed for its performance when varying the temperature from 1.0 to 2.0, with a step of 0.25, sampling 10,240 molecules per value. a SMILES validity across temperatures. b Variation of the rediscovery rate. The models were evaluated for their capability to rediscover bioactive molecules not used for model training, or to design molecules similar in structure (i.e., with a Tanimoto similarity on extended connectivity fingerprints higher than 60%). c Number of diverse scaffold groups generated per method. Scaffolds were clustered together if their Tanimoto similarity (computed on extended connectivity fingerprints) was larger than 60%. For each plot, the solid line indicates the median obtained across the five analyzed protein targets (PKM2, MAPK1, GBA, mTORC1, and TP53) with ten runs each (n = 50), and the shaded area indicates the inter-quartile range. The statistics per individual target can be found in Supplementary Fig. 3. Source data are provided as a Source Data file.

The models display similar trends with increasing T values for all the analyzed factors across datasets, with varying magnitude (Fig. 4). In general, validity decreases with increasing temperature (as previously observed60), with the strongest effect observed for GPT (median validity across training setups dropping below 40%, Fig. 4a). Both S4 and LSTM are more robust than GPT to increasing temperature values (with LSTM performing slightly better for T ≥ 1.75), suggesting that sequential generation can boost chemical space exploration. S4 outperforms LSTM in terms of rediscovery rate (Fig. 4b), in agreement with our previous results on bioactivity (Fig. 3). We also computed the exact rediscovery rate (identical molecular structure) and observed that no model can consistently regenerate held-out actives. When it comes to the diversity of the designs (Fig. 4c), LSTM generates the highest number of structurally unique scaffolds (median across datasets and setups: 6602 clusters at T = 1.75), with S4 a close second (6520 clusters at T = 1.75). While GPT performs suboptimally across the board, LSTM seems preferable for chemical space exploration when bioactivity is not the main objective, whereas S4 better captures bioactivity while preserving good chemical space exploration, combining the strengths of the two benchmarks thanks to its dual structure. These results confirm the promise of S4 for generating structurally diverse and bioactive drug-like molecules.

Designing natural products

S4 was further tested on more challenging molecular entities than drug-like molecules. To this end, we evaluated its capacity to design natural products (NPs), which are invaluable sources of inspiration for medicinal chemistry64,65. Compared to synthetic small molecules, NPs tend to possess more intricate molecular structures and ring systems, as well as a larger fraction of sp3-hybridized carbon atoms and chiral centers66,67,68. These characteristics correspond, on average, to longer SMILES sequences with more long-range dependencies, and make natural products a challenging test case for CLMs19,69.

We trained the CLMs on large natural products (32,360 SMILES strings with length > 100 tokens, chosen to complement the previous analysis) from the COlleCtion of Open Natural ProdUcTs (COCONUT) database70.
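The token-length criterion presupposes a SMILES tokenizer. A commonly used regex-based tokenization is sketched below (the paper's exact token vocabulary may differ); it treats bracket atoms, two-letter elements, and two-digit ring closures as single tokens, so token length differs from character length.

```python
# A regex-based SMILES tokenizer sketch and a token-length filter.
import re

SMILES_TOKENIZER = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens, e.g. 'Clc1ccccc1' -> ['Cl', 'c', '1', ...]."""
    return SMILES_TOKENIZER.findall(smiles)

def longer_than(smiles_iter, n_tokens: int = 100):
    """Keep only SMILES whose token count exceeds the threshold."""
    return [s for s in smiles_iter if len(tokenize(s)) > n_tokens]
```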
We then used the CLMs to design 102,400 SMILES strings de novo and computed the fraction of valid, unique, and novel designs (Table 2). All CLMs can design natural products, albeit with lower performance than on drug-like molecules. S4 designs the highest number of valid molecules (approximately 6000 to 12,000 more than the benchmarks, i.e., 7–13% better), while LSTM achieves the highest novelty (approximately 2000 molecules, or 2%, more than S4).

Table 2 Natural product design with CLMs

To further investigate the characteristics of the designs, we computed the natural-product likeness71, which captures how similar a molecule is to the chemical space covered by natural products in terms of its substructures (the higher the NP-likeness, the more similar). The novel designs of S4 have significantly higher NP-likeness values than those of the benchmarks (Mann–Whitney U test, p = 1.41e−53 compared to LSTM and p = 1.02e−82 compared to GPT), on average closer to the values of the training and test sets (Table 2). Moreover, the NP-likeness values of S4's designs better match the distribution of the COCONUT molecules in terms of the Kolmogorov–Smirnov (KS) distance72, which quantifies how much the cumulative distributions of two samples differ (between 0% and 100%; the lower, the closer the distributions).
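For reference, the KS distance between two property distributions can be computed with SciPy as follows; expressing the statistic as a percentage matches the convention above.

```python
# The two-sample Kolmogorov-Smirnov statistic is the maximum gap between the
# two empirical CDFs, here reported as a percentage (lower = closer).
from scipy.stats import ks_2samp

def ks_distance_percent(values_a, values_b) -> float:
    """KS distance between two sets of property values (e.g., NP-likeness
    scores of de novo designs vs. COCONUT molecules)."""
    return 100.0 * ks_2samp(values_a, values_b).statistic
```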
In addition to NP-likeness, we evaluated the novel designs in terms of several structural properties important for natural products66,67,68, namely: the number of sp3-hybridized carbon atoms, aliphatic rings, spiro atoms, and heavy atoms, as well as the molecular weight and the size of the largest fused ring system. These properties provide additional evidence on the molecular characteristics of the designs and their structural complexity in comparison with the training natural products. Here, S4 achieved the lowest KS distance to the training and test sets across properties, indicating that its designs match the training natural products best. These results confirm the ability of S4 to learn complex molecular properties for de novo design.

Finally, we analyzed the training and generation speed of the CLM architectures with increasing SMILES length, to test their practical applicability when designing bigger molecules such as natural products. S4 is as fast as GPT during training (both are approximately 1.3 times faster than LSTM) and is the fastest at generation (Supplementary Fig. 4), thanks to its dual formulation. This further advocates for the introduction of S4 as an efficient approach for molecule design that "makes the best of both worlds" compared to GPT and LSTM.

Prospective de novo design

We conducted a prospective in silico study with S4, focused on designing inhibitors of mitogen-activated protein kinase 1 (MAPK1), a relevant target for oncological therapies73. The putative bioactivity of the designs was then evaluated via molecular dynamics (MD) simulations.

The S4 model previously pre-trained on ChEMBL was fine-tuned on the SMILES strings of 68 manually curated ChEMBL inhibitors with an experimental inhibition constant (Ki) lower than 1 μM on MAPK1. The models from the last five fine-tuning epochs were then used to generate 256K molecules (51,200 designs per T value, ranging from 1.0 to 2.0 with a step of 0.25).

The designs were ranked and filtered via the log-likelihood score (Eq. (5)) and scaffold similarity to the training set (see "Materials and methods" for further details). The ten top-scoring molecules (1–10, Fig. 5a and Table 3) were considered for further characterization via MD simulations. As a reference for evaluation, we also performed MD simulations for the closest fine-tuning neighbor of each considered design (compounds 11–16, selected based on scaffold similarity; Fig. 5a and Table 3). The absolute protein-ligand binding free energy (expressed as ΔG; the lower, the stronger the predicted binding) of molecules 1–16 was computed via Umbrella Sampling74 (Table 3). The computed ΔG values for the known bioactive molecules (11–16) correspond well with the experimental Ki values from ChEMBL (Table 3), confirming the validity of the chosen MD protocol.

Fig. 5: Prospective de novo design of putative MAPK1 inhibitors with S4. a Selected de novo designs (molecules 1–10) for further characterization. For each de novo design, its most similar training MAPK1 inhibitor (as reported in Table 3) is depicted (compounds 11–16, gray box). The ligand binding poses (obtained via Umbrella Sampling) of selected designs interacting with MAPK1 (PDB-ID: 2Y9Q), in comparison with their most similar bioactive molecule from the fine-tuning set, are also depicted: b design 2 (green) compared with compound 12 (gray); c design 3 (yellow) compared with compound 13 (gray); d design 9 (blue) compared with compound 13 (gray).

Table 3 In silico prospective study on designing mitogen-activated protein kinase 1 (MAPK1) inhibitors with S4

Eight out of ten designs (all except 1 and 5) showed a high predicted affinity (Table 3), with ΔG values ranging from −10.3 ± 0.6 kcal mol−1 (7) to −23 ± 4 kcal mol−1 (2). Interestingly, these affinities are comparable to, or even surpass, those of the closest active neighbors (ΔG = −9.1 ± 0.8 kcal mol−1 to ΔG = −13 ± 2 kcal mol−1). The global substructure similarity (measured on extended connectivity fingerprints) of the designs to their closest neighbor ranges from 31% (10) to 87% (4, Table 3).

The most potent design according to the MD predictions is molecule 2 (ΔG = −23 ± 4 kcal mol−1, Table 3). This molecule, the largest among the designs (Fig. 5a), engages extensively with the binding pocket of MAPK1 (Fig. 5b), which explains the remarkably favorable predicted affinity. Design 2 has limited substructure similarity to its closest bioactive neighbor (molecule 12, similarity of 57%); however, its synthetic accessibility may be limited. Design 3 has the second highest predicted affinity (ΔG = −19.6 ± 0.9 kcal mol−1) and shares its scaffold with compound 13. Design 3 differs from 13 by the replacement of the ether and hydroxy moieties with two fluorine atoms, and by the addition of a methoxy group (Fig. 5a; global similarity of 65%). Interestingly, this structural modification leads to an improvement of the predicted ΔG value of approximately −10 kcal mol−1, possibly because the fluorine atoms allow the molecule to penetrate more deeply into the binding pocket (Fig. 5c). Halogens are, in fact, favorable for MAPK1 binding, as evident from the fine-tuning molecules (91% of which contain at least one halogen) and from the literature (e.g., refs. 75,76,77,78).
Evidence of favorable halogen positioning exists for both the "top"75,76 and the "bottom"77,78 of the binding pocket, further supporting the predicted affinity of design 3.

Design 9 (ΔG = −17 ± 2 kcal mol−1) features halogens on both sides, unlike its closest neighbor, molecule 13 (ΔG = −10.5 ± 0.7 kcal mol−1, global similarity of 33%), from which it also differs in the moiety attached to the pyridone ring (Fig. 5a). Inspecting the predicted binding pose shows that the halogen-substituted aromatic ring, the hydroxyl, and the pyridone carbonyl occupy the same region of the binding groove (Fig. 5d). The difference in ΔG values (approximately 6.5 kcal mol−1 in favor of design 9) could be ascribed, as for molecule 3, to the presence of halogens in the lower region of the binding pocket. This might also explain the high predicted affinity of design 10 (ΔG = −15 ± 2 kcal mol−1), which differs from 9 by a carbonyl and a methyl group.

With eight out of ten designs predicted by MD as bioactive on the intended target, with predicted affinities comparable to or higher than those of their closest fine-tuning molecules, these results further corroborate the potential of S4 for de novo drug design.

Opportunities for molecular S4

In conclusion, this study pioneered the introduction of state space models into chemical language modeling, with a focus on structured state spaces (S4s). The unique dual nature of S4s, combining convolution during training with recurrent generation, makes them particularly intriguing for de novo design from SMILES strings.

Our systematic comparison with GPT and LSTM on a variety of drug discovery tasks revealed S4's strengths: while recurrent generation (LSTM and S4) is superior in learning the chemical syntax and exploring diverse scaffolds, learning holistically from the entire SMILES sequence (GPT and S4) excels at capturing certain complex properties, like bioactivity. S4, with its dual nature, "makes the best of both worlds": it demonstrated comparable or better performance than LSTM in designing valid and diverse molecules, and systematically outperformed both benchmarks in capturing complex molecular properties, all while maintaining computational efficiency.

The application of S4 to MAPK1 inhibition, validated by MD simulations, further showcases its potential to design potent bioactive molecules. In the future, we will apply S4 prospectively in combination with wet-lab experiments to enhance its impact in the field. Strategies to increase the structural diversity of the considered designs, such as SMILES augmentation79 and improved ranking protocols, could further boost its potential in medicinal chemistry.

Several aspects of S4 await exploration in the molecular sciences, such as its potential with longer sequences (e.g., macrocyclic peptides and protein sequences) and on additional molecular tasks (e.g., organic reaction planning80 and structure-based drug design81).

In the future, we envision the relevance of S4 for molecule discovery to increase, potentially replacing widely established chemical language models like LSTM and GPT. We believe that the provided open-access code will contribute to the adoption and expansion of S4, further stretching the boundaries of chemical language modeling.
