Exhaustive local chemical space exploration using a transformer model

Transformer model with a regularized loss function

A transformer11,24,25 architecture trained on a SMILES (simplified molecular-input line-entry system) string representation of molecules was used. The purpose of the study is to compare our proposed framework for training a molecular transformer with a regularized loss function against the existing unconstrained training framework25. Our framework can, however, be applied in general to any transformer model based on molecular pairs constructed with a similarity metric. Formally, we denote with \({\mathcal{X}}\) the chemical space and with \({\mathcal{P}}=\{(s,t)\,|\,s,t\in {\mathcal{X}}\}\) the set of molecular pairs constructed from \({\mathcal{X}}\), where s and t are the source and target molecules, respectively. Each element of a source or target sequence is a token (symbol) taking values from a vocabulary (ordered set) V. Furthermore, we denote with \({f}_{\theta }:{\mathcal{X}}\times {\mathcal{X}}\to {[0,1]}^{| V| }\) the transformer, with θ the set of its parameters. To simplify the notation, we denote a molecule and its SMILES string representation with the same symbol. The transformer fθ consists of an encoder and a decoder, which are simultaneously fed source molecules and target molecules, respectively. Source and target molecules are first tokenized, then converted into sequences of dense vector embeddings, and passed through several stacked encoder and decoder layers, respectively. The final encoder output is merged into the decoder through a multi-head attention mechanism, which captures the transformation from a source molecule to a target molecule. Figure 5 shows the general architecture of our transformer model; more details about the transformer architecture can be found in refs. 11,24,25.

Fig. 5: Transformer training and inference flows. The training flow is depicted to the left (a), the inference flow to the right (b). a The transformer model receives as input pairs of molecules consisting of a source molecule and a target molecule (represented as SMILES strings). The SMILES strings associated with the source and target are CC(Cc1ccc(Cl)cc1)NCC(O)C(N)=O and CC(Cc1ccc(O)cc1)NCC(O)C(N)=O, respectively. The initial character of the target SMILES provided as input is the starting token ˆ. The model is trained to transform a source molecule into a target molecule by producing its next tokens (in parallel); the last character of the produced output is the ending token $. During training, molecular pairs of source and target molecules are used to train the transformer model. b At inference, a source molecule is transformed into several target molecules; in the figure, these are sampled target1 and sampled target2, corresponding to the SMILES CC(Cc1ccc(Cl)cc1)NCC(O)C(N)=S and CC(Cc1ccc(Cl)cc1)NCC(F)C(N)=O, respectively.

fθ models the probability p of the ℓ-th token of a target, tiℓ, given all the previous target tokens ti,1:ℓ−1 = tiℓ−1, …, ti1 and the source compound si, i.e., fθ(ti,1:ℓ−1, si)[tiℓ] = p(tiℓ∣ti,1:ℓ−1, si).
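To make the tokenized input/output arrangement of Fig. 5 concrete, the short Python sketch below prepares one source-target pair for teacher forcing. The regular-expression tokenizer and the ASCII characters ^ and $ standing in for the start and end tokens are simplifications for illustration only; the actual vocabulary is built from the training data and contains 455 tokens.

```python
import re

# Illustrative SMILES tokenizer: bracket atoms and two-letter elements such as
# Cl/Br are kept as single tokens; everything else becomes one token per character.
# The vocabulary used in the paper is larger (455 tokens) and built from the data.
TOKEN_RE = re.compile(r"(\[[^\]]+\]|Br|Cl|%\d{2}|.)")

def tokenize(smiles: str):
    return TOKEN_RE.findall(smiles)

source = "CC(Cc1ccc(Cl)cc1)NCC(O)C(N)=O"
target = "CC(Cc1ccc(O)cc1)NCC(O)C(N)=O"

src_tokens = tokenize(source)
# Teacher forcing: the decoder input is the target prefixed with the start token
# (written "^" here), and the training labels are the target followed by the end
# token "$", so the model predicts each next token in parallel.
decoder_input = ["^"] + tokenize(target)
labels = tokenize(target) + ["$"]

print(src_tokens)
print(decoder_input)
print(labels)
```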
The transformer’s parameters θ are then trained on \({\mathcal{P}}\) by minimizing the negative log-likelihood (NLL) of the entire target SMILES string, p(ti∣si), for all pairs \(({s}_{i},{t}_{i})\in {\mathcal{P}}\), \(i=1,\ldots,| {\mathcal{P}}|\), that is
$$\begin{aligned}{\text{NLL}}_{i} &= -\log p({t}_{i}\,|\,{s}_{i})=-\log \prod_{\ell=1}^{L}p({t}_{i\ell }\,|\,{t}_{i\ell -1},\ldots,{t}_{i1},{s}_{i})\\ &= -\sum_{\ell=1}^{L}\log p({t}_{i\ell }\,|\,{t}_{i\ell -1},\ldots,{t}_{i1},{s}_{i})=-\sum_{\ell=1}^{L}\log {f}_{\theta }({t}_{i,1:\ell -1},{s}_{i})[{t}_{i\ell }],\end{aligned}$$
(2)
where L denotes the number of tokens associated with ti, and [tiℓ] denotes the index of the vector fθ(ti,1:ℓ−1, si) corresponding to the token tiℓ. The NLL quantifies the precedence (probability) of transforming a given source molecule into a specific target molecule. The NLL is always non-negative, and the higher the NLL value, the less likely a target molecule is to be generated; an NLL equal to 0.0 would imply that a specific target molecule is generated from the source molecule with probability 1.0.

The loss in Eq. (2) allows the transformer to learn, but it associates equal probabilities (i.e., the same negative log-likelihood) with all the target compounds associated with the same source molecule. This uniform assignment occurs because the model observes these target compounds an equal number of times. This behavior is not ideal, since during inference we would like the probability of generating a target molecule given a specific source molecule, p(t∣s), to be proportional to the similarity between the molecules. To mitigate this issue, we introduce in Eq. (3) a regularization term for the loss in Eq. (2), which penalizes the NLL if the ordering relative to a similarity metric is not respected:
$$\Omega (({s}_{i},{t}_{i}),({s}_{j},{t}_{j}))=\max (0,\,{\rm{sign}}(\kappa ({s}_{i},{t}_{i})-\kappa ({s}_{j},{t}_{j}))\,({\rm{NLL}}_{i}-{\rm{NLL}}_{j})),$$
(3)
where (si, ti) and (sj, tj) are two pairs from \({\mathcal{P}}\), and κ is an arbitrary kernel function. The Tanimoto similarity was chosen as κ, but our framework is general, so any valid kernel can be used. Note that the NLL is always non-negative in this context. During training, we sample a batch of source-target molecule pairs and compute the regularization term in Eq. (3) for all the molecular pairs included in the batch. Figure 6 depicts an example of the ranking loss calculation. Finally, to train the model we propose the following loss function, combining Eqs. (2) and (3):
$${\mathcal{L}}=\frac{1}{| {\mathcal{P}}| }\sum_{i=1}^{| {\mathcal{P}}| }{\text{NLL}}_{i}+\frac{\lambda }{| {\mathcal{P}}| (| {\mathcal{P}}| -1)}\sum_{i=1}^{| {\mathcal{P}}| -1}\ \sum_{j=i+1}^{| {\mathcal{P}}| }\Omega (({s}_{i},{t}_{i}),({s}_{j},{t}_{j})),$$
(4)
where λ ≥ 0 is a hyper-parameter controlling the weight of the regularization term (a minimal code sketch of this combined loss follows the figure captions below).

In this study, molecular similarity is related to the notion of precedence. Precedence is learnt by a transformer model from the training data and represents the probability of a source compound being transformed into a specific target compound. The NLL can be used as a proxy for precedence, where a low NLL means a high precedence and vice versa. Figure 7 illustrates an example where a nitrogen atom in penicillin is replaced by an oxygen, phosphorus, or arsenic atom, respectively. It is important to notice that the Tanimoto similarities (calculated with the count version of ECFP4 fingerprints) between penicillin and the modified penicillin analogs are exactly the same, while their NLLs differ: the arsenic analog has lower precedence than the phosphorus analog, the phosphorus analog has lower precedence than the oxygen analog, and the oxygen analog has lower precedence than penicillin itself. We can therefore expect the oxygen analog to be retrieved with a much higher precedence (probability) during inference than the phosphorus or arsenic analogs.

Fig. 6: Two pairs of molecules (s1, t1), (s2, t2) with Tanimoto similarity 0.71 and 0.76, respectively. The Tanimoto similarity is represented by κ in the figure; s1 and s2 denote the two source molecules, while t1 and t2 denote the two target molecules. NLL1 and NLL2 are the negative log-likelihoods associated with (s1, t1) and (s2, t2), respectively. In the left example NLL2 > NLL1, so the regularization term Ω (see Eq. (3)) is greater than 0. This indicates that NLL1 and NLL2 are incorrectly ordered, which we emphasize by enclosing Ω = 1.8 in a red box in the left picture. In the right example NLL1 > NLL2; in this case Ω = 0, as the similarities and NLLs have the same rank-order, which we emphasize by enclosing Ω = 0.0 in a green box in the right picture.

The implementation relies on Python 3.9 with the following libraries: cupy-cuda11 10.6.0, lightning 2.0.1, MolVS 0.1.1, natsort 7.1.1, numpy 1.24.4, pyyaml 6.0.1, rdkit 2022.9.7, scipy 1.10.1, torch 1.12.1, and tqdm 4.65.0. The datasets used in the paper are available at https://doi.org/10.5281/zenodo.12818281. Source data are provided with this paper.

Fig. 7: Penicillin (used here as the source molecule) and its analogs where a nitrogen atom (N) is replaced with an oxygen atom (O), a phosphorus atom (P), and an arsenic atom (As), respectively. The area enclosed within the orange circles in the lower part of the figure contains the atoms that differ from penicillin. The molecular similarity, denoted with S in the figure, between penicillin and the modified analogs is the same, while their corresponding NLLs (negative log-likelihoods, or precedences) differ. The molecular transformation from penicillin to the arsenic analog has lower precedence (higher NLL) than the transformations to the phosphorus and oxygen analogs, respectively; the analog with an oxygen atom has the highest precedence after penicillin. The Tanimoto similarity is denoted with the letter S on top of the penicillin analogs in the lower part of the figure.
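As a minimal sketch (not the authors' exact implementation), the batch-level combination of Eqs. (2)-(4) described above can be written in PyTorch as follows. It assumes precomputed decoder logits, the padding token at index 0 (the * token), and λ = 10 as used in this study; the batch plays the role of \({\mathcal{P}}\) in the regularization term, as described in the text.

```python
import torch
import torch.nn.functional as F

def ranking_regularized_loss(logits, target_ids, tanimoto, lam=10.0, pad_idx=0):
    """Batch version of Eq. (4): mean sequence NLL (Eq. (2)) plus the pairwise
    ranking penalty Omega (Eq. (3)).

    logits:     (B, L, |V|) decoder outputs for a batch of source-target pairs
    target_ids: (B, L) token indices of the target SMILES ("$"-terminated, "*"-padded)
    tanimoto:   (B,) Tanimoto similarity kappa(s_i, t_i) for each pair in the batch
    """
    B, L, V = logits.shape

    # Per-token cross-entropy summed over the sequence gives NLL_i (Eq. (2));
    # padded positions (token "*", index 0) are ignored.
    token_nll = F.cross_entropy(
        logits.reshape(B * L, V), target_ids.reshape(B * L),
        ignore_index=pad_idx, reduction="none",
    ).reshape(B, L)
    nll = token_nll.sum(dim=1)  # (B,)

    # Pairwise ranking penalty (Eq. (3)): if pair i is more similar than pair j,
    # its NLL should be lower; any violation of that order is penalized.
    sign = torch.sign(tanimoto.unsqueeze(1) - tanimoto.unsqueeze(0))          # (B, B)
    omega = torch.clamp(sign * (nll.unsqueeze(1) - nll.unsqueeze(0)), min=0.0)

    # omega is symmetric with a zero diagonal, so the full sum counts every
    # unordered pair twice; dividing by 2*B*(B-1) matches the normalization
    # of the i<j sum in Eq. (4).
    reg = omega.sum() / (2 * B * (B - 1))

    return nll.mean() + lam * reg
```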
Approximately exhaustive sampling of the chemical space

Two different techniques can be used to sample target molecules with a molecular transformer: multinomial sampling (used, e.g., by He et al.24,25) and beam search34. Multinomial sampling allows fast generation of compounds distributed according to their NLL. Given a source compound \(s\in {\mathcal{X}}\), the length L of the token sequence of the SMILES string associated with s, and the vocabulary V, i.e., the set of possible tokens, the computational complexity of multinomial sampling is \({\mathcal{O}}(L\cdot | V| )\). However, multinomial sampling suffers from mode collapse, i.e., the same target compound might be sampled multiple times, and the method is not deterministic, i.e., different runs produce different target compounds. Beam search is, in contrast to multinomial sampling, deterministic but computationally more complex, \({\mathcal{O}}(B\cdot L\cdot | V| )\), where B is the beam size. Beam search retains B unique SMILES strings sorted by their corresponding NLL. Note that for both techniques, the complexity of the underlying transformer model impacts the performance, because SMILES strings are generated iteratively by feeding the transformer with n − 1 tokens to obtain the n-th. For multinomial sampling, the model needs to compute the probabilities of each token in the vocabulary, while for beam search, we need to store the B SMILES subsequences with the most favorable NLL and, similarly to multinomial sampling, compute the probabilities of each token in the vocabulary for each subsequence. Note that beam search is an approximate exhaustive search and might miss compounds with a favorable NLL.

Data preparation

Molecular structures were downloaded from PubChem as SMILES strings. In total, 102,419,980 compounds were downloaded (PubChem grows dynamically; this number reflects the compounds available at the end of December 2021). The dataset was pre-processed as follows (a minimal code sketch of these steps is given after the list):

the SMILES strings were standardized using MolVS (https://molvs.readthedocs.io/en/latest), including the following steps: sanitize, remove hydrogens, apply normalization rules, re-ionize acids, and keep stereochemistry;

all duplicate structures were removed;

all the compounds containing an isotopic label were removed.
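A minimal sketch of these pre-processing steps using MolVS and RDKit is shown below. The default MolVS Standardizer is assumed to approximate the listed standardization steps, and duplicates are identified via canonical SMILES; the exact settings used for the paper may differ.

```python
from molvs import Standardizer
from rdkit import Chem

standardizer = Standardizer()  # sanitizes, removes explicit Hs, normalizes, re-ionizes

def preprocess(smiles_list):
    seen, kept = set(), []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        mol = standardizer.standardize(mol)
        # drop compounds carrying an isotopic label
        if any(atom.GetIsotope() != 0 for atom in mol.GetAtoms()):
            continue
        canonical = Chem.MolToSmiles(mol)  # canonical SMILES used to spot duplicates
        if canonical in seen:
            continue
        seen.add(canonical)
        kept.append(canonical)
    return kept
```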

Starting from the set of pre-processed SMILES strings \({\mathcal{X}}\) (containing 102,377,734 SMILES strings), we constructed a dataset \({\mathcal{D}}=\{(s,t)\in {\mathcal{P}}\,|\,\kappa (s,t)\ge 0.5\}\) containing all the pairs having a Tanimoto similarity κ(s, t) ≥ 0.50. The Tanimoto similarity is calculated with the RDKit35 (version 2022.09.5) Morgan binary fingerprint (ECFP4) with radius 2 and 1024 bits, calculated from the SMILES strings in \({\mathcal{X}}\). The number of molecular pairs in \({\mathcal{D}}\) is 217,111,386,586, which is only 0.002% of the more than \(10^{16}\) possible molecular pairs. The Tanimoto similarity for the molecular pairs can be computed efficiently by storing the fingerprints in a binary matrix of size N × 128 bytes in GPU memory. In this way, the Tanimoto similarity among all possible pairs can be computed by exploiting GPU parallelism: for each SMILES string in \({\mathcal{X}}\) we can calculate the Tanimoto similarity against all the other SMILES strings at once. The calculation of all molecular pairs took 10 days on 16 A100 GPUs with 40 GB of memory.

Starting from \({\mathcal{D}}\), we constructed an additional dataset \({{\mathcal{D}}}^{c}\) where we kept all the pairs having an ECFP4 Tanimoto similarity with counts greater than or equal to 0.50. The major advantage of Morgan fingerprints with counts compared to their binary counterpart is their ability to capture the frequency of a substructure within a molecule. Using count fingerprints makes it possible to differentiate between molecules having the same substructures with different frequencies, and therefore a lower similarity value can be assigned when comparing two such molecules. Supplementary Fig. 2 shows an example of the Tanimoto similarity between two molecules with the binary and count fingerprints. The Tanimoto similarity with count fingerprints cannot be computed as efficiently as with binary fingerprints; therefore, we first construct \({\mathcal{D}}\) and then refine it to obtain \({{\mathcal{D}}}^{c}\): for each pair in \({\mathcal{D}}\) we recomputed the Tanimoto similarity on count fingerprints and kept only the pairs with values greater than or equal to 0.50. \({{\mathcal{D}}}^{c}\) contains 173,911,600,788 pairs. In the “Results” section, we show results for both \({\mathcal{D}}\) and \({{\mathcal{D}}}^{c}\).

Transformers receive a sequence of integers, therefore each SMILES string is first tokenized and each token is then translated into a specific integer. The tokens are collected into a dictionary where keys are tokens and values are integers, that is
$$V=\left\{{*}:0,\ \text{ˆ}:1,\ \$:2,\ {<}{\rm{UNK}}{>}:3,\ \ldots,\ 1:48,\ \ldots,\ {\rm{C}}:60,\ {\rm{Cl}}:61,\ \ldots \right\}.$$
Note that V contains four special tokens: * is the padding token used to enforce the same length of all SMILES, ˆ is the starting token, $ is the ending token, and <UNK> is the unknown token used at inference time if a new token is observed. We constructed V from \({\mathcal{D}}\), which contains 455 different tokens. Supplementary Fig. 1 shows an example of the SMILES string tokenization procedure.

Model training and sampling

Four transformer models were trained: on \({\mathcal{D}}\) and on \({{\mathcal{D}}}^{c}\), each with and without the ranking loss. In our experiments, we employed a standard transformer model11,25 with the following key parameters (a configuration sketch in PyTorch is given after the list):

Number of layers:

— Encoder layers: the model consists of N = 6 identical layers in the encoder stack. Each encoder layer consists of two main sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.

— Decoder layers: Similarly, the decoder stack also contains N = 6 identical layers. Each decoder layer includes an additional sub-layer for multi-head attention over the encoder’s output.

Hidden dimension: the dimensionality of the input and output vectors, denoted dmodel, was set to 256. This dimension is consistent across all layers and serves as the size of the embeddings and of the internal representations within the model.

Feed-forward network dimension: within each layer, the position-wise feed-forward network expands the dimensionality to dff = 2048. This expansion is achieved through two linear transformations with a ReLU activation in between.

Number of attention heads: each multi-head attention mechanism is composed of h = 8 attention heads. The model splits the dmodel = 256 dimensions into 8 subspaces of size 32. These heads allow the model to focus on different parts of the input sequence simultaneously.

Dropout rate: to mitigate overfitting, a dropout rate of 0.1 was applied to the output of each sub-layer before adding the residual connection and layer normalization.
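For illustration, the key parameters listed above correspond roughly to the following PyTorch configuration. This is a schematic sketch, not the authors' implementation: positional encodings and padding masks are omitted, and the vocabulary size (455) and padding index (0) are taken from the vocabulary described earlier.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 455  # tokens observed in the training data
PAD_IDX = 0       # "*" padding token

class Seq2SeqTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6, d_ff=2048, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD_IDX)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=d_ff, dropout=dropout, batch_first=True,
        )
        self.out = nn.Linear(d_model, VOCAB_SIZE)  # projection to token logits

    def forward(self, src_ids, tgt_ids):
        # Positional encodings and padding masks are omitted for brevity.
        src, tgt = self.embed(src_ids), self.embed(tgt_ids)
        causal = self.transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)).to(src_ids.device)
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)  # (batch, tgt_len, |V|), fed to the loss above
```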

Every model was trained for 30 epochs, utilizing 8 A100 GPUs, with each training cycle lasting 30 days. During an epoch, all the source-target molecular pairs in the training set are included once. All models were trained following the same strategy and using the same hyperparameters as in ref. 25, including a batch size of 128, the Adam optimizer, and the original learning rate schedule11 with 4000 warmup steps. Due to the computational time required to train a model, we could not optimize λ (see the “Transformer model with a regularized loss function” section). However, while not necessarily optimal, the value we chose, λ = 10, already highlights (see the “Results” section) the benefits of using the ranking loss when assessing the overall quality of the models.

Once trained, the models can be used to generate target molecules conditioned on a source molecule by predicting one token at a time. Initially, the decoder processes the start token along with the encoder outputs to sample the next token from the probability distribution over all the tokens in the vocabulary. The generation process continues iteratively, producing the next token from the encoder outputs and all previously generated tokens, until the end token is produced or a predefined maximum sequence length (128) is reached. To allow for the sampling of multiple target molecules, beam search is used (see the “Approximately exhaustive sampling of the chemical space” section) and, unless otherwise stated, a beam size of 1000 was used.

Evaluation setup

The models were evaluated on two publicly available datasets: the Therapeutic Target Database (TTD)36 and ChEMBL-series, which consists of compound series from recent scientific publications extracted from the ChEMBL database27. TTD contains clinically investigated drug candidates, which we used to investigate exhaustive sampling of the near-neighborhood of drug-like molecules. The compound series from publications contain only novel molecules that were not part of our training data, resulting in an out-of-distribution set of molecules that we cluster into chemical series based on the publication from which they were extracted.

Each dataset was pre-processed using the strategy described in the “Model training and sampling” section, and compounds that contained tokens not present in the vocabulary V (used to train the models) were removed. A final filtering was applied to both datasets in order to remove peptides and other non-drug-like small molecules. Only compounds that satisfied all of the following criteria were kept (a filtering sketch is given after the list):

Lipinski rule of five compliant37;

molecular weight larger than 300 Da;

fewer than eight ring systems.
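A possible RDKit implementation of these filters is sketched below. The helper names are hypothetical, the rule of five is interpreted here as satisfying all four of Lipinski's criteria, and ring systems are counted by merging rings that share atoms; the authors' exact filter definitions may differ.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def n_ring_systems(mol):
    """Count ring systems by merging rings that share at least one atom."""
    systems = []
    for ring in mol.GetRingInfo().AtomRings():
        ring = set(ring)
        overlapping = [s for s in systems if s & ring]
        for s in overlapping:
            ring |= s
            systems.remove(s)
        systems.append(ring)
    return len(systems)

def passes_filters(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    ro5 = (Descriptors.MolWt(mol) <= 500 and Descriptors.MolLogP(mol) <= 5
           and Lipinski.NumHDonors(mol) <= 5 and Lipinski.NumHAcceptors(mol) <= 10)
    return (ro5
            and Descriptors.MolWt(mol) > 300   # molecular weight above 300 Da
            and n_ring_systems(mol) < 8)       # fewer than eight ring systems
```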

The final TTD dataset contains 821 compounds, while the ChEMBL-series dataset contains 2685 compounds distributed over 200 series with 5 to 60 compounds each. Both curated datasets are released together with the code. Notably, compounds from ChEMBL-series were selected to be distinct from both training sets \({\mathcal{D}}\) and \({{\mathcal{D}}}^{c}\), ensuring no overlap between the sets.

Evaluation metrics

To evaluate the impact of the ranking loss (see Eq. (3)) on a fully trained model, we considered several metrics (for all of them, higher is better; a short computation sketch for RANK SCORE and CORRELATION is given after the list):

VALIDITY: the percentage of target compounds generated by the transformer model that are valid according to RDKit. VALIDITY is calculated by averaging the percentage of valid target compounds sampled for each source compound;

UNIQUENESS: the percentage of unique target compounds generated by the transformer model. In order to evaluate the uniqueness, the generated valid target compounds are canonicalized with RDKit to identify duplicates; UNIQUENESS is calculated by averaging the percentage of valid unique target compounds sampled for each source compound;

TOP IDENTICAL: the number of cases where the target compound with the lowest NLL is identical to the source compound. TOP IDENTICAL allows us to evaluate whether the ranking loss forces the transformer model to generate the source compound as the target compound with the lowest NLL. Note that κ(s, s) = 1 for all possible source compounds s. TOP IDENTICAL is calculated by averaging over the source compounds;

RANK SCORE: Kendall’s tau score τ between the Tanimoto similarity and the NLL for the top ten compounds sampled by beam search. The score measures the correspondence between the two rankings and takes values in the [−1, 1] range, where the extremes denote perfect disagreement and agreement, respectively. Given a source compound s and N generated compounds \({\hat{t}}_{1},\ldots,{\hat{t}}_{N}\) from s, τ is computed as
$$\tau=\frac{2}{N(N-1)}\sum _{i < j}{\rm{sign}}\,(\log p({\hat{t}}_{j}| s)-\log p({\hat{t}}_{i}| s))\,\,{\rm{sign}}\,(\kappa (s,{\hat{t}}_{j})-\kappa (s,{\hat{t}}_{i})).$$
(5)

CORRELATION: the Pearson correlation coefficient between the Tanimoto similarity and the NLL, which measures the linear correlation between the two quantities. It takes values in the [−1, 1] range, where the extremes denote perfect disagreement and agreement, respectively. Given a set of pairs \(P={\{({x}_{i},{y}_{i})\,|\,{x}_{i}\in {\mathbb{R}},{y}_{i}\in {\mathbb{R}}\}}_{i=1}^{N}\), the Pearson correlation coefficient is computed as
$$\rho=\frac{\sum_{i=1}^{N}({x}_{i}-\bar{x})({y}_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{N}{({x}_{i}-\bar{x})}^{2}}\sqrt{\sum_{i=1}^{N}{({y}_{i}-\bar{y})}^{2}}},$$
(6)
where \(\bar{x}\) (and similarly \(\bar{y}\)) is the average of all the xi, i.e., \(\bar{x}=1/N\mathop{\sum }\nolimits_{i=1}^{N}{x}_{i}\). CORRELATION is calculated by averaging over all the sampled target compounds.
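For a single source compound, RANK SCORE and CORRELATION can be computed with scipy (one of the listed dependencies). The sketch below assumes arrays of Tanimoto similarities and NLLs for the sampled targets, compares similarity against the log-likelihood (i.e., −NLL) for Eq. (5), and uses toy numbers in the example; scipy's kendalltau is the tie-corrected tau-b variant, which coincides with Eq. (5) when there are no ties.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr

def rank_score(similarities, nlls):
    # Eq. (5) compares similarity against log-likelihood, i.e., against -NLL.
    tau, _ = kendalltau(similarities, -np.asarray(nlls))
    return tau

def correlation(similarities, nlls):
    # Pearson correlation between Tanimoto similarity and NLL (Eq. (6)).
    rho, _ = pearsonr(similarities, nlls)
    return rho

# Toy example for one source compound and four sampled targets:
sims = [0.92, 0.85, 0.80, 0.74]
nlls = [1.1, 2.3, 2.9, 4.0]
print(rank_score(sims, nlls), correlation(sims, nlls))
```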

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
