Details of TamGen

We describe how the 3D structure input is processed, the architectures of the protein encoder, the chemical language model, and the contextual encoder, and the training objective functions.

Preliminaries: Let a = (a1, a2, ⋯ , aN) and r = (r1, r2, ⋯ , rN) denote the amino acids of a binding pocket and their 3D coordinates, respectively, where N is the sequence length and \({r}_{i}\in {{\mathbb{R}}}^{3}\) is the centroid of amino acid i (i indexes the amino acids around the binding site). ai is a one-hot vector like ( ⋯ , 0, 0, 1, 0, ⋯ ), whose length is 20 (the number of possible amino acid types) and whose single 1 is located at the position corresponding to the amino acid type. A binding pocket is denoted as x = (a, r), and we write [N] = {1, 2, ⋯ , N}. Let y = (y1, y2, ⋯ , yM) denote the SMILES string of the corresponding ligand/compound, with length M. Our goal is to learn a mapping from x = (a, r) to y.

Processing 3D input: Each amino acid ai, i ∈ [N], is mapped to a d-dimensional vector via an embedding layer Ea. Following our previous exploration of modeling 3D coordinates61, each coordinate ri, i ∈ [N], is mapped to a d-dimensional vector via a linear mapping. Because a binding pocket can be rotated and translated while its spatial semantics are preserved, we apply data augmentation to the coordinates. That is, in the input layer, for any i ∈ [N],$${h}_{i}^{(0)}={E}_{a}{a}_{i}+{E}_{r}\rho \left({r}_{i}-\frac{1}{N}\sum\limits_{j=1}^{N}{r}_{j}\right),$$
(1)
where (i) Ea and Er are learnable matrices optimized during model training; and (ii) ρ denotes a random roto-translation operation, applied after centering the coordinates at the origin. In this way, the discrete input x is turned into N continuous hidden representations \({h}_{i}^{(0)}\).

Protein encoder: The encoder stacks L identical blocks. The output of the l-th block, i.e., \({h}_{i}^{(l)}\), is fed into the (l + 1)-th block to obtain \({h}_{i}^{(l+1)}\), for any i ∈ [N] and l ∈ {0} ∪ [L − 1]. Each block consists of an attention layer and an FFN layer, a two-layer feed-forward network as in the original Transformer25. To model the spatial distances between amino acids, we propose a new type of distance-aware attention. Mathematically,$$\begin{array}{l}{\tilde{h}}_{i}^{(l+1)}=\sum\limits_{j=1}^{N}{\alpha }_{j}({W}_{v}{h}_{j}^{(l)}),\\ {\alpha }_{j}=\frac{\exp {\hat{\alpha }}_{j}}{{\sum }_{k=1}^{N}\exp {\hat{\alpha }}_{k}},\\ {\hat{\alpha }}_{j}=\exp \left(-\frac{\parallel {r}_{i}-{r}_{j}{\parallel }^{2}}{\tau }\right)({h}_{i}^{(l)\top }W{h}_{j}^{(l)}),\end{array}$$
(2)
where W and Wv are parameters to be optimized, and τ is a temperature hyperparameter that controls how strongly spatial distance attenuates the attention. After that, \({\tilde{h}}_{i}^{(l+1)}\) is processed by an FFN layer to obtain$${h}_{i}^{(l+1)}={\mathtt{FFN}}\left({\tilde{h}}_{i}^{(l+1)}\right).$$
(3)
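To make Eqs. (1)-(3) concrete, the following is a minimal, PyTorch-style sketch of the input embedding with roto-translation augmentation and of one distance-aware encoder block. The use of PyTorch, the class and function names, and the tensor layouts are our own illustrative assumptions, not the released TamGen implementation.

```python
# Minimal sketch of Eqs. (1)-(3); all names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


def embed_pocket(a_onehot, coords, E_a, E_r, training=True):
    """Input layer of Eq. (1): amino-acid embedding plus embedded (augmented) coordinates.

    a_onehot: (N, 20) one-hot amino-acid types; coords: (N, 3) centroids;
    E_a: (20, d) and E_r: (3, d) learnable matrices.
    """
    centered = coords - coords.mean(dim=0, keepdim=True)   # center coordinates at the origin
    if training:
        # rho: a random rotation as data augmentation (a random translation could be added similarly)
        q, _ = torch.linalg.qr(torch.randn(3, 3))
        if torch.det(q) < 0:
            q = -q                                          # keep a proper rotation
        centered = centered @ q.T
    return a_onehot @ E_a + centered @ E_r                  # h_i^(0), shape (N, d)


class DistanceAwareBlock(nn.Module):
    """One encoder block: distance-aware attention (Eq. (2)) followed by an FFN (Eq. (3))."""

    def __init__(self, d, tau=1.0):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)                # W in Eq. (2)
        self.W_v = nn.Linear(d, d, bias=False)              # W_v in Eq. (2)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.tau = tau

    def forward(self, h, coords):
        # h: (N, d) hidden states of the l-th block; coords: (N, 3) amino-acid centroids
        dist_sq = torch.cdist(coords, coords) ** 2                      # ||r_i - r_j||^2
        scores = (h @ self.W(h).T) * torch.exp(-dist_sq / self.tau)     # \hat{alpha} in Eq. (2)
        alpha = torch.softmax(scores, dim=-1)                           # alpha over j, for each i
        h_tilde = alpha @ self.W_v(h)                                   # Eq. (2)
        return self.ffn(h_tilde)                                        # Eq. (3)
```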
The output from the last block, i.e., \({h}_{i}^{(L)}\forall i\in [N]\), gives the final representations of x from the encoder.

The contextual encoder: To facilitate diverse generation, we follow the VAE framework and use a random variable z to control generation diversity for the same input. Given a protein binding pocket x, a compound y is sampled according to the distribution p(y∣x, z; Θ). The contextual encoder (i.e., the VAE encoder) models the posterior distribution of z given a binding pocket x and the corresponding ligand y. The input of the VAE encoder is defined as follows:$${h}_{i}^{(0)}=\left\{\begin{array}{l}{E}_{a}{a}_{i}+{E}_{r}\rho \left({r}_{i}-\frac{1}{N}\sum\limits_{j=1}^{N}{r}_{j}\right),\quad i\le N\quad \\ {E}_{y}{y}_{i-N},\hfill i \; > \, N,\quad \end{array}\right.$$
(4)
where Ey is the embedding of the SMILES tokens. The VAE encoder follows the architecture of the standard Transformer encoder25 and uses vanilla self-attention rather than the distance-aware version, because 3D ligand information is not available. The output from the last block, i.e., \({h}_{i}^{(L)}\forall i\in [N]\), is mapped to the mean μi and covariance matrix Σi of position i via a linear mapping, which is used to construct q(z∣x, y) under the assumption that q(z∣x, y) is Gaussian. The ligand representations, i.e., \({h}_{j}^{(L)}\), j > N, are not used to construct q(z∣x, y).

Chemical language model: The chemical language model follows the Transformer architecture25 and consists of self-attention layers and FFN layers. We pre-train the decoder on 10M compounds randomly selected from PubChem (denoted as \({{{{\mathcal{D}}}}}_{0}\)) with the following objective function, where My is the length of y:$$\min -\sum\limits_{y\in {{{{\mathcal{D}}}}}_{0}}\frac{1}{{M}_{y}}\sum\limits_{i=1}^{{M}_{y}}\log P({y}_{i}| {y}_{i-1},{y}_{i-2},\cdots \,,{y}_{1}),$$
(5)
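As an illustration only, the sketch below computes the objective in Eq. (5) for a mini-batch of tokenized SMILES; the model interface, tokenization, and padding convention are assumptions rather than part of TamGen.

```python
# Minimal sketch of the pre-training loss in Eq. (5). `model` is assumed to map a token
# prefix to next-token logits of shape (batch, length, vocab); `pad_id` marks padding.
import torch
import torch.nn.functional as F


def smiles_lm_loss(model, tokens, pad_id):
    # tokens: (batch, max_len) integer-encoded SMILES strings, padded with pad_id
    logits = model(tokens[:, :-1])                       # predict y_i from y_1 ... y_{i-1}
    targets = tokens[:, 1:]
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
        reduction="none",
    ).view(targets.shape)
    mask = (targets != pad_id).float()
    per_molecule = (nll * mask).sum(dim=1) / mask.sum(dim=1)   # (1 / M_y) * sum_i -log P(y_i | y_<i)
    return per_molecule.mean()                                  # averaged over the batch
```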
The chemical language model is pre-trained on eight V100 GPUs for 200k steps.

After pre-training the chemical language model, a cross-attention module is introduced into the compound decoder, as shown in Fig. 1c (top panel). It takes all \({h}_{i}^{(L)}\) as inputs. Under the VAE variant, during training and compound refinement, the inputs are \({h}_{i}^{(L)}+{z}_{i}^{{\prime} }\), where \({z}_{i}^{{\prime} }\) is sampled from the distribution q(z∣x, y) introduced above. During inference, the inputs are \({h}_{i}^{(L)}+{z}_{i}\), where zi is randomly sampled from \({{{\mathcal{N}}}}(0,I)\).

Training: The training objective is to minimize the following function, in which \({{{\mathcal{D}}}}\) is the training corpus (a collection of (pocket, SMILES) pairs), z in \(\log P(\cdots )\) is sampled from q(z∣x, y), β is a hyperparameter, p(z) denotes the standard Gaussian distribution, \({{{{\mathcal{D}}}}}_{{{{\rm{kl}}}}}\) denotes the KL divergence, and Θ denotes the model parameters:$${\min }_{\Theta,q}\frac{1}{| {{{\mathcal{D}}}}| }\sum\limits_{({{{\bf{x}}}},{{{\bf{y}}}})\in {{{\mathcal{D}}}}}-\log P({{{\bf{y}}}}| {{{\bf{x}}}},z;\Theta )+\beta {{{{\mathcal{D}}}}}_{{{{\rm{kl}}}}}\left(q(z| {{{\bf{x}}}},{{{\bf{y}}}})\parallel p(z)\right).$$
(6)
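A minimal sketch of this objective for a single (pocket, SMILES) pair is given below, assuming a diagonal-Gaussian posterior q(z∣x, y) parameterized by per-position means and log-variances from the contextual encoder; the decoder interface and all names are illustrative assumptions.

```python
# Minimal sketch of Eq. (6) for one training pair; `decoder.neg_log_likelihood` is an
# assumed interface returning -log P(y | x, z) given the conditioning states and target tokens.
import torch


def tamgen_vae_loss(decoder, pocket_repr, mu, logvar, target_tokens, beta):
    # Reparameterized sample z' ~ q(z|x, y); one latent vector per pocket position
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    # Reconstruction term: the decoder cross-attends to h^(L) + z'
    recon_nll = decoder.neg_log_likelihood(pocket_repr + z, target_tokens)
    # Closed-form KL(q(z|x, y) || N(0, I)) for a diagonal Gaussian
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)
    return recon_nll + beta * kl
```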
The first term in Eq. (6) enables the model to learn how to generate a reasonable ligand based on the input pocket. The second term, the KL-divergence constraint, ensures that the learned latent space resembles the prior distribution p(z), which promotes smooth interpolation and generalization. It regularizes the encoder by pulling the approximate posterior distribution (the encoder's output) closer to the prior distribution (typically a simple Gaussian), which helps prevent overfitting and encourages meaningful latent representations.

Implementation details

For the results in Fig. 2, to enable a fair comparison with previous methods such as Pocket2Mol41 and TargetDiff19, we use the same data as they do. The data are filtered from CrossDocked43 and contain about 100k target-ligand pairs. For inference, z is sampled from a multivariate standard Gaussian distribution. Both the pocket encoder and the VAE encoder have 4 layers with hidden dimension 256. The decoder has 12 layers with hidden dimension 768. We use the Adam optimizer62 with an initial learning rate of 3 × 10−5. For generating the compound database for Tuberculosis (TB), we use an augmented dataset that combines the CrossDocked database and the Protein Data Bank (PDB), cumulatively accounting for approximately 300,000 protein-ligand pairs. Specifically, pocket-ligand pairs were extracted from about 72,000 PDB files. A pocket is defined on the basis of spatial proximity: if any atom of an amino acid is less than 10 Å away from any atom of the ligand, that amino acid is taken as part of the pocket (a minimal sketch of this extraction is given after the list of baselines below).

The phenotype screening predictor Ligandformer

We utilized an adapted version of the Graph Neural Network (GNN) model proposed in ref. 55 to predict potential phenotypic activity. Compared with traditional GNNs, our model propagates the output of each layer to all subsequent layers for enhanced processing. We implemented a 5-layer architecture. The phenotypic predictor was trained on a dataset of 18,886 samples gathered from a variety of sources, including ChEMBL, published datasets, and academic literature as compiled by ref. 63. At the inference stage, we interpreted an output value exceeding 0.69 (a threshold determined based on validation performance) as indicative of a positive sample.

Baselines and evaluations

Baselines

We mainly compare our method with the following baselines:
1.
3D-AR40, a representative deep learning baseline that uses a graph neural network to encode the 3D pocket information and directly generates the 3D conformation of candidate drugs. Atom types and coordinates are generated sequentially. 3D-AR does not explicitly predict the position of the next atom; instead, it uses MCMC for generation.
2.
Pocket2Mol41 is an improved version of 3D-AR, with dedicated modules to predict atom types, coordinates, and bond types.
3.
ResGen42 is another autoregressive method that generates compounds directly in 3D space. Compared with Pocket2Mol, ResGen uses residue-level encoding, whereas Pocket2Mol uses atom-level encoding.
4.
TargetDiff19 utilizes diffusion models to generate compounds. In contrast to the methods above, all atom types and coordinates are generated simultaneously and refined iteratively until a stable conformation is obtained.
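As referenced in the implementation details above, pockets for the TB compound database were extracted with a 10 Å any-atom distance criterion. The following is a minimal sketch of such an extraction using Biopython; the choice of Biopython, the identification of the ligand by residue name, and the function name are illustrative assumptions, not the exact pipeline used.

```python
# Minimal sketch of pocket extraction: keep every amino acid with at least one atom
# within `cutoff` angstroms of any ligand atom. Biopython is an illustrative choice.
import numpy as np
from Bio.PDB import PDBParser
from Bio.PDB.Polypeptide import is_aa


def extract_pocket(pdb_path, ligand_resname, cutoff=10.0):
    structure = PDBParser(QUIET=True).get_structure("complex", pdb_path)
    residues = list(structure.get_residues())
    ligand_xyz = np.array([atom.coord
                           for res in residues if res.get_resname() == ligand_resname
                           for atom in res])
    pocket = []
    for res in residues:
        if not is_aa(res, standard=True):
            continue
        res_xyz = np.array([atom.coord for atom in res])
        # pairwise distances between this residue's atoms and all ligand atoms
        dists = np.linalg.norm(res_xyz[:, None, :] - ligand_xyz[None, :, :], axis=-1)
        if dists.min() <= cutoff:
            pocket.append(res)
    return pocket
```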
TamGen without pre-training

To assess the impact of pre-training, we introduce a TamGen variant without pre-training, in which the compound generator is initialized randomly. We observed overfitting when a 12-layer chemical language model was used in the non-pre-trained variant. After evaluating 4, 6, 8, and 12 layers based on validation performance, we found that a 4-layer model yielded the best results.

Mean reciprocal rank (MRR)

The mean reciprocal rank (MRR)64 is a widely used way to summarize the performance of a method across different metrics. Denote the rank of a method on metric i as ri. The MRR of the method is then defined as \(\frac{1}{N}{\sum }_{i=1}^{N}\frac{1}{{r}_{i}}\), where N is the total number of evaluation metrics considered.

Fused rings

In this work, fused rings denote a structural element in compounds where two or more rings share at least one common bond. The size of the largest group of such “fused” rings within a molecule is reported as the number of fused rings. In Fig. 2d, from left to right, the numbers of fused rings of the four compounds are 2, 5, 4, and 4, respectively.

Experimental details

Peptidase activity assay

The ClpP1P2 complex in Mtb catalyzes the hydrolysis of small peptides. Following previous protocols, we measure the in vitro inhibition of ClpP peptidase activity by monitoring the cleavage of the fluorogenic peptide Ac-Pro-Lys-Met-AMC65,66,67. 0.4 μL of candidate inhibitors, Bortezomib, or DMSO control is added to a black flat-bottom 384-well plate with an Echo®20 Liquid Handler and mixed with 20 μL enzyme buffer (final ClpP1P2 dimer concentration 50 nM; reaction buffer: 30 mM PIPES (pH 7.5), 200 mM NaCl, and 0.005% Tween 20). The solution is pre-incubated at room temperature for 2 hours. Then, 20 μL substrate buffer containing Ac-Pro-Lys-Met-AMC is added (final Ac-Pro-Lys-Met-AMC concentration 10 μM; the reaction buffer is the same as above). Fluorescence (Ex/Em: 380/440 nm) is recorded for 120 min at 37 °C.

Single-dose response measurement

Inhibition rates of compounds were determined from relative fluorescence units (RFU) compared with the Bortezomib control68,69 and the DMSO control, and are defined as follows:$${{{\rm{Inhibition}}}}\,{{{\rm{Rate}}}}=\frac{{{{\rm{RFU}}}}\,({{{\rm{test}}}})-{{{\rm{RFU}}}}({{{\rm{DMSO}}}})}{{{{\rm{RFU}}}}({{{\rm{bortezomib}}}})-{{{\rm{RFU}}}}({{{\rm{DMSO}}}})}\times 100\%.$$
(7)
Here, the fluorescence of DMSO is taken as no inhibition (0%), and the fluorescence of Bortezomib is taken as complete inhibition (100%). Compounds with inhibition rates greater than 20% at 20 μM are considered hits.

Dose-response assay and IC50 determination

To determine IC50, candidate inhibitors are assayed at 9 or 10 gradient concentrations. A dilution series of candidate inhibitor, Bortezomib, or DMSO is prepared starting from a maximum concentration of 100 μM, with each subsequent concentration being one half or one third of the previous one (2-fold or 3-fold dilution gradient). IC50 is determined from the change in recorded fluorescence (RFU) across the gradient of inhibitor concentrations. A non-linear fit (log(inhibitor) vs. normalized response) is used for IC50 curve fitting.

Compound generation in Design and Refine stages for ClpP

Compound generation

Given a complex crystal structure with a protein receptor and a ligand, the center of the ligand is denoted as c. For each residue i of the protein, if its centroid pi satisfies ∥c − pi∥ ≤ τ, i.e., it lies within a pre-defined distance cutoff τ of the ligand center c, then residue i is included in the pocket.

For the ClpP complex, we first designed compounds based on the published complex structure (PDB ID 5DZK). We used two values of τ, 10 Å and 15 Å; in this way, multiple binding sites can be extracted. We used beam search with beam size 20 to generate compounds. The β of the VAE was set to 0.1 or 1. We initialized compound generation with 20 unique random seeds, ranging from 1 to 20. After removing duplicate and invalid generated compounds, we obtained 2.6k unique compounds.

During the subsequent Refine stage, in addition to the binding pocket information, we included guiding information encoded in 4 representative compounds and 3 experimentally discovered compounds exhibiting weak inhibition activity. The parameter τ was set to 10 Å, 12 Å, and 15 Å. We used beam search with beam sizes of 4, 10, and 20 for compound generation. The β parameter of the VAE was set to 0.1 or 1. We initialized compound generation with 100 unique random seeds, ranging from 1 to 100. After removing duplicates and invalid compounds, we obtained a total of 8.4k unique compounds.

UMAP visualization

Compounds are converted to 1024-dimensional vectors with the function GetMorganFingerprintAsBitVect from RDKit. The UMAP transformation70 is performed with parameters n_neighbors=20, min_dist=0.7, and metric=jaccard (a minimal sketch follows the docking description below).

Ligand docking to protein target

The SMILES of generated compounds were converted to 3D structures with the Open Babel program. Subsequently, AutoDock Tools was used to add hydrogens and assign Gasteiger charges to both the converted 3D compounds and the protein 5DZK downloaded from the RCSB PDB before docking. Maps centered on the 5DZK ligand were defined with the program AutoGrid, and the grid box was generated with 20 × 20 × 20 points and 1 Å spacing. Molecular docking was performed with the AutoDock Vina program with default settings. The predicted binding poses were visualized with the PyMOL program.
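For the fingerprint and UMAP steps above, a minimal sketch is shown below, assuming the rdkit and umap-learn packages; the Morgan radius of 2 and the function name are illustrative assumptions, whereas the UMAP parameters follow the text.

```python
# Minimal sketch of the Morgan-fingerprint + UMAP embedding; the radius of 2 is an assumption.
import numpy as np
import umap
from rdkit import Chem
from rdkit.Chem import AllChem


def umap_embed(smiles_list):
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024)   # 1024-bit Morgan fingerprints
           for m in mols if m is not None]
    X = np.array([list(fp) for fp in fps])                           # (n_compounds, 1024) 0/1 matrix
    reducer = umap.UMAP(n_neighbors=20, min_dist=0.7, metric="jaccard")
    return reducer.fit_transform(X)                                   # 2D coordinates for plotting
```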
Pocket-ligand pair shuffling experiment

In this experiment, we randomized the (pocket, ligand) pairs and re-trained a TamGen model, denoted as TamGen-r. Specifically, given the training dataset \({{{\mathcal{D}}}}={\{({{{{\bf{x}}}}}_{i},{{{{\bf{y}}}}}_{i})\}}_{i=1}^{N}\), we shuffled the indices I = {1, 2, ⋯ , N} using the Python call random.shuffle(I) to obtain the shuffled indices {ς(1), ς(2), ⋯ , ς(N)}. We thus obtained a randomized dataset \(\tilde{{{{\mathcal{D}}}}}={\{({{{{\bf{x}}}}}_{\varsigma (i)},{{{{\bf{y}}}}}_{i})\}}_{i=1}^{N}\) and then followed the same training and inference procedures as standard TamGen to obtain TamGen-r.

Ablation studies on self-attention layers

For the TamGen variant without distance-aware attention (denoted TamGen w/o dist_attn), the third equation in Eq. (2) is replaced with$${\hat{\alpha }}_{j}={h}_{i}^{(l)\top }W{h}_{j}^{(l)}.$$
(8)
For the TamGen variant without data augmentation on the coordinates (denoted TamGen w/o coord_aug), Eq. (1) is replaced with$${h}_{i}^{(0)}={E}_{a}{a}_{i}+{E}_{r}\left({r}_{i}-\frac{1}{N}\mathop{\sum }_{j=1}^{N}{r}_{j}\right).$$
(9)
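In terms of the encoder sketch given after Eq. (3), these two ablations correspond to the following illustrative changes; the names follow that sketch and are assumptions, not the released code.

```python
# Illustrative counterparts of the two ablations, reusing names from the earlier sketch.

def scores_without_distance(h, W):
    """TamGen w/o dist_attn, Eq. (8): plain bilinear scores, without the exp(-||r_i - r_j||^2 / tau) factor."""
    return h @ W(h).T          # W is the nn.Linear module from the earlier sketch


def embed_coords_without_augmentation(coords, E_r):
    """TamGen w/o coord_aug, Eq. (9): centered coordinates embedded directly, no random rotation rho."""
    centered = coords - coords.mean(dim=0, keepdim=True)
    return centered @ E_r
```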
Criteria for additional compound selection

The selection criteria for the additional 8 compounds were as follows:
1.
The compound pool was composed of the 296 compounds that met the criteria established by our docking results and phenotypic AI model filters.
2.
These 296 compounds were clustered by MCS-based clustering in the StarDrop software, with an outlier cutoff threshold of 0.5. This process resulted in 34 clusters and 13 outliers.
3.
From each cluster, we selected the top 10% of compounds with the best docking scores (see the sketch after this list).
4.
The shortlisted compounds were subjected to a manual review step to assess their synthetic feasibility. Ultimately, 8 compounds were selected for synthesis.
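As referenced in step 3 above, a minimal sketch of the per-cluster selection is shown below, assuming a table with one row per compound and that more negative AutoDock Vina scores indicate better predicted binding; the column names are illustrative.

```python
# Minimal sketch of selecting the best-scoring ~10% of compounds from each cluster.
import pandas as pd


def top_decile_per_cluster(df):
    # df columns (assumed): "compound_id", "cluster", "vina_score"
    def pick(group):
        k = max(1, int(round(0.10 * len(group))))
        return group.nsmallest(k, "vina_score")   # lower (more negative) Vina score = better
    return df.groupby("cluster", group_keys=False).apply(pick)
```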
Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.