Retrosynthesis prediction with an iterative string editing model

Technical preliminaries

Chemical reactions involve reactant molecules, represented by the reactant set R, and the formation of product molecules, represented by the product set P. In this study, we focus on template-free single-step retrosynthesis prediction, which aims to generate the reactant set R for a given product molecule P without relying on pre-defined reaction templates or rules. Chemical reactions may also involve solvents, catalysts, and reagents; we do not consider them in our analysis.

We adopt a string-based representation to encode chemical reactions, using a variable-length string that includes a pair of SMILES notations, one for the reactants and the other for the product compound. To formalize the molecular string editing problem, we introduce a Markov Decision Process \(({\bf{S}},{\bf{A}},{\rm{E}},{\rm{F}},{\bf{s}}^{0})\). In this formulation, a state \({\bf{s}}=({\bf{s}}_{1},{\bf{s}}_{2},\cdots,{\bf{s}}_{L})\in {\bf{S}}\) is a sequence of tokens, where each token \({\bf{s}}_{i}\) is drawn from a predefined vocabulary V. The sequence has length L, and the initial sequence to be refined, i.e., the product string, is denoted by \({\bf{s}}^{0}\). The set of editing actions that can be applied to the sequence is denoted by A. The reward function F is defined as the negative of the distance D between the generated output and the ground-truth sequence \({\bf{s}}^{*}\), given by \({\rm{F}}({\bf{s}})=-{\rm{D}}({\bf{s}},{\bf{s}}^{*})\). In this setup, an agent interacts with an environment E that receives the agent’s editing actions and returns the modified sequence. 
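As a minimal sketch of the reward F(s) = -D(s, s*) defined above, the snippet below instantiates the distance D as the token-level Levenshtein distance. This choice is an assumption made for illustration, since the text here only specifies "a distance D"; the function names `levenshtein` and `reward` are hypothetical, and the toy example uses character-level tokens rather than the tokenizer used in the study.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level edit distance counting insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i tokens of a yields the empty prefix of b
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def reward(state: Sequence[str], target: Sequence[str]) -> int:
    """F(s) = -D(s, s*), with D assumed to be the Levenshtein distance."""
    return -levenshtein(state, target)


# Toy example: product string (initial state) versus ground-truth reactant string.
product = list("CC(=O)OC")       # methyl acetate
reactants = list("CC(=O)O.OC")   # acetic acid + methanol
print(reward(product, reactants))
print(reward(reactants, reactants))
```

Under this assumption the reward is zero exactly when the edited sequence equals the ground-truth reactant string, and it becomes more negative as more edits remain.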
The agent’s behavior is modeled by a policy, π: S → P(A), that maps the current state to a probability distribution over A. At every decoding step, the model receives an input sequence s and selects an editing action a ∈ A to refine it according to the policy π, resulting in a new state \({\rm{E}}({\bf{s}},{\bf{a}})\), i.e., the intermediates or the reactants. The objective is to optimize the policy π to maximize the cumulative reward obtained throughout the sequence refinement process.

Overview of EditRetro

Our proposed model incorporates three editing actions, namely sequence reposition, placeholder insertion, and token insertion, to generate reactant strings. It is implemented by a Transformer model consisting of an encoder and three decoders, all of which are composed of stacked Transformer blocks. Our model enhances generation efficiency through its non-autoregressive decoders: although it incorporates additional decoders to iteratively predict editing actions, EditRetro performs the editing actions in parallel within each decoder (i.e., non-autoregressive generation). Given a target molecule, the encoder takes its string as input and generates the corresponding hidden representations, which are then used as input for the cross-attention modules of the decoders. The decoders also take the product string as input at the first iteration. During each decoding iteration, the three decoders are executed consecutively as shown in Fig. 1b.

Reposition Decoder: The sequence reposition policy (classifier) πrps predicts a value r for each input position. If the value of r is the index of an input token, the token will be placed at the predicted position. If the value of r is 0, the input token will be deleted. The reposition action involves basic token editing operations such as keeping, deleting, and reordering. It can be compared to the process of identifying a reaction center, involving the reordering and deletion of atoms or groups to obtain the synthons.

Placeholder Decoder: The placeholder insertion policy (classifier), denoted as πplh, predicts the number of placeholders to be inserted between adjacent tokens. It plays a crucial role in determining the structure of the reactants, similar to identifying the locations for adding atoms or groups to the intermediate synthons obtained from the sequence reposition stage.

Token Decoder: The token insertion policy (classifier), denoted as πtok, is responsible for generating candidate tokens for each placeholder. It is essential in determining the actual reactants that can be used to synthesize the target product. This process can be seen as analogous to synthons completion, in combination with the placeholder insertion action.
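The three editing actions described above can be sketched on a toy token sequence as follows. This is an illustrative sketch only, not the authors’ implementation: the decoder outputs are hard-coded instead of being predicted by a model, the tokens are atom-level rather than SPE fragments, and the boundary tokens `<s>` and `</s>`, the placeholder symbol `<plh>`, and all function names are assumptions introduced here.

```python
from typing import List, Sequence

PLH = "<plh>"  # hypothetical placeholder symbol


def reposition(tokens: Sequence[str], r: Sequence[int]) -> List[str]:
    """Apply reposition values: r[i] = j > 0 places input token j (1-based)
    at position i; r[i] = 0 deletes the token at position i."""
    if len(r) != len(tokens):
        raise ValueError("one reposition value per input position is required")
    if any(v < 0 or v > len(tokens) for v in r):
        raise ValueError("reposition values must lie in [0, len(tokens)]")
    return [tokens[v - 1] for v in r if v != 0]


def insert_placeholders(tokens: Sequence[str], counts: Sequence[int]) -> List[str]:
    """Insert counts[i] placeholders between tokens[i] and tokens[i + 1]."""
    if len(counts) != len(tokens) - 1:
        raise ValueError("one count per gap between adjacent tokens is required")
    out: List[str] = []
    for tok, n in zip(tokens, list(counts) + [0]):
        out.append(tok)
        out.extend([PLH] * n)
    return out


def fill_placeholders(tokens: Sequence[str], fills: Sequence[str]) -> List[str]:
    """Replace each placeholder, left to right, with the corresponding token."""
    if list(tokens).count(PLH) != len(fills):
        raise ValueError("one token per placeholder is required")
    it = iter(fills)
    return [next(it) if tok == PLH else tok for tok in tokens]


# One decoding iteration for methyl acetate -> acetic acid + methanol.
product = ["<s>", "C", "C", "(", "=", "O", ")", "O", "C", "</s>"]
kept = reposition(product, list(range(1, len(product) + 1)))   # keep every token
with_plh = insert_placeholders(kept, [0, 0, 0, 0, 0, 0, 0, 2, 0])
reactants = fill_placeholders(with_plh, [".", "O"])
print("".join(reactants[1:-1]))  # prints CC(=O)O.OC
```

In this sketch each stage consumes the output of the previous one, mirroring the consecutive execution of the three decoders within a single iteration; deletion and reordering are both expressed through the reposition values.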

This iterative refinement process continues until the termination condition is reached. Detailed model architectures and training strategies can be found in the Methods section.

Datasets and data preprocessing

To evaluate the effectiveness and performance of our proposed method, we conducted experiments on two widely used benchmark datasets: USPTO-50K (ref. 43) and USPTO-FULL (ref. 17). These datasets provide diverse and comprehensive collections of chemical reactions, enabling a thorough evaluation of our model’s capabilities in molecule retrosynthesis. USPTO-50K is a high-quality dataset that contains ~50,000 reactions from the U.S. patent literature. These reactions have accurate atom mappings between products and reactants and have been categorized into 10 distinct reaction types, facilitating detailed analysis and comparison with existing methods. The dataset has been used extensively in previous studies, making it suitable for benchmarking our proposed method against state-of-the-art approaches. For USPTO-50K, we adopt the same split as reported in Coley et al.15 and divide it into a 40K/5K/5K train/validation/test split. USPTO-FULL is a significantly larger chemical reaction dataset, comprising ~1 million reactions. We partition it into ~800K/100K/100K training/validation/test reactions following Dai et al.17. USPTO-FULL serves to validate our model’s performance on a larger and more diverse set of reactions. By conducting experiments on both benchmark datasets, we can assess the performance, generalization ability, and scalability of our proposed method.

To enhance the learning capabilities of the Transformer model and ensure that it generalizes without relying heavily on the syntax rules of SMILES canonicalization, we employ the SMILES augmentation technique. 
In line with the augmentation strategy of previous studies22,24,26,37, we conduct 20 times augmentation on both the training and test sets of USPTO-50 K. Similarly, for the USPTO-FULL dataset, we apply 5 times augmentation on the training and test sets. It is important to note that augmentation is only applied to the products in the test sets, whereas it is applied to both the products and reactants in the train sets. The SMILES structures generated through augmentation maintain validity, as they are generated by randomly selecting the starting atom and graph enumeration direction. This random selection ensures that the augmented SMILES structures are chemically plausible and conform to the necessary rules and constraints. In this study, we performed SMILES canonicalization and augmentation using RDKit44.To obtain SMILES fragments, we employ the SMILES Pair Encoding (SPE) method, which enhances the standard atom-level tokenization approach by incorporating human-readable and chemically explainable SMILES substrings as tokens45. This encoding technique allows for more intuitive and interpretable representations of molecules, facilitating the understanding of chemical transformations. We utilized the SPE tokenizer pre-trained on the ChEMBL dataset, as employed in45, for this study. Furthermore, we conduct alignment between the product and reactants SMILES on the training set, following the method proposed by Zipeng et al.26. This alignment process establishes a correspondence between the product and reactant molecules, enabling the model to capture the relationships and transformations between them effectively. By aligning the SMILES strings, we create a connection between the product and reactants, facilitating the learning of retrosynthetic patterns and improving the model’s ability to generate accurate and meaningful reactant strings. 
The combination of the SMILES Pair Encoding method and the alignment technique enhances the quality and interpretability of the input molecule representations. It enables our model to capture the structural and chemical information present in the SMILES strings and utilize it effectively in the retrosynthesis prediction task.EditRetro generates more accurate reactantsIn evaluating the performance of our proposed EditRetro model for molecule retrosynthesis, we utilize the top-k exact match accuracy as our primary evaluation metric. This metric provides a rigorous assessment by comparing the canonical SMILES of the predicted reactants to the ground truth reactants in the test dataset. By measuring the exact match accuracy, we ensure that the predicted reactants precisely match the ground truth reactants, indicating the model’s ability to generate accurate retrosynthetic predictions. To comprehensively assess the overall performance of EditRetro, we conduct comparative evaluations against a diverse set of state-of-the-art approaches, including template-based, template-free, and semi-template-based approaches. This comparison allows us to gauge EditRetro’s performance in relation to the existing methods and provide insights into its strengths and limitations.The results of top-k exact match accuracy on the USPTO-50K dataset, when the reaction class is not provided, are shown in Table 1. Specifically, EditRetro achieves a top-1 accuracy of 60.8% and a top-3 accuracy of 80.6%. In a more detailed comparison, EditRetro reaches the state-of-the-art performance for template-free methods and exceeds the notable work, i.e., R-SMILES by a margin of 4.5% in top-1 accuracy. Moreover, EditRetro also achieves comparable performance to the baseline models for larger values of k such as k = 5 and 10. 
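The top-k exact match metric used above can be sketched as follows. This is a simplified illustration under stated assumptions: the SMILES strings are taken to be canonicalized already (the study uses RDKit for this step), a reactant set is compared independently of the order of its molecules, and the names `normalize` and `top_k_accuracy` as well as the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def normalize(smiles: str) -> Tuple[str, ...]:
    """Order-independent key for a '.'-separated set of canonical SMILES."""
    return tuple(sorted(smiles.split(".")))


def top_k_accuracy(predictions: Sequence[Sequence[str]],
                   targets: Sequence[str], k: int) -> float:
    """Fraction of products whose ground-truth reactants appear among the
    first k ranked predictions."""
    if len(predictions) != len(targets):
        raise ValueError("one ranked prediction list per target is required")
    hits = 0
    for ranked, target in zip(predictions, targets):
        gold = normalize(target)
        if any(normalize(pred) == gold for pred in ranked[:k]):
            hits += 1
    return hits / len(targets)


# Toy data: two products, each with two ranked reactant predictions.
predictions: List[List[str]] = [
    ["CC(=O)O.OC", "CC(=O)Cl.OC"],
    ["CCBr.N", "N.CCO"],
]
targets = ["OC.CC(=O)O", "CCO.N"]
print(top_k_accuracy(predictions, targets, k=1))
print(top_k_accuracy(predictions, targets, k=2))
```

Because a prediction counts only when every reactant matches exactly, this metric is strict: chemically plausible alternatives that differ from the recorded reactants are scored as misses.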
We also provide a detailed breakdown of the top-1 exact match accuracy of our model across various reaction types in Supplementary Table 1, revealing that the model’s accuracy varies depending on the specific reaction types. A more detailed analysis can be found in Supplementary Notes.Table 1 Top-k exact match accuracy of the proposed EditRetro and baselines on USPTO-50k dataset with reaction class unknownIn addition to USPTO-50K, we further evaluate the performance of our method on the larger and more diverse USPTO-FULL dataset, which poses additional challenges due to its extensive collection of chemical reactions. As shown in Table 2, our method achieves superior performance to all baselines in top-1 accuracy (52.2%). It should be noted that template-based approaches, which rely on predefined reaction templates, often struggle with generalizing to new reaction templates and handling the vast number of templates present in large datasets19. This limitation inherently affects their performance and scalability. However, EditRetro, being a template-free approach, exhibits competitive performance on larger datasets. This highlights its ability to generalize well to diverse reaction types and overcome the limitations associated with template-based methods.Table 2 Top-k exact match accuracy of the proposed EditRetro and baselines on USPTO-FULL dataset with reaction class unknownThe comprehensive results obtained from our evaluations consistently demonstrate the superior capability of EditRetro in generating high-quality reactants for a given product in retrosynthesis. This superiority can be attributed to two key factors: the close correlation between the three editing stages within a single iteration and the model’s ability to self-correct during iterative refinement. 
The close correlation between the sequence reposition, placeholder insertion, and token insertion stages within each decoding iteration enables EditRetro to effectively capture complex and diverse patterns in the data. Furthermore, the self-correcting nature of EditRetro during iterative refinement contributes to its high level of accuracy. The model continuously learns from its previous predictions and adjusts its subsequent predictions accordingly. This self-correction mechanism allows EditRetro to refine and improve the reactant generation process, leading to the generation of high-quality and chemically valid reactants.

EditRetro achieves superior performance in predicting plausible reactants

As there may be multiple candidate reactants that can be used to synthesize the same product, we additionally adopt round-trip accuracy to evaluate the model. Round-trip accuracy was formally proposed in a multi-step retrosynthesis study11 and quantifies the percentage of retrosynthetic predictions considered plausible by a forward prediction model. It is calculated by comparing the given product with the product predicted by a forward reaction model that uses the predicted reactants as input. For this purpose, we utilize the pre-trained Molecular Transformer46 as the oracle forward reaction prediction model, following previous work14,18. We adopt the top-k round-trip accuracy definition used in Retroformer14: \({\rm{RoundTrip}}(k)=\frac{1}{N\times k}\sum_{i=1}^{N}\sum_{j=1}^{k}{\mathbb{I}}(\text{Reach Ground Truth Product})\), where N is the number of molecules in the test set, i indexes the test molecules, and j indexes the ranked predictions for each molecule. The results of RoundTrip accuracy on USPTO-50K are summarized at the top of Table 3. Our model achieves impressive results, with a top-1 accuracy of 83.4% and a top-3 accuracy of 73.6%. Furthermore, even when considering k values of 5 and 10, our method is comparable to most of the baselines. 
This demonstrates the capability of EditRetro to effectively learn chemical rules.Table 3 Top-k RoundTrip and MaxFrag accuracy of the proposed EditRetro and baselines on USPTO-50K dataset with reaction class unknownEditRetro accurately predicts the main reactantsIn addition to the roundtrip accuracy, we adopt the MaxFrag accuracy metric22, inspired by classical retrosynthesis, to assess the exact match of the largest fragment. This metric is specifically designed to address prediction limitations caused by unclear reagent reactions in the dataset. The MaxFrag accuracy focuses on evaluating the accuracy of the largest fragment match, providing a more targeted assessment of the model’s ability to predict the main reactant fragment. This metric is particularly valuable in scenarios where the reactant reactions are not explicitly defined or may exhibit uncertainties. By emphasizing the largest fragment, we aim to mitigate the impact of unclear reagent reactions on the overall performance evaluation. The results of top-k MaxFrag accuracy are shown at the bottom of Table 3. EditRetro exhibits superior performance, outperforming all baselines with an accuracy of 65.3% for top-1 predictions and 83.9% for top-3 predictions. Furthermore, when k is equal to 5 and 10, EditRetro’s performance is also slightly better than that of the baselines.EditRetro offers diverse synthetic solutionsDiversity in predicted reactions is crucial for exploring a broader synthesis space and discovering novel chemical pathways. In our inference module, we incorporate reposition sampling and sequence augmentation to enhance generation diversity. 
It allows for the identification of multiple reaction centers and the consideration of various attachments, enabling EditRetro to generate diverse reactants with distinct scaffolds and structures.To gain a more comprehensive understanding of our model’s predictions, we visually analyze two randomly selected molecules along with the top-10 predictions by EditRetro. The first example, illustrated in Fig. 2a, showcases the synthesis of 5-Bromo-3-(3-pyridinylmethoxy)-2-pyridinamine. EditRetro identifies four distinct reactive sites in this synthesis. The first site corresponds to the oxygen atom, which aligns with the ground truth and includes the top-1, 2, 3, 5, and 7 predictions. The top-1 prediction precisely matches the ground truth, representing a Williamson ether synthesis reaction. Similarly, the top-2 and top-3 predictions involve substituting the chlorine atom with hydroxyl and bromine, respectively. Additionally, the top-5 prediction replaces hydroxyl with bromine. However, the top-7 prediction fails to generate the product within a single step. The second site pertains to the amino group and encompasses the top-4, 9, and 10 predictions. The top-4 prediction leads to product formation through the reduction of nitro compounds to amines. The top-9 prediction involves a Hofmann rearrangement reaction, and the top-10 prediction entails the substitution of aromatic halides with nitrogen nucleophiles. The third site corresponds to the bromine atom and is associated with the top-6 prediction, which represents the halogenation of aromatic compounds. The fourth site involves the nitrogen atom in the pyridine ring. However, its associated top-8 reaction is not plausible for the synthesis of the product. All these predictions, except for top-7 and top-8, have been verified by two chemists and can be successfully reproduced using the Molecular Transformer with high confidence. Therefore, they are considered plausible reactions. 
It is worth noting that the ground-truth reaction only achieves a maximum yield of 29%, as reported in ref. 47. Conversely, all plausible predictions have significantly higher yields as estimated by ref. 48, with an average yield exceeding 65%.

Fig. 2: Top-10 predictions by EditRetro for two random products from the USPTO-50K test set.
a Williamson ether synthesis reaction. EditRetro identifies four distinct reactive sites. All these predictions, except for top-7 and top-8, are plausible for the synthesis of the target product. b Alkylation of amines. EditRetro predicts five distinct reactive sites for the product. All of these predictions, except for top-9, are plausible reactions. Different reactive sites are highlighted with different colors.

In Fig. 2b, the second case exemplifies the synthesis of Benzamide, N,N-diethyl-4-[[4-[(4-methylphenyl)methyl]-1-piperazinyl]-8-quinolinylmethyl]-(9CI, ACI). EditRetro identifies five distinct reactive sites for the product. The first site is consistent with the ground truth, including the top-1, 3, 4, and 6 predictions. The top-6 prediction matches the ground truth, which involves the alkylation of amines. The top-1, 3, and 7 predictions use different compounds, i.e., 4-Methylbenzaldehyde, 4-Methylbenzyl chloride, and 4-Methylbenzyl bromide, respectively. Furthermore, the top-4 prediction can yield the product, although with a lower confidence level of 0.46 using Molecular Transformer. The second site encompasses the top-2 and top-10 predictions, which use an amidation reaction. The third site is linked to the top-5 and top-8 predictions, which correspond to a nucleophilic substitution reaction. The fourth site involves the top-7 prediction, corresponding to a carbonyl reduction reaction. The fifth site is associated with the top-9 prediction, which does not produce the desired product. 
All of these predictions, except for top-9, have been verified by two chemists and can be successfully obtained using the Molecular Transformer.The examples presented above indicate that predicted reactions from EditRetro, rather than ground truth reactions, can still be synthetically valuable and possible. This demonstrates that our method possesses the inherent capability to learn the underlying reaction rules and provide highly rational and diverse predictions. To quantitatively evaluate the diversity of the predictive outcomes, we examine the molecular similarities among them similar to previous work33,38. We calculate the average Tanimoto similarity between each pair of predicted reactants in the top-10 predictions for each product using concatenated ECFP4 fingerprints49. A lower similarity score indicates a higher diversity in the predicted results. Furthermore, we employ the K-means clustering algorithm to group the products based on the similarity of their predicted reactants. As shown in Fig. 3, the predictions in the first four clusters can be considered to have high diversity, as they exhibit lower prediction similarities (0.28, 0.37, 0.41, and 0.46) and account for ~36% of the test set. The predictions in the middle three clusters have medium diversity, as they exhibit average similarities of (0.52, 0.56, and 0.59)and account for nearly 44% of the test set. The predictions in the last three clusters are considered to have relatively low diversity as they exhibit relatively higher prediction similarities (0.63, 0.68, and 0.77). These clusters have a small proportion in the test set and indicate that EditRetro can predict similar reactants in some cases. The average similarity score for the entire test set is 0.55. Overall, these results demonstrate that EditRetro is capable of predicting relatively diverse sets of reactants. We also evaluated the diversity of the predictive outcomes on the larger USPTO-FULL dataset, as demonstrated in Supplementary Fig. 
2. The results show that EditRetro continues to exhibit promising diversity on larger and more diverse reaction datasets. Furthermore, we have evaluated the chemical validity rates produced by EditRetro, and the results are presented in Supplementary Table 2 and Supplementary Notes.Fig. 3: Cluster analysis of predicted reactants on the USPTO-50K test set.The values displayed above the bars indicate the average similarity of the predicted reactants in the cluster, with lower values indicating higher diversity among the predicted reactants. The results demonstrate that EditRetro is capable of predicting relatively diverse sets of reactants. Source data are provided as a Source Data file.EditRetro exhibits superior performance on reactions with chirality, ring-forming, and ring-openingChirality is a fundamental property of asymmetry that plays a critical role in stereochemistry and drug discovery. To assess the ability to handle chirality, we compare the performance of EditRetro and a strong baseline method, R-SMILES, on the USPTO-50K test set for reactions with and without chirality. As illustrated in Fig. 4a, when k = 1, EditRetro achieves better results (55.7% and 61.8%) than R-SMILES (51.6% and 56.7%) for both chiral and non-chiral reactions. These results indicate that EditRetro outperforms R-SMILES in handling chirality, demonstrating its ability to accurately predict the correct chiral configurations. Moreover, EditRetro consistently exhibits superior or comparable performance to R-SMILES across different values of k in terms of top-k accuracies. Notably, both methods exhibit better performance on non-chiral reactions than on chiral ones, highlighting the challenges of handling chirality in retrosynthesis prediction. We attribute EditRetro’s superiority in handling chiral reactions to its edit-based generation approach. In a chemical reaction, products and reactants may share several substructures. 
By generating reactants based on the product’s structure, EditRetro facilitates accurate prediction of chirality during the generation process. These results demonstrate the efficacy and robustness of our edit-based method for retrosynthesis prediction.

Fig. 4: Top-k performance of EditRetro and a string-based baseline R-SMILES on complex reactions.
a Reactions with/without chirality. Both methods show inferior performance on reactions with chirality compared to those without chirality, highlighting the challenge of chiral reactions. EditRetro exhibits superior performance to the baseline in most cases. b Non-ring, ring-opening, and ring-forming reactions. EditRetro demonstrates superior performance over the baseline in most cases for non-ring reactions and consistently exhibits better performance for ring-opening and ring-forming reactions.

Ring-forming and ring-opening reactions are both essential transformations in organic synthesis, with considerable theoretical significance and broad practical applications. Ring-forming reactions allow the synthesis of various cyclic compounds, while ring-opening reactions, such as the epoxide ring-opening reaction, are crucial steps in the synthesis of many organic compounds. To assess the model’s capability for predicting these types of reactions, we compare the performance of EditRetro and the baseline R-SMILES on non-ring, ring-opening, and ring-forming reactions. As depicted in Fig. 4b, both models demonstrate poorer performance on ring-opening and ring-forming reactions than on non-ring reactions. This observation indicates the inherent challenges associated with predicting these specific types of reactions. However, EditRetro consistently outperforms or performs comparably to R-SMILES on all types of reactions. In particular, EditRetro shows significant improvements over R-SMILES for ring-opening and ring-forming reactions. For example, when k = 1, EditRetro outperforms R-SMILES by 5.9% for ring-opening reactions and 5.8% for ring-forming reactions. 
These results further confirm the superiority of our edit-based generation approach over methods that generate structures from scratch. EditRetro leverages the existing product structure to guide the generation of reactants, enabling it to capture the specific requirements and transformations involved in ring-opening and ring-forming reactions. We also compared the effects of the SPE tokenizer with the token-wise tokenizer for our model in Supplementary Fig. 5.Visualization of reasoning processIn this subsection, we present empirical examples to demonstrate our model’s reasoning process and its iterative refinement capability. Our model performs explicit Levenshtein string editing operations in each decoder step, allowing chemists to easily understand the generation process. This transparency enhances trust and utility in the generated results. Additionally, the parallel execution of string editing operations facilitates the efficiency, enabling faster generation and scalability for larger datasets and complex molecules. Moreover, it features the capability of iterative refinement, which allows for self-correction and improvement in subsequent iterations. To evaluate the impact of iterative refinement, we analyze the distribution of refinement iterations for correctly predicted reactions in the test set of USPTO-50 K, focusing on the Top-1 exact match accuracy.As depicted in Supplementary Fig. 3, the analysis reveals that the majority of reactions in the test set achieve accurate predictions after just one refinement iteration, accounting for 80.18% of cases. This highlights the strong performance of our model in generating correct reactants in the initial iteration itself, ensuring efficiency and practicality in real-world applications. For a smaller proportion of reactions, additional refinement iterations are necessary to achieve accurate predictions. 
A further 335 cases (10.67%) require two refinement iterations, while 64 (2.04%) and 29 (0.92%) cases require three and more than three iterations, respectively. These instances represent scenarios where the initial prediction may not fully capture the complexity of the chemical transformations or where further optimization is required for accurate reactant generation. Overall, the distribution of refinement iterations demonstrates that our method achieves high accuracy with efficient generation processes, aided by its iterative refinement capability. We conduct a quantitative evaluation of the inference latency of our model, and the results are presented in Supplementary Table 3.

To gain insights into the reasoning process of our model, we randomly select 3 reactions with different reaction types from the test set of USPTO-50K and visualize the generation process. These examples provide a deeper understanding of how EditRetro generates reactants and demonstrate its robustness in iteratively refining its predictions. The first example in Fig. 5a depicts a Wohl-Ziegler bromination reaction, which involves the allylic bromination of hydrocarbons using N-bromosuccinimide and a radical initiator. EditRetro accurately predicts that all atoms are from the reactants without the need for any reposition operations. During the insertion stage, it accurately predicts the placeholders and generates the final two ground-truth reactants with high probability. EditRetro achieves this by combining several substructures to form Ethyl 1-(2,4-dichlorophenyl)-5-(4-methoxyphenyl)-4-methyl-1H-pyrazole-3-carboxylate (ACI) and generating N-bromosuccinimide from scratch. Figure 5b presents the second example, which showcases two different reactions that can be used to synthesize butyl cinnamate. The top-ranked prediction involves the esterification reaction. 
EditRetro successfully obtains the two reactants by removing the n-Butane fragment and inserting a 1-Butanol fragment while concatenating other fragments in a successive manner. The second-ranked reaction is a Heck reaction. EditRetro identifies another reaction center using a variant of the canonical SMILES and performs deletion and reordering operations on the fragments. This is followed by inserting corresponding fragments to obtain the ground-truth Butyl acrylate and Bromobenzene. This example showcases EditRetro’s ability to handle diverse reaction types and generate reactants through a combination of editing operations. The third one depicted in Fig. 5c is a nucleophilic addition reaction with two iterations. In the first iteration, EditRetro accurately identifies the reaction center but obtains an invalid molecule and an unavailable molecule. In the subsequent iteration, it continues to refine the intermediate molecules based on the predicted reaction center and successfully obtains the ground truth molecules, 4-Hydrazinylbenzonitrile and 4-Oxocyclohexyl benzoate, with a relatively high probability. This example highlights the robustness of our model in iteratively refining its predictions and correcting any self-generated errors.Fig. 5: Retrosynthesis reasoning process by our model.a Wohl-Ziegler reaction. EditRetro reliably identifies the placeholders and tokens to generate the ground truth reactants in one iteration. b Esterification reaction and Heck reaction. EditRetro provides two different reactions to synthesize butyl cinnamate. c Nucleophilic addition reaction. During the first iteration, EditRetro generates an invalid molecule and an unavailable molecule. However, in the subsequent iteration, it is capable of detecting and self-correcting the incorrect generation. The P denotes the probability of the model’s prediction. 
The orange, cyan, yellow, and green boxes correspond to the operations of deletion, reordering, placeholder insertion, and token insertion, respectively. The [No Operation] and [Terminate] states are determined by the model. The corresponding molecule graph editing process is illustrated at the bottom of each example for reference.

Analysis of incorrect predictions

As shown in Fig. 6, EditRetro sometimes produces incorrect predictions. To analyze them in depth, we first conduct a comprehensive performance comparison by evaluating the four different error categories proposed by the baseline model MEGAN in Supplementary Fig. 4. We observe that EditRetro’s top-1 predictions for the first, second, and fourth error categories, i.e., only possible in multiple steps, low yield or side products, and a reactive functional group ignored, are entirely consistent with the ground-truth reactants. Regarding the third error category, namely incorrect chirality, our method also predicts the ground-truth reactants as the top-2 ranked choice. Furthermore, the top-1 prediction is also chemically feasible, identifying another reaction center but correctly handling the chirality. These results further reinforce the high accuracy and reliability of our model.

Fig. 6: Examples of top-10 predictions by EditRetro for different errors.
a Redundant reactant molecule. b Discrepancy in reactive sites. c Chemically infeasible reaction. The atoms highlighted in red indicate the reactive sites.

To further investigate the incorrect predictions, we present three instances of inaccurate reactions by EditRetro within top-10 predictions in Fig. 6. In the first example shown in Fig. 6a, EditRetro accurately identifies the reactive site and produces three reactants. Among these, two molecules align with the ground truth, while one is redundant and does not participate in the reaction. In Fig. 
6b, EditRetro correctly identifies the reactive site in accordance with the ground truth and generates two molecules, out of which one aligns with the ground truth. However, the two molecules are incapable of producing the desired product due to the discrepancy between the reactive site of the molecule Cc1ccc(S(=O)(=O)OS(=O)(=O)[N+](C)(C)C)cc1 and the ground truth molecule. Moreover, this molecule is not available in CAS SciFindern 50 and poses challenges for synthesis. Finally, EditRetro sometimes generates chemically infeasible reactions, as seen in Fig. 6c, where the two molecules are generally unable to react. This indicates a limitation in our model’s ability to accurately assess the feasibility of certain reactions. These examples provide valuable insights into potential areas for improvement in our model. To rectify these inaccuracies, potential future improvements could involve integrating chemical modules capable of determining the reactivity of different reactive sites.EditRetro demonstrates practical utility in multi-step synthesis planningTo assess the practical utility of our one-step prediction method in synthesis planning, we extend EditRetro, which was trained on the USPTO-50K dataset, to enable the design of complete chemical pathways through sequential retrosynthetic predictions. We select four target compounds of significant medicinal importance for our evaluation: Febuxostat51, Osimertinib52, an Allosteric Activator for GPX47, and a DDR1 kinase inhibitor INS015_03753.Febuxostat, which is presented as the first example in Fig. 7a, is a medication for treating gout that selectively inhibits xanthine oxidase without affecting purine synthesis. Our method accurately predicts a three-step pathway for febuxostat, which is identical to the previously reported pathway by Cao et al.51. 
The first step involves ester hydrolysis, followed by the Suzuki cross-coupling reaction between 3-cyano-4-isobutoxyphenyl boronic acid and ethyl 2-bromo-4-methylthiazole-5-carboxylate as reactants.

The second example is Osimertinib, a third-generation EGFR inhibitor for non-small cell lung carcinoma treatment, illustrated in Fig. 7b. The complete five-step synthesis pathway of this drug was proposed by Finlay et al.52, utilizing easily accessible starting materials. In the synthesis pathway for Osimertinib suggested by our model, the first step involves an acylation reaction using acryloyl chloride. Subsequently, the model accurately predicts the reduction of the nitro group in the second step. In the following two steps, the model suggests sequential nucleophilic aromatic substitution (SNAr) reactions to introduce the amino side chain and nitroaniline. Notably, our model deviates from the Friedel-Crafts arylation reported in the literature and instead proposes a Suzuki cross-coupling reaction in the final step, consistent with the baseline Graph2Edits approach, to generate 3-pyrazinyl indole.

The third example is an allosteric activator of glutathione peroxidase 4 (GPX4), whose synthetic pathway, illustrated in Fig. 7c, was reported by Lin et al.7. They predicted the synthetic pathway by enumerating different reaction types with a template-free model. By contrast, our method predicts all five reaction steps within its top four predictions without considering the reaction type, which highlights the advantage of our approach.

The fourth example presents a challenging but interesting retrosynthesis task for the DDR1 kinase inhibitor INS015_037, shown in Fig. 7d. INS015_037 is a potential DDR1 kinase inhibitor designed using generative machine learning methods, which has been experimentally demonstrated to exhibit favorable pharmacokinetics in mice. Using a convergent synthesis approach, Zhavoronkov et al.53 first synthesized two precursors separately and then synthesized INS015_037 in the final step. Our model accurately predicts the convergent synthesis pathway with a high-ranked prediction, consistent with the reported synthesis pathway.

Fig. 7: Multistep retrosynthesis predictions by EditRetro.
a Febuxostat. b The third-generation EGFR (Epidermal Growth Factor Receptor) inhibitor Osimertinib. c An allosteric activator for GPX4. d A DDR1 (Discoidin Domain Receptor 1) kinase inhibitor INS015_037. Distinct colors are used to clearly distinguish the reaction center, as well as the atom and bond transformations, in each step of the reaction. The retrosynthetic pathways generated by EditRetro for the four examples closely align with those reported in the literature, with the majority of predictions ranking within the top two. This demonstrates the practical utility of EditRetro in synthesis planning.

All four examples yield retrosynthetic pathways that closely align with those reported in the literature, with the majority of predictions ranking within the top two. Among the 16 individual steps considered, ten are accurately predicted at rank-1, with the remaining steps predicted at rank-2, -3, -4, -6, and -7. These results underscore the potential of our model for practical retrosynthesis prediction. By providing valuable insights and facilitating the design of efficient synthesis routes, our method holds promise for applications in retrosynthesis planning.
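
The sequential use of a single-step model described in this section can be illustrated with a short sketch. It is not the authors' planning procedure, which the text does not specify. It assumes a simple depth-first search in which a hypothetical `predict` callable stands in for a trained single-step model and returns ranked reactant sets, and a hypothetical `building_blocks` set marks molecules that need no further disconnection. The molecule names in the toy example are placeholders, not real chemistry.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical interface: a single-step predictor maps a product string
# to a ranked list of candidate reactant sets (best candidate first).
SingleStep = Callable[[str], List[Tuple[str, ...]]]
Step = Tuple[str, Tuple[str, ...], int]  # (product, reactants, rank used)


def plan_route(
    target: str,
    predict: SingleStep,
    building_blocks: Set[str],
    max_depth: int = 5,
    top_k: int = 3,
) -> Optional[List[Step]]:
    """Depth-first search for a route from `target` down to building blocks.

    Returns the list of steps taken, or None if no route is found when each
    branch is limited to `max_depth` steps and `top_k` candidates per step.
    """
    if target in building_blocks:
        return []  # nothing left to disconnect
    if max_depth == 0:
        return None
    for rank, reactants in enumerate(predict(target)[:top_k], start=1):
        steps: List[Step] = [(target, reactants, rank)]
        for reactant in reactants:
            sub_route = plan_route(
                reactant, predict, building_blocks, max_depth - 1, top_k
            )
            if sub_route is None:
                break  # this candidate fails; fall through to the next rank
            steps.extend(sub_route)
        else:
            return steps  # every reactant was resolved
    return None


# Toy stand-in for a trained model (placeholder names, assumed for illustration).
TOY_PREDICTIONS: Dict[str, List[Tuple[str, ...]]] = {
    "T": [("X", "Y"), ("A", "B")],  # rank-1 candidate is a dead end
    "A": [("C", "D")],
}


def toy_predict(product: str) -> List[Tuple[str, ...]]:
    return TOY_PREDICTIONS.get(product, [])


route = plan_route("T", toy_predict, building_blocks={"B", "C", "D"})
```

In this toy run the rank-1 candidate for `T` cannot be resolved, so the search falls back to the rank-2 candidate. This mirrors how some steps in a pathway may be recovered only at a lower rank. A practical planner would typically also score candidates and check molecule validity, both of which are omitted here.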
