Improving docking and virtual screening performance using AlphaFold2 multi-state modeling for kinases

Structural state distributions of experimental kinase structures and AlphaFold2 predicted modelsThe conformational change of a protein structure could be an obstacle to SBVS. The active site of kinases forms diverse conformations depending on the type of binding compound. For example, the DFGin state binds to type I inhibitors, while the DFGout state binds to type II inhibitors. Dunbrack and his colleagues introduced a classification rule for kinase structures called KinCoRe11. This rule categorizes kinase structural states into 12 types based on the spatial state of the activation loop and the dihedral angle of the DFG motif. Details of the criteria can be found in the Methods section and in Modi et al.11.The blue bar in Fig. 2 shows the distribution of kinase conformational states annotated by the KinCoRe scheme11 in the PDB database. More than half (53.6%) of experimentally determined kinase structures are classified as having the DFGin-BLAminus conformation. Details of the structure distribution are shown in Supplementary Table S1. Other than this major state, the remaining states each occupy less than 10% of PDB structures. Fig. 2Distribution of crystal structures and standard AF2 structures for each kinase conformation. The X-axis is conformational states annotated by KinCoRe and the y-axis is the percentage of each state. All human crystal kinase structures deposited in RCSB-PDB, DUD-E target protein crystal structures, predicted AF2 models with default parameters (standard AF2), and AF3 are colored blue, orange, green, and red, respectively.The thermodynamic stabilities of the conformational states might influence the preference for the DFGin state in experimentally determined structures. Meng et al.13 studied the transition between the structural states of c-Abl and c-Src kinases using umbrella sampling and the potential of mean force. With a difference of 1.4 kcal/mol, the DFGin state of c-Abl has a lower Gibbs free energy than the DFGout state. Their calculations show that the active conformation (DFGin) is the dominant population, while the DFGout state occupied only 9% in their simulation. Similarly, for c-Src, the DFGin state is more favored than the DFGout state. The free energy difference between the states is calculated as 5.4 kcal/mol. The authors also studied the thermodynamic barrier of the transition between the states14. It is estimated to be on the order of 2–4 kcal/mol, making the transition from the highly populated DFGin state to the DFGout state difficult. Levy and his colleagues15 collected 2,896 kinase structures and multiple sequence alignments and applied a Potts model to predict structural propensities from sequences. With this statistical potential, most kinases are predicted to have preferences for the DFGin state. The penalty for forming the DFGout state reaches 2–3 kcal/mol in extreme cases.Not only thermodynamic preference, but also the distribution of kinase inhibitor types might affect the skewness of kinase structures. We counted the number of compounds in kinase inhibitor types from PKIDB (assessed August 2023)16. PKIDB is a curated database of kinase inhibitors in clinical trials. Out of 369 compounds in the database, only 84 molecules have their inhibition type annotated, because annotating the ligand type requires the complex structure. Type I inhibitors (DFGin bound) are the dominant form, occupying 66% (55 compounds) of the annotated inhibitors. In contrast, 17 molecules are labeled as type II (DFGout bound). Thus, the DFGin conformation (active state) might have a higher chance of being crystallized than the DFGout state. The biased conformational states of kinases would make it harder to discover chemically diverse kinase inhibitors.We also examined the conformational state distribution of 25 kinase DUD-E targets17, benchmarked throughout this study, in the PDB database (Fig. 2, orange bar). DUD-E is a widely used benchmark dataset for evaluating virtual screening methods, composed of pharmaceutically important targets such as GPCRs and kinases. The experimental structures are less biased than all PDB structures; about 30% of DUD-E proteins have the DFGin-BLAminus conformation. The other states tend to have a higher proportion than in the all-human kinase structure distribution. The difference in state distribution between all kinases and DUD-E targets potentially indicates that the highly biased nature of kinase structures might not be suitable for discovering diverse types of kinase inhibitors.From an MSA of a given sequence, AF28 extracts coevolution information using the MSA Transformer and predicts the three-dimensional structure based on this information and deep-learning models trained on existing protein structures. We modeled 25 kinase catalytic domains provided in the DUD-E kinase subset with the default parameters of AF2 (standard AF2), resulting in 125 structures (five models per target), then assigned the conformational states of the models using KinCoRe. The average plDDT and MolProbity18 scores of the predicted structures are 89.38 and 1.04, respectively, implying that they are properly modeled. Out of 125 predicted models, 91 structures (72%) are annotated as having the DFGin-BLAminus conformation (Fig. 2, green bar). The distribution of predicted models is more skewed than all human kinase structures and the DUD-E set proteins. However, the other states have a lower proportion than the experimental structures. The standard AF2 did not produce DFGin-ABAminus, DFGin-BLBtrans, DFGinter-BABtrans, and DFGout-Unassigned conformations, which occupy a small portion of the human kinase PDB structures (7.3%, 3.0%, 0.3%, and 3.2%, respectively, Supplementary Table S1). Since the experimentally determined kinase structures are biased towards DFGin-BLAminus, predicted kinase structures by AF2 might have a tendency to produce biased conformations, which is also observed in previous research10. In addition to the biased trained models, the template selection process in AF2 modeling does not take the structural state of the kinase into account.AF3 also shows a similar trend to AF2. The red bar in Fig. 2 shows the distribution of conformational states of AF3 structures. About 78% of AF3 structures were classified as DFGin-BLAminus, which is the major state in both crystal structures and AF2 modeled structures. This suggests that AF3 could not fully explore the diverse conformational landscape of kinases, despite the modeling part being replaced with a diffusion model.Predicting state-specific kinase structures using multi-state modeling protocolThe MSM AF2 protocol provides conformational state-specific structures as templates for structure prediction. All human kinase structures from the KLIFS database19 were classified by their structural state following KinCoRe rules11. For each state, the five structures with the highest sequence similarity to a query sequence were selected as templates for modeling. AF2 was then executed with this template information to predict five models for each template. Models with conformational states different from the given template structures were removed, and the model with the highest plDDT was selected for our benchmark. The overall workflow is illustrated in Fig. 3. Details of the kinase MSM protocol are elaborated in the Methods section. Fig. 3Workflow of multi-state modeling of kinases.We prepared two template sets: a trivial template (TT) set containing 100% sequence-identical structural templates, and a nontrivial template (NT) set not having identical proteins. With our MSM protocol, AF2 could model structures in specific states, producing on average 8.5 and 8.2 models with the TT set and NT set, respectively. The number of predicted models for each target ranges from three (WEE1) to 11 (ABL1 and LCK, Supplementary Table S2). The average plDDT scores of structures are 89.63 and 88.08 for models from the TT set and NT set, respectively, showing comparable values to the standard AF2 predictions (89.38). As expected, TT set models have higher accuracy on average than NT set models, but their difference is marginal. The distribution of the MolProbity score is given in Supplementary Fig. S1.The quality of models for each structural state was also examined. The average plDDT values range from 86.62 (DFGin-BLAplus) to 94.48 (DFGin-BLAminus), indicating that the quality of the models does not depend heavily on the structural state of kinases. The highest plDDT score for DFGin-BLAminus might be due to the highest frequency of this state in the PDB database (Supplementary Table S1). The Pearson’s correlation coefficient between the average plDDT and percentage in the PDB structure of each structural state is 0.739. Even though the MSM protocol provides a structural template for AF2, the model quality might be influenced by AF2’s pre-trained models, so the average plDDT of each model follows the distribution of the crystal structures.The TT set models and the NT set models have average MolProbity scores of 1.21 and 1.24, respectively. These figures represent a modest decline in quality compared to the standard AF2 models (1.04). We also examined each structural state’s average MolProbity score. The lowest value is 1.06 (DFGin-ABAminus), and the highest is 1.33 (DFGinter-Unassigned). For DFGin-BLAminus, the most populated state for both PDB and standard AF2, MSM models show an average MolProbity score of 1.08. The average MolProbity score and the distribution in the PDB structure have a Pearson’s correlation coefficient of -0.59, indicating a weak negative correlation. The high MolProbity scores might be associated with the states that are less populated in the PDB.To investigate the accuracy of the models, we measured the TM-Score20 of predicted models against crystal structures given in the DUD-E set, referred to as ‘reference structures’ throughout this paper. TM-Score assesses structural similarity between two given proteins, ranging from zero to one (identical). For MSM models, we used the predicted structures in the same structural state as the reference. The average TM-Scores are 0.92 and 0.90 for the models predicted using TT and NT sets, respectively. Compared with standard AF2 models (average TM-Score: 0.87), the MSM technique provided more accurate models than the standard AF2 protocol, which is expected since MSM provides structural templates. Models with TT sets generally have more accurate structures than those with NT sets, as also expected.Consequently, by utilizing structures that represent a variety of states for the target kinase, the MSM could generate diverse structures as desired with high accuracy. Thus, it might provide an appropriate structure set for kinase ensemble SBVS.Cognate docking accuracy of a compound to the multi-state modelled structuresTo examine whether modeled structures are suitable for molecular docking and thus structure-based virtual screening, we first conducted a cognate docking experiment on crystal structures, standard AF2, AF3, and MSM models. Ligands from the 25 reference structures were used for this benchmark. For the standard AF2 model, the highest plDDT model for each protein, which is also used for performing virtual screening in the next section, was selected for evaluation. The same selection rule was applied to AF3. For evaluating the MSM model, we used the structure with the same KinCoRe annotation as the reference structure. Since the crystal structure of IGF1R was not assigned any of the structural states by KinCoRe, it was removed, reducing the number of target proteins to 24. AutoDock-GPU21 was employed to predict 50 binding poses. The RMSDs of all predicted poses to the crystal binding pose were calculated. For analysis, we considered three values: the RMSD of the best AutoDock score pose, that of the closest pose to the crystal binding mode, and the average RMSD of all 50 poses.Table 1 summarizes the docking accuracy evaluation results. To define docking success, an RMSD cutoff of 2.0 Å, a standard criterion for judging docking success22,23,24, was applied. Among the benchmarked structures, the crystal structures provided in the DUD-E set showed the best performance for both RMSD values and number of successful cases, as expected. Table 1 Docking accuracy benchmark result. The values are the average values of 24 proteins and the numbers in the parentheses are the number of success cases with RMSD < 2 Å.Considering the best AutoDock score models, the standard AF2, AF3, MSM structures modeled with TT, and NT have average RMSDs of 2.74 Å, 3.62 Å, 2.15 Å, and 3.49 Å, respectively. In addition, the number of successful cases is 11 (standard AF2), 7 (AF3), 16 (with TT), and 9 (with NT) out of 24 receptors. Individual RMSD values are given in Supplementary Table S3.Comparing the MSM models with AF2 and AF3 models, the multi-state models with TT sets have the most accurate docking poses. As observed in the previous section, template information influenced the quality of the docking poses; i.e., the predicted docking poses of the models with TT sets have smaller RMSDs and more successful cases than those with NT sets. One successful example is AKT2 (Fig. 4). Although the TM-scores of MSM with TT set (0.98) and standard AF2 (0.96) are similar, the best AutoDock score model docked to the MSM using TT sets has a more accurate binding pose than standard AF2, with RMSDs of 0.86 Å (Fig. 4B) and 3.24 Å (Fig. 4C), respectively. Fig. 4Predicted cognate docking structures of AKT2. (A) Superimposed structures of crystal structure (PDB ID: 3D0E, grey), MSM model (DFGin-BLAminus, magenta; TM-Score: 0.98), and standard AF2 (cyan; TM-Score: 0.96). (B) Predicted docking poses of the cognate ligand on the AF2 with MSM. The RMSD of the predicted pose is 0.86 Å. (C) Predicted docking poses of the cognate ligand on the standard AF2. The RMSD of the predicted pose is 3.24 Å.Even when docking is not successful (RMSD > 2 Å), MSM structures were able to retrieve the interactions between protein and ligand found in the crystal structure for some cases. Out of eight cases, the interacting residues in the reference structures were successfully captured more than 50% in five proteins analyzed by PLIP25 (Supplementary Table S4). For example, the predicted binding pose of the best AutoDock score conformation of PRKCB cognate docking had an RMSD of 2.79 Å when the MSM with TT structure was used as the receptor structure. However, out of eleven interacting residues identified in the reference structure, ten residues were retrieved in the MSM-ligand complex model. By analyzing the interaction pattern, the predicted docking pose to the multi-state modeled structure shows hydrophobic interactions with L348, F353, V356, A369, K371, A483, and D484 and hydrogen bonding with T404, E421, and V423. These interactions are also observed in the crystal structure (Supplementary Fig. S2). The cognate docking benchmark results suggest that MSM models could be used to predict binding poses of kinase-ligand complexes, thus making them suitable for virtual ensemble screening.Similar trends are observed for the lowest RMSD conformations and average RMSD of 50 conformations. Among the predicted structures, MSM with TT shows the most accurate models (average of the lowest RMSD pose: 1.40 Å, success cases: 19) and the standard AF2 performs slightly lower (average of the lowest RMSD pose: 1.50 Å, success cases: 20). AF3 also showed worse performance, with an average RMSD of 2.01 Å and 14 success cases. The docking accuracy of MSM with NT sets is the worst among the receptor structure sets (average of the lowest RMSD pose: 2.14 Å, success cases: 14). In most cases, the best AutoDock score conformations do not match the lowest RMSD conformations, which means that the AutoDock score could not find the optimal docking poses. The average RMSD of 50 conformations follows the same order as the other two metrics: MSM with TT is the smallest, and MSM with NT is the highest.Virtual screening performance with multi-state modelsTo investigate the advantage of using the MSM technique for SBVS, the compound library for each kinase protein from DUD-E was docked to structures with diverse states generated using our method and compared to crystal structures, standard AF2, and AF3 structures, using AutoDock-GPU. For ensemble docking using models generated by MSM and AF3, since a molecule was docked to multiple receptor structures and thus had multiple docking scores, we needed to decide on representative scores of the compounds to rank the molecules. We employed two representative scores: the AutoDock best (ADB) and the Boltzmann-weighted (BW) scores. The ADB score picks the lowest AutoDock score among the docked results, while the BW score is a weighted average of all scores. The details of the BW score are given in the Methods section.Table 2 shows the performance of SBVS using the various receptor structure sets. To evaluate the performance, five metrics are used: enrichment factors at 1%, 5%, and 10% (EFX%), area under ROC curve (AUC), and Boltzmann-enhanced discrimination of receiver operating characteristic (BEDROC). EF at X% indicates the capability of finding active molecules within the top X%, and AUC represents the discrimination power of a screening method between active and decoy molecules. BEDROC puts exponential weights on the early rank of molecules, thus it can solve the ‘early recognition problem’ caused by AUC26. Details of the metrics are illustrated in the Methods section. As observed in the cognate docking benchmark, crystal structures showed the best performance in most metrics. Also, MSM structures are better than or equal to standard AF2 and AF3 results, regardless of the scoring method or the template set for modeling. Details of individual results are provided in Supplementary Table S5. Comparing EF1% values target-by-target, the MSM performed better than or equal to standard AF2 in 18 proteins out of 25 targets (72%) using the ADB score screened to the structures modeled with the NT set. Surprisingly, the combination, NT models plus ADB, also performed better than the crystal structures. Even with the lowest EF1% combination, TT models with BW scoring, the MSM performed better than or equal to standard AF2 in more than half of the proteins (13 proteins). Table 2 Performance of structure-based virtual screening on various receptor models.To assess the statistical significance of our findings, we performed a Wilcoxon signed-rank test, a non-parametric method, across all receptor structure pairs and metrics (Supplementary Table S6). While the results did not reach the conventional threshold for statistical significance (p < 0.05), EF1% of NT models with ADB score showed relatively a lower p-value with standard AF2. The relatively small sample size (n = 25) may have limited our ability to detect subtle differences between methods. Further investigation with larger datasets could be beneficial to more fully understand the comparative performance of these methods.One of the interesting findings is that although the TT set models with the same KinCoRe classification as the reference have more accurate structures and docking poses than the NT set models, the virtual screening performance is slightly worse in both scoring schemes. To elucidate the variance in performance between MSMs using TT and NT models, we identified kinases that exhibited notable differences in EF1% between the two sets. Of the kinases studied, ABL1 demonstrated superior performance using the MSM with the NT set compared to the TT set, across both ensemble scoring methods (Supplementary Table S5). We measured TM-Scores between the templates from TT and NT sets to model the same structural state and EF1% of the models (Supplementary Table S7). Among the kinase states analyzed, the DFGout-BBAminus for both proteins produced a significant difference in both templates and models. When the kinase forms an active state, the activation loop forms an extended conformation to facilitate the catalytic function of the protein, thus the DFGin conformation has a conserved structure. On the other hand, for the DFGout conformation, the activation loop collapsed on the protein surface with high flexibility11. Therefore, although the protein structures have the same KinCoRe notation with DFGout, the activation loop can have different structures. The TM-Scores of the templates to model the DFGout-BBAminus state are 0.88 for ABL1. The structural difference of templates influenced the predicted models, resulting in TM-Scores of 0.89 between predicted models from TT and NT sets. Supplementary Fig. S3 shows a structural difference of ABL1 predicted models with DFGout-BBAminus conformation. The difference in activation loop location also might affect the virtual screening performance, leading the TT set (3.28) to have a lower EF1% value than the NT set (15.83). Significantly, this difference in performance in the DFGout-BBAminus state impacted the overall ensemble docking results.One benefit of using the MSM models is that the predicted models are diverse, so they are potentially useful for discovering various scaffolds of hit chemicals. To examine whether this hypothesis is true, we divided the benchmark set into two groups based on the diversity of active compounds in the screening library. The pairwise Tanimoto coefficients (Tc) between the active compounds were calculated using RDKitFP fingerprint27. Then the pairwise distances between the compounds were calculated as (1 – Tc). The diversity of active compounds is defined as the average distances of the active compounds (Supplementary Table S5). With a threshold of 0.7, the kinases were classified into two groups: 14 proteins with values higher than or equal to the cutoff and the remaining 11 targets.For the kinases with more diverse active compounds, the ensemble screening with MSM models performed better than standard AF2 and AF3 models in all metrics, and even higher EF1% than the crystal structures. For EF1%, the MSM performed much better than standard AF2 (Table 2). Out of 14 proteins, the MSM with NT set and BW scoring has higher or equal EF1% values than standard AF2 models in 11 targets. On the other hand, when the active compounds become less diverse, MSM still performed better than the standard AF2, but the gap between them declined for EF1%. This implies that our approach would be powerful for discovering diverse molecular scaffolds.To investigate further, we examined the diversity of active compounds identified as top hits by calculating Shannon’s entropy (Supplementary Table S8). The active molecules in the DUD-E set were clustered using the Butina algorithm28 with a distance cutoff of 0.3. The higher Shannon’s entropy means that more diverse active compounds were found. The NT models with BW score show the highest Shannon’s entropy within the top 1% (1.36). Other AF-based models have Shannon’s entropy around 1.0 (1.08, 1.01, and 1.15 for standard AF2, AF3 with ADB, and AF3 with BW). Regarding the kinases with average dissimilarity ≥ 0.7, the NT models with ADB show the highest Shannon’s entropy within the top 1% (1.60), and the gap between MSM protocol with other receptor structures increases. This result supports that the use of multiple-conformation ensembles generated by our protocol would be suitable for discovering diverse hits.One of the diverse active hit examples is ABL1. The average dissimilarity of the active compounds is 0.73. Regardless of the template set and the scoring scheme, EF1% of MSM ensemble screening showed higher performance (TT models with ADB: 9.3, TT models with BW: 10.4, NT models with ADB: 15.3, and NT models with BW: 17.5) than standard AF2 model (6.0), AF3 (7.1 for ADB and 7.6 for BW), and crystal structure (7.6). We also examined the diversity of active compounds ranked within the top 1% ranked molecules. Our ensemble protocol tends to find diverse hits: 0.51, 0.53, 0.63, and 0.62 for TT models with ADB, TT models with BW, NT models with ADB, and NT models with BW, respectively. In contrast, the diversity of active compounds using the standard AF2 is 0.27. Supplementary Fig. S4 illustrates the distribution of active compounds within the top 1%. The ABL1 crystal structure got 16 active compounds from 6 clusters, while MSM with NT plus BW contained 32 active compounds from 16 clusters.CSF1R is another example showing MSM ensemble screening can find diverse hit compounds. The average dissimilarity of hit molecules is 0.73. The gap in EF1% between MSM models (5.4) and the standard AF2 model (3.6) is smaller than in the case of ABL1. The diversities of discovered hit compounds within the top 1% are 0.70, 0.65, 0.73, and 0.73 in the order of TT models with ADB, TT models with BW, NT models with ADB, and NT models with BW, similar to the average dissimilarity of all active compounds. However, the diversity of hit compounds within the top 1% identified using the standard AF2 model is 0.23. Figure 5 shows the docking pose of one of the active compounds, ChEMBL245377. The MSM ranked the active compound within the top 1% (10th using TT models and BW scoring, out of 12,316 compounds) which is not ranked highly by the standard AF2 structure (167th ). The compound is not successfully docked using the standard AF2 model. Fig. 5Predicted binding poses of ChEMBL245377 and CSF1R structures. (A) The superimposed complex structures of ChEMBL245377 and CSF1R. The crystal structure of CSF1R (PDB ID:3KRJ), standard AF2 predicted structure, and MSM model (DFGout-BBAminus, with TT sets) are represented as ribbon diagrams colored as gold, pink, and orange, respectively, while the predicted docking poses of the compound are represented as sticks. (B, C). Focused binding sites and docking poses of ChEMBL245377. The compound is ranked 10th by ensemble docking (B) and 167th in docking for the standard AF2 predicted structure (C).Although the ensemble screening with MSM structures generally performed better than standard AF2, there is an issue with selecting the representative score of a compound. For instance, for FGFR1, the MSM model with TT set and BW score showed an EF1% of 2.1. When we observed EF1% values of individual models, the DFGin-BLBplus state outperformed any other structures including standard AF2 and crystal structure (Table 3). Thus, a proper method for selecting or calculating representative scores for a compound should be designed to achieve high performance for MSM ensemble screening. Table 3 EF1% for FGFR1 structures of specific states. The MSM structures are built from TT set.

Hot Topics

Related Articles