Teaching old docks new tricks with machine learning enhanced ensemble docking

Methodological details

Input format and options

As input, EnOpt accepts a CSV-formatted \(n \times m\) matrix of docking scores, organized as a set of n compounds (rows) docked into m target-protein conformations (columns). All compounds should be labeled uniquely, and a separate CSV file should list the known ligands that serve as positive controls. This format allows users to easily import compounds and scores from different sources with different naming standards. The git repository (https://github.com/durrantlab/EnOpt) provides example input CSV files that illustrate the required format.

Model selection and hyperparameters

EnOpt trains a machine-learning model to predict the probability that a given compound is active. By default, it uses gradient-boosted trees, as implemented in XGBoost11. Users can also select random forest, as implemented in scikit-learn12. We incorporated these two methods because others have applied them successfully in the context of ensemble docking5. EnOpt's default hyperparameters for model fitting mostly align with these two packages' default values. Specifically, the learning rate is set to 0.3, and the maximum tree depth is set to 6. Both alpha (the L1 regularization term) and gamma (the minimum loss reduction required to split a leaf) are set to 0. By default, all features are used when generating each new tree node (per the default values of the colsample_bynode and max_features parameters in XGBoost and scikit-learn, respectively).

EnOpt's default number of estimators is 15, the lowest value that maintained satisfactory performance in our examples. We chose this low value to avoid overfitting and ensure generalizability. We acknowledge that random forest models are generally robust against overfitting because they rely on large ensembles of trees13, which mitigates the impact of individual tree variability by aggregating predictions. Nevertheless, we thought it best to exercise caution when making predictions within a dataset that has so few positive controls. Models trained with a larger number of estimators may perform better in some cases, as shown by Ricci et al.5. However, given that compound libraries often include few positive controls (known ligands), using a much larger number of estimators would risk overfitting, biasing the model towards the included known ligands.

Users may specify different hyperparameters via a JSON input file. Users who are uncertain which hyperparameters best suit their system can also use EnOpt's built-in hyperparameter-optimization feature. See the documentation for more information, and Supplementary Table S1 for the optimized XGBoost hyperparameters used in the present study.

Cross-validation training

EnOpt generates 3-fold cross-validation training and testing sets from the user's input data. Specifically, the data are divided into three non-overlapping parts with equivalent known/unknown ratios. Each of these three parts (folds) is used once as a testing set, while the remaining two are combined to form the corresponding training set. This cross-validation approach allows the user to verify that EnOpt's performance does not depend on the specific examples in a given training set (i.e., the three models trained on each fold's training set should have similar accuracies when evaluated on their respective testing sets).

During training, EnOpt assumes that only known ligands are positive examples (actives) and that all other compounds do not bind. This assumption is not necessarily valid; indeed, the goal is often to identify novel ligands among the other compounds. But it holds for the great majority of uncharacterized compounds and is commonly made when training novel scoring functions or generating benchmarking datasets5,14.

Prediction strategy

After training, EnOpt assesses the probability that each compound is active using a leave-one-out approach within the cross-validation framework. This approach addresses two challenges. First, because many VS campaigns include few known actives (positive controls), withholding data from training (for the final evaluation) is often impractical. At the same time, EnOpt must prevent data leakage, i.e., it must ensure that each compound is evaluated using a model that was not trained on that same compound.

To address these challenges, EnOpt considers each compound in the VS dataset separately. For each, it selects the one model of the three generated during cross-validation whose training set did not include the compound being evaluated. EnOpt then uses that model to assess the probability that the compound is active (i.e., to assign the final EnOpt score). This approach balances the need to use all available data with the requirement for valid, unbiased predictions. Conceptually, EnOpt can be seen as training a regressor on the input data itself, rather than a generalized predictor applicable to outside (withheld) data.
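To make the cross-validation and out-of-fold scoring scheme concrete, the following is a minimal Python sketch of the logic described above, not EnOpt's actual implementation. The file names (docking_scores.csv, known_ligands.csv) are hypothetical, and the fold construction uses scikit-learn's StratifiedKFold with the default-style XGBoost hyperparameters mentioned in the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

# Hypothetical inputs: an n x m score matrix (rows = compounds, columns =
# conformations) and a list of known-ligand IDs serving as positive controls.
scores = pd.read_csv("docking_scores.csv", index_col=0)
known_ids = pd.read_csv("known_ligands.csv", header=None)[0]

X = scores.to_numpy()
y = scores.index.isin(known_ids).astype(int)  # assume all others are inactive

# Three stratified folds preserve the known/unknown ratio in each part.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Out-of-fold scoring: each compound is scored by the one model whose
# training set excluded it, so no compound is evaluated by a model that
# saw it during training.
enopt_scores = np.empty(len(y), dtype=float)
for train_idx, test_idx in folds.split(X, y):
    model = XGBClassifier(
        n_estimators=15,   # conservative default, per the text
        learning_rate=0.3,
        max_depth=6,
        reg_alpha=0.0,     # L1 regularization (alpha)
        gamma=0.0,         # minimum loss reduction to split a leaf
    )
    model.fit(X[train_idx], y[train_idx])
    enopt_scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]

# Rank compounds from most to least probably active.
ranking = scores.index[np.argsort(-enopt_scores)]
```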
Figure 1. EnOpt's interactive visual summary. (A) Top N compounds by predicted activity probability (EnOpt score), including positive controls. (B) Docking-score distributions of these compounds. (C) How often docking a compound into a given conformation produced the best docking score of the ensemble (y-axis: number of compounds; x-axis: conformation IDs; green bars: top M most predictive conformations, per the tree models). N and M are user-defined parameters.

Readable and pipeline-friendly output

Activity probabilities and performance metrics

As output, EnOpt reports the predicted probability that each compound is active per its trained machine-learning models (i.e., the EnOpt score). It ranks the compounds by these predicted probabilities and uses that ranking to calculate performance metrics such as AUROC, PRAUC, BEDROC, and enrichment factors (see the Evaluating predictive performance section below for details). For comparison, EnOpt also assesses the predictive performance of each individual conformation in the ensemble by ranking the compounds by their associated single-conformation docking scores and calculating the same performance metrics. Both the EnOpt machine-learning predictions and the single-conformation analyses are saved to CSV files, enabling easy integration into existing computational pipelines and straightforward comparison of different scoring methods.

Feature importance metrics

Aside from calculating activity probabilities and performance metrics, EnOpt uses feature-importance methods to assign each conformation a numerical score based on its contribution to the predictions. For gradient-boosted tree models, the importance metric is information gain15. For random forest models, the importance metric is Gini impurity16. The CSV output reports the per-conformation feature importance values associated with each of the three cross-validation models, allowing users to assess the consistency of importance rankings. Identifying the most important conformations is useful because these conformations can then be prioritized in subsequent analyses (e.g., re-docking into a smaller ensemble with more computationally intensive programs or settings, or examining the protein-ligand interactions of the most predictive conformations in more detail).
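As an illustration of how such per-conformation importances might be extracted, here is a short, hypothetical continuation of the earlier sketch. Gain-based importances come from XGBoost's booster object; for random forest models, scikit-learn's RandomForestClassifier exposes the impurity-based analogue via its feature_importances_ attribute.

```python
# Continuing the sketch above: per-conformation importances from the last
# fitted gradient-boosted model. "gain" is the information-gain metric.
booster = model.get_booster()
gain = booster.get_score(importance_type="gain")  # keys like "f0", "f3", ...

# Map XGBoost's generic "f<i>" feature names back to conformation labels.
importances = {scores.columns[int(k[1:])]: v for k, v in gain.items()}
for conformation, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{conformation}\t{value:.3f}")
```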
Interactive summary

Finally, EnOpt provides an interactive summary (Fig. 1) using Plotly17. This summary focuses on the top-ranked docked compounds so the user can easily assess docking results and select compounds for further study. Specifically, EnOpt provides information about all known active compounds as well as the top decoy/inactive molecules (50 by default). It also reports the AUROC obtained when the compounds are ranked by their EnOpt-predicted probabilities of being active, allowing the user to easily evaluate the model. Further, the interactive summary includes the average feature-importance value associated with each conformation, so users can quickly identify which protein conformations contribute most to the models' predictivity.

EnOpt testing

Target and compound selection

AChE benchmark

To fairly test EnOpt, we sought to replicate the challenges typical of a real-world VS. We generated an initial docking dataset using a protocol designed to emulate a target-specific docking inquiry focused on acetylcholinesterase (AChE). We chose AChE because it is a drug target implicated in several human diseases, has multiple high-quality crystal structures available in the Protein Data Bank18 for ensemble VS, and has a large amount of bioactivity data available in PubChem19. We compiled a set of small molecules that included known ligands from PubChem as well as chemically similar decoy compounds identified using DUD-E's "Generate DUD-E Decoys" feature14. The ratio of decoys to known ligands was approximately 50:1.

DUD-E benchmark

As a second assessment, we generated a larger test set of seven crystallographic protein-receptor ensembles taken from the DUD-E database. We first considered the 102 receptors cataloged in DUD-E. We discarded receptors with two or fewer wild-type structures deposited in the PDB because they allowed only trivially small ensembles. We similarly discarded receptors with 100 or more wild-type PDB structures because most real-world docking projects will not have access to so large an ensemble. We further focused on human proteins implicated in human diseases, per UniProt20, with five or more high-resolution PDB structures (< 2.5 Å). To ensure the test set covered diverse protein receptors, we eliminated redundant proteins belonging to the same SCOP2 protein family21,22. Another six proteins were removed because, given the sizes of their ensembles and associated DUD-E compound sets, the ensemble VS would have required >200,000 dockings. Further information on the selected target PDB entries can be found in the supporting information (see Supplementary Table S2). Lists of all docked compounds are available on Zenodo (see the Availability of data and materials section, below).

LIT-PCBA benchmark

As a third assessment, we created an additional test set of 15 receptor/ligand systems derived from the LIT-PCBA dataset (Supplementary Table S2). This dataset is specifically designed to provide a realistic and unbiased evaluation of VS methods. Unlike DUD-E, which has been shown to contain hidden biases that can lead to over-optimistic performance estimates, LIT-PCBA is based on carefully curated experimental data from PubChem bioassays23.

We first downloaded the files for the 15 protein/ligand systems from the LIT-PCBA server. To ensure EnOpt had suitable ensembles for analysis, we expanded each system by downloading all structures from the Protein Data Bank18 that shared the same UniProt ID, thus representing the same protein. We then discarded structures with mutations, resolutions greater than 3 Å (including all NMR structures, which had no specified resolutions), and other disqualifying features. The remaining structures were categorized as either apo or holo, depending on the presence of a non-polymer ligand.

To assemble a manageable benchmark set from these files, we next identified at most 20 active compounds for each system from among the actives that the LIT-PCBA database provides. Two of the LIT-PCBA systems have fewer than 20 actives (ADRB2 and ESR1_ago); for these, we retained all actives. For the remaining 13 systems, we selected 20 representative actives by clustering the LIT-PCBA active compounds using the Butina algorithm24 (illustrated in the sketch at the end of this subsection) and selecting one compound from each of 20 distinct clusters. After selecting active compounds for each system, we randomly selected 40 LIT-PCBA inactive compounds per active.

To create ensembles of protein structures for each of the 15 LIT-PCBA systems, we considered the identified holo (ligand-bound) and apo (unbound) protein structures separately. If there were more than 10 structures of either type, we clustered them based on structural similarity (RMSD) and selected representative conformations; otherwise, we used all available structures.
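For illustration, a minimal RDKit sketch of this kind of Butina-based selection follows. The SMILES file name and the 0.4 Tanimoto-distance cutoff are assumptions for the example, not necessarily the values used in the study.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from rdkit.ML.Cluster import Butina

# Hypothetical input: SMILES strings of one system's active compounds.
smiles = open("actives.smi").read().split()
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

# Butina expects a condensed lower-triangle distance matrix.
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

# Cluster at an assumed Tanimoto-distance cutoff of 0.4; each returned
# cluster lists its centroid first, so take one representative per cluster.
clusters = Butina.ClusterData(dists, len(fps), 0.4, isDistData=True)
representatives = [mols[cluster[0]] for cluster in clusters[:20]]
```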
Protein and ligand preparation

For the AChE and DUD-E protein targets, we removed (1) duplicate domains or multimers, (2) all co-crystallized small molecules and peptides, and (3) all water molecules. We aligned (superimposed) all structures of each ensemble using PyMol25 or UCSF Chimera26. We used OpenBabel27 to prepare target-protein and small-molecule structures for docking, protonating at pH 7.4 (e.g., "obabel input.pdb -O output.pdbqt -p 7.4 -xr" for proteins, and "obabel input.mol2 -O output.pdbqt -p 7.4" for ligands).

For the LIT-PCBA protein targets, we removed all co-crystallized small molecules and peptides. We also removed crystallographic water molecules but retained metal cations, which are often important for small-molecule binding. We aligned the structures using Biopython28 and used PDB2PQR29 to prepare the target-protein structures for docking, protonating at pH 7. Similarly, we used OpenBabel27 to prepare small-molecule structures for docking, also protonating at pH 7.

Docking

For the initial AChE and subsequent DUD-E screens, we used smina30 for docking. Each smina run predicts several docked poses; we considered only the top-scoring pose for further analysis. We specified the docking box using smina's autoselection feature with a padding of 4 Å on all six sides, centered on a crystallographic ligand taken from an appropriate holo complex. For the initial AChE docking dataset, the ligand structure and position were taken from a PDB entry for acetylcholinesterase bound to donepezil (ID 4EY7)18,31. For the additional test cases, the ligand structure was provided by the DUD-E database14. To improve run times, we set smina's exhaustiveness parameter to 4 for all of these docking runs, reduced from the default of 8.

We used a somewhat different docking protocol for the LIT-PCBA screens, with the goal of demonstrating EnOpt's broad effectiveness across various pipelines. Specifically, we used AutoDock Vina 1.1.2 (Vina)32 for the LIT-PCBA docking. Each Vina run also predicts several docked poses; we again considered only the top-scoring pose. We determined docking-box centers by calculating the geometric center of selected crystallographic ligands. We then used MolModa33 to determine the docking-box sizes required to entirely encompass these reference ligands. Finally, we used Vina to dock the prepared ligands (both actives and inactives) into each of the corresponding protein structures. For the LIT-PCBA runs, we used Vina's default exhaustiveness setting (8).
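As a rough illustration of what such an ensemble-docking loop can look like, the following Python sketch shells out to smina with the box-autoselection and exhaustiveness settings described above. All paths and file names are hypothetical, and this is a sketch of the general approach rather than the study's actual scripts.

```python
import subprocess
from pathlib import Path

# Hypothetical layout: prepared PDBQT receptors and ligands, plus a
# crystallographic reference ligand used for smina's box autoselection.
receptors = sorted(Path("receptors").glob("*.pdbqt"))
ligands = sorted(Path("ligands").glob("*.pdbqt"))
Path("poses").mkdir(exist_ok=True)

for receptor in receptors:
    for ligand in ligands:
        out = Path("poses") / f"{receptor.stem}__{ligand.stem}.pdbqt"
        subprocess.run(
            [
                "smina",
                "-r", str(receptor),
                "-l", str(ligand),
                "--autobox_ligand", "reference_ligand.pdb",  # crystallographic ligand
                "--autobox_add", "4",                        # 4 Å padding per side
                "--exhaustiveness", "4",                     # reduced from default 8
                "-o", str(out),
            ],
            check=True,
        )
# Only the top-scoring pose in each output file is used downstream.
```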
Evaluating predictive performance

For each benchmark VS, we used in-house scripts to reformat the docking outputs into an EnOpt-compatible \(n \times m\) matrix (CSV). We then trained EnOpt models to predict the probability that a given compound is active and ranked all compounds by those probabilities (i.e., EnOpt scores), from most to least probable. To assess EnOpt's ability to identify true ligands, we calculated multiple evaluation metrics using these ranked lists.

The primary metric was the area under the receiver operating characteristic curve (AUROC). ROC curves relate true positive rates (TPR) to false positive rates (FPR) across various classification thresholds. We considered all possible thresholds, each partitioning the ordered list into presumed actives and inactives. Because we knew which compounds were true actives and inactives (decoys), we could calculate the true and false positive rates for each threshold. Plotting each (FPR, TPR) pair defines the ROC curve. The area under this curve indicates how well the model classifies the candidate ligands. More precisely, it is the probability that a randomly chosen true active will rank better than a randomly chosen inactive/decoy.

As a second metric, we considered the area under the precision-recall curve (PRAUC), as implemented in scikit-learn12,34. Precision quantifies the accuracy of positive predictions as the ratio of true positives to all predicted positives. Recall measures the completeness of positive predictions as the ratio of true positives to all actual positives. The PRAUC summarizes the trade-off between precision and recall across various classification thresholds in a single value, offering a comprehensive view of model performance, particularly for unbalanced datasets.

As a third metric, we considered the Boltzmann-enhanced discrimination of the receiver operating characteristic (BEDROC), as implemented in RDKit35. This metric modifies the traditional ROC by emphasizing the early detection of active compounds. BEDROC applies an exponential weight to compounds ranked higher in the list, reflecting the practical importance of identifying active compounds early in VS campaigns. The BEDROC calculation uses a parameter \(\alpha\) to control the emphasis on early recognition; a higher \(\alpha\) value focuses the metric on a smaller percentage of top-ranked compounds36. We selected \(\alpha = 5\), which effectively focuses on the top \(\sim\)20% of ranked results37.

As a fourth metric, we considered enrichment factors. The enrichment factor quantifies a model's ability to rank active compounds higher than expected by random selection. Specifically, it is the ratio of the proportion of actives in the top N percent of ranked compounds to the proportion of actives in the entire dataset. We set N to 20% for consistency with our BEDROC \(\alpha\) parameter. An enrichment factor greater than 1 indicates that the model performs better than random selection at identifying active compounds within the top-ranked results.

To compare our use of predicted probabilities to more traditional approaches for ranking by ensemble scores, we separately re-ranked the compounds by their ensemble-average and ensemble-best docking scores. We then calculated AUROC, PRAUC, BEDROC, and enrichment-factor values for these compound rankings as well. While most of the results we discuss rely on the AUROC metric, results for all four metrics are included in Supplementary Table S3.
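To make the four metrics concrete, here is a small Python sketch of how they can be computed from a ranked list. It assumes scikit-learn's average_precision_score as the PRAUC summary and RDKit's CalcBEDROC; it illustrates the definitions above rather than reproducing EnOpt's exact code.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from rdkit.ML.Scoring.Scoring import CalcBEDROC

def evaluate_ranking(y_true, y_score, top_frac=0.20, alpha=5.0):
    """Illustrative AUROC, PRAUC, BEDROC, and enrichment-factor calculations."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    auroc = roc_auc_score(y_true, y_score)
    prauc = average_precision_score(y_true, y_score)

    # RDKit's CalcBEDROC expects rows sorted best-first, with the column
    # index of the activity label (1 = active) as its second argument.
    order = np.argsort(-y_score)
    bedroc = CalcBEDROC([[label] for label in y_true[order]], 0, alpha)

    # Enrichment factor: hit rate in the top N% divided by the overall rate.
    n_top = max(1, int(round(top_frac * len(order))))
    ef = y_true[order[:n_top]].mean() / y_true.mean()

    return {"AUROC": auroc, "PRAUC": prauc, "BEDROC": bedroc, "EF20%": ef}
```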
Scientific contribution

EnOpt is a user-friendly virtual-screening (VS) tool that trains machine-learning (ML) models to predict the probability that a given molecule is active. It learns to map the docking scores associated with an ensemble VS (i.e., multiple compounds, each docked into multiple protein-receptor conformations) to probability scores for compound ranking. EnOpt builds on previous research that uses ML to process ensemble-VS scores, but it wraps these methods in an easy-to-use tool that outputs predictions in an intuitive, visual format.
