Ensembling methods for protein-ligand binding affinity prediction

We evaluate the individual models and the ensembles, and compare their performance with that of state-of-the-art methods. We also investigate the reasons behind the observed performance levels. We first present all experimental results using the training and validation sets Training2016 and Validation2016 from PDBbind 2016 and the four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36. We then present additional results using training, validation, and test sets from PDBbind 2020. We also report the performance of average ensembling and Fully Connected Neural Network (FCNN) based ensembling.

Training using PDBbind 2016 datasets

We train the individual models using the training and validation sets Training2016 and Validation2016. We study the interpretability of our models and the impact of the angle features. We explore average ensembling of all combinations of the models. We evaluate the best models on the four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36. We compare the best three ensembles against state-of-the-art methods trained using the same training and validation sets.

Performance of individual models

Using the validation set Validation2016, we show the performance of the 13 individual models in Table 1. Among the models, laps demonstrates better performance in four metrics, namely R, RMSE, MAE, and CI, while lpst performs better in SD. The two-input-feature models la and as perform better than the other two-input-feature models due to the cross-attention layer.

Table 1 Performance of the 13 individual models shown on the validation set Validation2016 from PDBbind 2016.

Table 2 shows the performance of the best individual model laps on the training and validation sets Training2016 and Validation2016 and the four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36. Considering all five performance metrics, the performance of laps on the validation set Validation2016 is close to that on the training set. However, the performance on the test sets is significantly worse than that on the training and validation sets. Moreover, the performance on the CASF2016_290 test set is much better than that on the other three test sets.

Table 2 Performance of the best individual model laps on the training and validation sets Training2016 and Validation2016 from PDBbind 2016 and four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36.

Interpretability of our models

A nonlinear algorithm, t-distributed Stochastic Neighbour Embedding (t-SNE37), helps visualise high-dimensional data using a two- or three-dimensional map. We use t-SNE to peek into the black boxes of the individual models. For a given model, we extract the feature representation after the self-attention layer of the model for all protein-ligand complexes in the test dataset CASF2016_290. We then use the t-SNE software to obtain a two-dimensional representation of the extracted high-dimensional features. Figure 1 shows the plot of the two-dimensional representation for two example models: la and lapst. These two models have 2 and 5 features and 1 and 2 cross-attention layers, respectively.

Fig. 1 t-SNE visualization of the high-dimensional feature representation after the self-attention layer in models la (left) and lapst (right) on the test dataset CASF2016_290. The two axes represent the two output dimensions of t-SNE.

Figure 1 shows that protein-ligand complexes with similar binding affinities are clustered together. In the left chart for la, the complexes with low binding affinity values cluster towards the bottom-left part of the chart, while the complexes with high binding affinity values cluster towards the top-right part. In the right chart for lapst, a similar pattern is seen, with low binding affinity values towards the top-right part and high binding affinity values towards the bottom-right part. This suggests that the neural networks of our models effectively learn from the input features and capture the essential information in the feature representation extracted after the self-attention layer, which explains the performance levels of our models and ensembles.
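To illustrate how such a projection can be produced, the following is a minimal sketch of the t-SNE step using scikit-learn; the feature matrix, affinities, and perplexity setting here are random placeholders and assumptions for illustration, not the representations or settings used in our experiments.

    # Minimal t-SNE sketch (assumed workflow and settings, not the exact experimental setup).
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    # Random placeholders standing in for the per-complex feature vectors extracted after
    # the self-attention layer on CASF2016_290, and for the measured binding affinities.
    features = rng.normal(size=(290, 128))
    affinities = rng.uniform(2.0, 12.0, size=290)

    # Project the high-dimensional representation onto two dimensions.
    embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

    # Colour each complex by its binding affinity to see whether similar affinities cluster.
    plt.scatter(embedding[:, 0], embedding[:, 1], c=affinities, cmap="viridis", s=10)
    plt.colorbar(label="binding affinity")
    plt.xlabel("t-SNE dimension 1")
    plt.ylabel("t-SNE dimension 2")
    plt.show()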
Impact of angle features

We analyse the impact of our angle features on binding affinity prediction by comparing the performance of models lpst and lapst. Between the two models, lpst does not use the angle features while lapst does. Table 3 shows that model lapst achieves better performance than model lpst in all metrics on the CASF2013_195 dataset. The improvements range from 0.63% to 4.11%. On the other datasets, the results are mixed between using and not using the angle features.

Table 3 Impact of the angle features from the performances of the models lpst and lapst on the CASF2013_195 dataset.

Selected ensembles of models

We select the three best ensembles based on their MAE performance on the validation set Validation2016 when trained on the training set Training2016. The MAE is considered because it is the loss function minimised during the training of the individual models. Note that in this case we consider average ensembling, not Fully Connected Neural Network (FCNN) based ensembling; the two ensembling approaches are described in the Methods section. Table 4 lists the selected 3 ensembles AX, AY, and AZ along with their constituent models. Notice that the set of constituent models of ensemble AX is a subset of that of ensemble AY, and the set of constituent models of ensemble AY is a subset of that of ensemble AZ. Further, the la and as models are constituents of all three ensembles AX, AY, and AZ, which suggests that the protein-ligand interaction is captured via the angle and ligand features through the cross-attention layer.

Table 4 Selected three ensembles AX, AY, and AZ based on the lowest MAE values on the validation set Validation2016 and when trained on the training set Training2016.
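To make the selection procedure concrete, the following is a minimal sketch of average ensembling over model combinations and MAE-based selection; the model names listed and the placeholder predictions are for illustration only, and the actual ensembling approaches are described in the Methods section.

    # Minimal sketch: average ensembling over model combinations, selected by validation MAE.
    from itertools import combinations
    import numpy as np

    rng = np.random.default_rng(0)
    n_val = 500  # placeholder size for the validation set Validation2016
    val_true = rng.uniform(2.0, 12.0, size=n_val)

    # Random placeholders standing in for each individual model's validation predictions.
    model_names = ["la", "as", "laps", "lpst", "lapst"]  # a subset of the 13 models, for illustration
    val_preds = {name: val_true + rng.normal(scale=1.0, size=n_val) for name in model_names}

    def average_ensemble(preds):
        # Average ensembling: the ensemble prediction is the mean of its members' predictions.
        return np.mean(np.stack(preds, axis=0), axis=0)

    # Enumerate all combinations of at least two models and rank them by validation MAE.
    results = []
    for size in range(2, len(model_names) + 1):
        for combo in combinations(model_names, size):
            ensemble_pred = average_ensemble([val_preds[m] for m in combo])
            mae = np.mean(np.abs(ensemble_pred - val_true))
            results.append((mae, combo))

    # The combinations with the lowest validation MAE correspond to the selected ensembles.
    for mae, combo in sorted(results)[:3]:
        print(f"MAE={mae:.3f}  models={combo}")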
Statistical test results: Since the performances of the three ensembles are apparently very close and they have subset relationships, we performed the Wilcoxon Signed-Rank test to check the statistical significance of the difference between each pair of ensembles \(\{\texttt {AX}, \texttt {AY}, \texttt {AZ}\}\). The p values for the pairs (AX, AY), (AX, AZ), and (AY, AZ) are respectively 0.000605, \(3.43\times 10^{-10}\), and \(5.20\times 10^{-10}\). At the 95% confidence level, these p values (\(\le 0.05\)) denote statistically significant differences.

Best model vs best ensemble

Table 5 shows that the performance of the best ensemble AX is much better than that of the individual model laps on all datasets in all metrics. This essentially shows the importance of using ensembles over individual models.

Table 5 Performance of the best individual model laps and the best ensemble AX on the training and validation sets Training2016 and Validation2016, and four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36.

Performance of best ensembles

Table 6 shows the performance of the three selected best ensembles AX, AY, and AZ on the training and validation sets Training2016 and Validation2016, and the four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36. Although ensemble AX is selected as the best among the three ensembles based on their MAE values on the validation set Validation2016, Table 6 shows that it performs better in most of the metrics only on the CSAR-HiQ_51 and CSAR-HiQ_36 test sets. On the other hand, ensemble AY performs better in most of the metrics on the training set Training2016 and the CASF2016_290 and CASF2013_195 test sets.

Table 6 Performance of the selected ensembles shown in Table 4 on the training and validation sets Training2016, Validation2016 and the four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36.

Performance difference across datasets

Based on Tables 2, 5, and 6, it is clear that the best individual model and the best ensembles perform much better on the training and validation datasets Training2016 and Validation2016 than on the four test datasets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36. We investigate the reason primarily for the best ensembles.

Figure 2 compares the relative frequencies of the real and the predicted binding affinity values on the Training2016 and Validation2016 datasets for the three best ensembles AX, AY, and AZ. The closeness of all four lines explains why the performances of the best ensembles on the training and validation datasets are very similar and quite high.

Fig. 2 Real and predicted binding affinity values and their relative frequencies on Training2016 and Validation2016.

Figure 3 shows that the relative frequencies of the real and the predicted values differ considerably for the best three ensembles AX, AY, and AZ on the four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36. This explains the relatively worse performance of the ensembles on the four test sets compared to the training and validation sets Training2016 and Validation2016. Figure 4 shows that the distributions of the real binding affinity values in the four test datasets are quite different from those in the training and validation datasets. This reconfirms the differences in the performance levels and is expected for any machine learning algorithm when the data distributions vary considerably.

Fig. 3 Real and predicted affinity values and their relative frequencies for the best ensembles on the four test sets.

Fig. 4 Real binding affinity values and their relative frequencies for the datasets.
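The relative-frequency comparisons shown in Figures 2, 3, and 4 can be produced with a short script of the following kind; the placeholder data and the binning choice are assumptions for illustration.

    # Minimal sketch: relative frequencies of real and predicted binding affinities.
    import numpy as np
    import matplotlib.pyplot as plt

    def relative_frequencies(values, bins):
        # Histogram counts normalised to sum to 1, i.e. relative frequencies.
        counts, _ = np.histogram(values, bins=bins)
        return counts / counts.sum()

    rng = np.random.default_rng(0)
    # Random placeholders standing in for real affinities and one ensemble's predictions.
    real = rng.normal(loc=6.0, scale=2.0, size=290)
    pred = real + rng.normal(scale=1.0, size=290)

    # Common bins over the observed affinity range (the binning choice is an assumption).
    bins = np.linspace(min(real.min(), pred.min()), max(real.max(), pred.max()), 21)
    centres = 0.5 * (bins[:-1] + bins[1:])

    plt.plot(centres, relative_frequencies(real, bins), label="real")
    plt.plot(centres, relative_frequencies(pred, bins), label="predicted")
    plt.xlabel("binding affinity")
    plt.ylabel("relative frequency")
    plt.legend()
    plt.show()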
Five-fold cross validation

We employ five-fold cross-validation to analyse the robustness of our three ensembles AX, AY, and AZ. Table 7 shows the average and standard deviation of R over the five folds on the CASF2016_290 test dataset for the three ensembles. The standard deviation across the five folds is less than or equal to 0.005, which demonstrates that the variation between folds is very small and shows the effectiveness of our ensembles in terms of accuracy and reliability.

Table 7 5-Fold cross-validation on the CASF2016_290 test dataset.

Comparison with the state-of-the-art methods

Table 8 Performance of the three best ensembles AX, AY, AZ and existing methods on the four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36.

Table 8 compares the performance of the best ensembles AX, AY, and AZ with that of 12 existing state-of-the-art methods on four datasets. The existing methods are CAPLA4, DeepDTA32, DeepDTAF33, Pafnucy30, OnionNet38, IMCP-SF23, SFCNN31, DLSSAffinity34, FAST39, X-Score40, AutoDock Vina10 and T_Graph-GCN41. Among them, CAPLA, DeepDTA, and DeepDTAF are sequence-based methods, while Pafnucy, SFCNN, OnionNet, IMCP-SF, FAST, and T_Graph-GCN are structure-based methods, and X-Score and AutoDock Vina are empirical methods. DLSSAffinity and our proposed ensembles are hybrid methods. Note that all the compared methods are trained using the same training and validation sets Training2016 and Validation2016, so their performances can be compared with each other.

Performance on CASF2016_290 dataset: Table 8 shows that on this dataset, ensemble AY outperforms all the other methods in four of the five metrics, namely R, RMSE, MAE, and CI. The second and third best methods, in most cases, are also the ensembles. CAPLA, however, is the best among the existing methods in all five metrics.

Performance on CASF2013_195 dataset: Table 8 shows that on this dataset, ensemble AY outperforms all the other methods in R and CI, but CAPLA outperforms all other methods in RMSE, MAE, and SD. The second and third best methods in all cases are the ensembles. Overall, considering the first and second positions across all metrics, ensemble AY is the best-performing ensemble, while CAPLA is the best among the existing methods in all five metrics.

Performance on CSAR-HiQ_51 dataset: Table 8 shows that on this dataset, ensemble AX outperforms all other methods in all five metrics. The second and third best methods in all cases are the ensembles. CAPLA is the best among the existing methods in all five metrics.

Performance on CSAR-HiQ_36 dataset: Table 8 shows that on this dataset, ensemble AX outperforms all other methods in three metrics, namely RMSE, MAE, and SD, while ensemble AZ outperforms all other methods in the remaining two metrics, R and CI. The second and third best methods in all cases are the ensembles. CAPLA is the best among the existing methods in all five metrics.

Overall performance: Overall, ensemble AY is the best performer on the CASF2016_290 and CASF2013_195 datasets, while ensemble AX is the best performer on the CSAR-HiQ_51 and CSAR-HiQ_36 datasets. CAPLA is the best performer among the existing methods on all four test sets. Table 9 shows the improvement made by the best ensemble for each dataset over CAPLA. Except for SD on CASF2016_290 and RMSE, MAE, and SD on CASF2013_195, we see improvements ranging from 0.42% to 16.32% across the various metrics. The improvements are comparatively higher on the CSAR-HiQ_51 and CSAR-HiQ_36 datasets.
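The comparisons above rely on five metrics: R, RMSE, MAE, SD, and CI. For reference, the following is a minimal sketch of how such metrics are commonly computed; the definitions of SD (residual standard deviation of a linear fit) and CI (concordance index) assumed here follow common usage in the binding affinity literature and may differ in detail from the formulas given in the Methods section.

    # Minimal sketch of the five evaluation metrics under commonly assumed definitions.
    import numpy as np
    from scipy import stats

    def evaluate(y_true, y_pred):
        r = stats.pearsonr(y_true, y_pred)[0]              # Pearson correlation coefficient R
        rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))    # root mean square error
        mae = np.mean(np.abs(y_pred - y_true))             # mean absolute error

        # SD: standard deviation of the residuals of a linear fit of y_true on y_pred
        # (the convention used in CASF-style evaluations; assumed here).
        slope, intercept = np.polyfit(y_pred, y_true, 1)
        residuals = y_true - (slope * y_pred + intercept)
        sd = np.sqrt(np.sum(residuals ** 2) / (len(y_true) - 1))

        # CI: concordance index, the fraction of comparable pairs ranked in the correct order.
        concordant, comparable = 0.0, 0
        for i in range(len(y_true)):
            for j in range(i + 1, len(y_true)):
                if y_true[i] != y_true[j]:
                    comparable += 1
                    d = (y_pred[i] - y_pred[j]) * (y_true[i] - y_true[j])
                    concordant += 1.0 if d > 0 else (0.5 if d == 0 else 0.0)
        ci = concordant / comparable
        return {"R": r, "RMSE": rmse, "MAE": mae, "SD": sd, "CI": ci}

    # Example usage with random placeholder data.
    rng = np.random.default_rng(0)
    y_true = rng.uniform(2.0, 12.0, size=100)
    y_pred = y_true + rng.normal(scale=1.0, size=100)
    print(evaluate(y_true, y_pred))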
Table 9 Improvements (%) achieved in each test set by the best ensemble for the test set over CAPLA. Improvement is \(\frac{m_\text{Ensemble} - m_\text{CAPLA}}{m_\text{CAPLA}}\) for R and CI, and \(\frac{m_\text{CAPLA} - m_\text{Ensemble}}{m_\text{CAPLA}}\) for RMSE, MAE, and SD, where \(m_\text{CAPLA}\) and \(m_\text{Ensemble}\) are the respective metric values for CAPLA and the respective ensemble.

Statistical test results: Based on Table 8, we observe that ensemble AY performs the best on the CASF2016_290 and CASF2013_195 datasets, while ensemble AX performs the best on the CSAR-HiQ_51 and CSAR-HiQ_36 datasets. Among the existing methods, CAPLA performs the best on all four test datasets. Table 10 shows the p values of the Wilcoxon Signed-Rank test for CAPLA versus our best ensembles. For CASF2016_290 and CASF2013_195, the p values are less than 0.05, so the differences in performance between CAPLA and the ensembles are statistically significant at the 95% confidence level. For CSAR-HiQ_51 and CSAR-HiQ_36, the p values are larger than 0.05, denoting that the differences are not statistically significant at the 95% confidence level. Note that these latter two datasets are much smaller than the former two.

Table 10 Wilcoxon Signed-Rank test p values for CAPLA and our best ensembles AX, AY, and AZ.
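For reference, a Wilcoxon Signed-Rank test of this kind can be carried out as in the following minimal sketch; pairing the two methods' absolute errors per complex is an assumption for illustration, not necessarily the exact pairing used to produce the reported p values.

    # Minimal sketch: Wilcoxon Signed-Rank test on paired per-complex absolute errors.
    # Pairing the two methods' absolute errors on the same complexes is an assumption here.
    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    # Random placeholders standing in for real affinities and two methods' predictions
    # on the same test set (e.g. an ensemble and CAPLA).
    y_true = rng.uniform(2.0, 12.0, size=290)
    pred_a = y_true + rng.normal(scale=1.0, size=290)
    pred_b = y_true + rng.normal(scale=1.2, size=290)

    err_a = np.abs(pred_a - y_true)
    err_b = np.abs(pred_b - y_true)

    stat, p_value = wilcoxon(err_a, err_b)
    print(f"Wilcoxon statistic = {stat:.1f}, p value = {p_value:.3g}")
    # A p value below 0.05 indicates a statistically significant difference at the 95% level.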
Training using PDBbind 2020 datasets

With the model combinations of the best three ensembles AX, AY, and AZ, we run a range of further experiments. We retrain them using the training and validation sets Training2020 and Validation2020 from the PDBbind 2020 dataset. Besides the CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36 test sets, we use another test set, PDBbind2020_363. We examine the screening performance of the average ensembling approach, and we further use FCNN-based ensembling.

Average ensembling

Table 11 shows the performances of ensembles AX, AY, and AZ on the CASF2016_290, CASF2013_195, CSAR-HiQ_51, CSAR-HiQ_36, and PDBbind2020_363 test sets when trained using the training and validation sets from PDBbind 2016 (Training2016 and Validation2016) and from PDBbind 2020 (Training2020 and Validation2020). We see that the ensembles perform better on the four test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, and CSAR-HiQ_36 when trained using the training and validation sets from PDBbind 2020 than when using those from PDBbind 2016. These results are explainable since the PDBbind 2020 dataset has more protein-ligand complexes than the PDBbind 2016 dataset.

Table 11 Performance of ensembles AX, AY and AZ on CASF2016_290, CASF2013_195, CSAR-HiQ_51, CSAR-HiQ_36, and PDBbind2020_363 test sets when trained using training and validation sets from PDBbind 2016 (Training2016 and Validation2016) and from PDBbind 2020 (Training2020 and Validation2020) datasets.

Screening results

We analysed the effectiveness of ensembles AX, AY, and AZ using a virtual screening procedure. A benchmark dataset for screening is the Database of Useful Decoys: Enhanced (DUD-E) https://dude.docking.org. The dataset provides proteins with ligands; each protein has bindable ligands called actives and non-bindable ligands called decoys. We randomly selected a protein, Serine/threonine-protein kinase (akt1), which has 423 actives and 16,576 decoys. We generated 16,999 protein-ligand complexes using AutoDock Vina by configuring a docking grid of 20 Å on each side centred on the ligand. After that, we extracted the five features, ligand atoms, ligand SMILES, angles, protein, and pockets, from each protein-ligand complex. We evaluated our ensembles AX, AY, and AZ by examining their capability to differentiate between actives and decoys. Screening performance is measured using two crucial metrics: the Receiver Operating Characteristic (ROC) curve42 and the Enrichment Factor (EF)43.

Fig. 5 Performance of ensembles AX, AY and AZ in screening: ROC curve (left) and EF (right). The ensembles are trained using the training and validation sets Training2020 and Validation2020 from PDBbind 2020 dataset.

Figure 5 illustrates the efficacy of ensembles AX, AY, and AZ in virtual screening. In the ROC analysis, all three ensembles achieve AUC values equal to or greater than 0.70, which demonstrates their capability to differentiate between actives and decoys; the ROC curve plots true positive rates against false positive rates. Furthermore, the EF plot in Fig. 5, computed on the top 10% of the ranked data, shows that the ensembles prioritise active compounds and thus enable early identification of promising compounds.
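For reference, the two screening metrics can be computed from per-complex scores as in the following minimal sketch; the placeholder labels and scores, and the use of the predicted affinity as the ranking score, are assumptions for illustration.

    # Minimal sketch: ROC AUC and enrichment factor (EF) for distinguishing actives from decoys.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    # Random placeholders: 1 for actives, 0 for decoys, and a predicted score per complex
    # (here assumed to be the predicted binding affinity, higher meaning more likely active).
    labels = np.concatenate([np.ones(423), np.zeros(16576)])
    scores = rng.normal(size=labels.size) + labels  # actives score slightly higher on average

    auc = roc_auc_score(labels, scores)

    def enrichment_factor(labels, scores, top_fraction=0.10):
        # EF = (fraction of all actives recovered in the top-ranked subset) / top_fraction.
        n_top = max(1, int(round(top_fraction * len(scores))))
        top_idx = np.argsort(scores)[::-1][:n_top]   # indices of the highest-scoring complexes
        return (labels[top_idx].sum() / labels.sum()) / top_fraction

    print(f"AUC = {auc:.3f}, EF@10% = {enrichment_factor(labels, scores):.2f}")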
Using FCNN-based ensembling

We compare the best two average ensembles AX and AY with two FCNN-based ensembles FX and FY. All ensembles are trained using the training and validation sets Training2020 and Validation2020 from the PDBbind 2020 dataset. Note that ensembles AX and FX have the same constituent models; similarly, ensembles AY and FY have the same constituent models. Table 12 shows that the FCNN-based ensembles perform better than the average ensembles on the five test sets CASF2016_290, CASF2013_195, CSAR-HiQ_51, CSAR-HiQ_36, and PDBbind2020_363: FCNN-based ensembling outperforms average ensembling in all metrics on PDBbind2020_363, CASF2016_290, and CASF2013_195, and in the majority of metrics on the two CSAR-HiQ sets. These results show the effectiveness of FCNN-based ensembling over average ensembling.

Table 12 Comparison of performance of FCNN-based ensembling (FX and FY) and average ensembling (AX and AY).

Comparison with the state-of-the-art methods

We compare our best ensembles FX and FY with the existing state-of-the-art predictors CAPLA4, PSICHIC36, TankBind44, STAMP-DPI45, and GNINA46. Table 13 shows that ensembles FX and FY significantly outperform the existing predictors. Ensemble FY shows remarkable improvement on CASF2016_290 in terms of R and RMSE over the second best predictor CAPLA by 8.4% and 20.3%, and over PSICHIC by 15.4% and 28.4%, respectively. Ensemble FX shows significant improvement in terms of R and RMSE over CAPLA by 13.6% and 18.6%, respectively, on CASF2013_195, and achieves improvements in R and RMSE over CAPLA of greater than 15% and 19%, respectively, on both CSAR-HiQ test sets. Further, the ensembles significantly outperform four state-of-the-art methods on the PDBbind2020_363 test dataset: ensemble FY achieves improvements in terms of R and RMSE over PSICHIC by 4.8% and 5.9%, and over TankBind by 3.6% and 8.0%, respectively. Overall, these results show that the ensembles are effective in predicting binding affinity accurately on the five datasets (Table 14).

Table 13 Comparison of performance of ensembles FX and FY with state-of-the-art methods on CASF2016_290, CASF2013_195, CSAR-HiQ_51 and CSAR-HiQ_36 test datasets trained using Training2020 and Validation2020 sets.

Table 14 Comparison of performance of ensembles FX and FY with state-of-the-art methods trained using Training2020 and Validation2020 sets and performance on PDBbind2020_363 test set.

Analysis of diversification of individual models

We explore the strength of the ensembles in terms of the diversification of the individual models in predicting binding affinity, rather than relying on a single model. Table 15 shows the performance of the individual models as well as the performance of ensembles AY and FY on the PDBbind2020_363 benchmark test set. Note that all models are trained using the training and validation sets Training2020 and Validation2020 from the PDBbind 2020 dataset. We see that ensemble AY outperforms the individual models in all metrics, and ensemble FY outperforms AY. The standard deviation of RMSE, MAE, and SD across the individual models is greater than 0.10, which indicates a significant variance between the individual models. Individual models la, as, and st provide short-range interaction information, while models lpt, aps, pst, laps, and lpst provide both short-range and long-range interaction information between proteins and ligands. Therefore, ensembles AY and FY show improved performance and generalization capability compared to a single model.

Table 15 Analysis of diversification of individual models of improved ensemble AY on PDBbind2020_363 benchmark test set.

Discussion

We see that our EBA methods demonstrate improvement in predicting the binding affinity between protein and ligand complexes on all five benchmark test datasets. EBA methods are capable of extracting short-range and long-range interaction information between proteins and ligands by utilising different combinations of features. EBA methods achieve the highest values in R and CI and the lowest values in RMSE, MAE, and SD in comparison with the state-of-the-art methods on all five test sets. EBA methods also achieve the highest R value of 0.914 and an RMSE value of 0.957 on the well-known benchmark test set CASF2016 compared to all the state-of-the-art methods. Further, EBA methods show remarkable improvements in terms of R, by 15.4% and 13.6% on CASF2016 and CASF2013, respectively, compared to the second-best predictor, CAPLA. In addition, EBA methods yield significant improvements of more than 15% in R and 19% in RMSE on both well-known benchmark CSAR-HiQ test sets over CAPLA. The superior performance of EBA on all test sets indicates that EBA has generalization capability and provides an effective approach for accurately predicting the binding affinity between proteins and ligands regardless of the varying distributions of real binding affinity among the test sets.
