BASE: a web service for providing compound-protein binding affinity prediction datasets with reduced similarity bias

Investigation of model complexity relative to data size in binding affinity prediction methods

We began our analysis with the suspicion, based on a review of existing studies, that the high accuracy of current models might be due to an underlying bias in the datasets. To explore this, we examined the number of trainable parameters and the amount of input data used in the latest compound-protein binding affinity prediction methods.

Table 2 presents the number of trainable parameters, input data size, and performance of six state-of-the-art binding affinity prediction models: MMD-DTA [45], ProSmith [51], NHGNN-DTA [52], SSM-DTA [14], ColdDTA [15], and FusionDTA [53]. The results show high model complexity, with parameter-to-data ratios ranging from approximately 42 to 866. Despite these high ratios, the models achieved impressive test performances, with CI values often exceeding 0.90.

Table 2 Number of trainable parameters, data size, and performance of existing DTA prediction models

This unexpectedly high performance under such challenging training conditions suggests the presence of bias within the datasets. It implies that the models might predict binding affinity without truly utilizing the interaction information between compounds and proteins.

Revealing structural bias in datasets through compound and protein indices

To investigate this potential bias further, we conducted a comprehensive analysis using eight databases: PDBbind [9], BindingDB [16], ChEMBL [17], IUPHAR [18], GPCRdb [19], GLASS [20], Davis [21], and NR-DBIND [22]. We generated compound and protein indices by ranking compounds and proteins according to their average binding affinity values and used these indices to perform regression analyses.

Database-specific analysis

We analyzed each database individually to determine whether binding affinity could be predicted using the compound and protein indices alone. For each database, predictions were fitted with a third-degree polynomial function. In almost all of the databases (7 out of 8), predictions using the compound index achieved a PCC of 0.8 or higher (Fig. 2). The exception was the Davis dataset, in which the binding affinity values varied substantially for each compound, precluding prediction based solely on the compound index. However, the Davis dataset includes only 72 compounds and focuses exclusively on kinase targets, making it less suitable for generalization because of its limited scope. Predictions using the protein index showed greater variability and generally lower performance than those using the compound index. Although we restricted our final dataset to human proteins for consistency, we also repeated the analysis on the entire PDBbind dataset without filtering. In both the filtered and unfiltered datasets, binding affinity predictions based solely on compound information achieved a PCC exceeding 0.95, indicating that the observed bias persisted regardless of the filtering for human proteins.

Fig. 2 PCCs for binding affinity prediction using compound indices (orange) and protein indices (cyan) across various databases: PDBbind, BindingDB, ChEMBL, IUPHAR, GPCRdb, GLASS, Davis, and NR-DBIND. For each database, two bars represent the correlation strength, with specific values annotated above each bar for clarity.

These findings suggest that the datasets allow accurate binding affinity prediction using only compound features, without fully capturing interaction information.
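As a minimal illustration of this index-based regression (not the authors' exact pipeline), the sketch below assumes a pandas DataFrame with hypothetical columns compound_id and affinity, ranks compounds by their mean affinity, fits a cubic polynomial, and reports the PCC of the fit:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def compound_index_fit(df: pd.DataFrame, degree: int = 3):
    # Rank compounds by their mean binding affinity to obtain the compound index.
    mean_aff = df.groupby("compound_id")["affinity"].mean().sort_values()
    index = {cid: rank for rank, cid in enumerate(mean_aff.index)}

    x = df["compound_id"].map(index).to_numpy(dtype=float)
    y = df["affinity"].to_numpy(dtype=float)

    # Fit a third-degree polynomial of affinity against the compound index
    # and report the Pearson correlation between fitted and observed values.
    coeffs = np.polyfit(x, y, deg=degree)
    pcc, _ = pearsonr(np.polyval(coeffs, x), y)
    return coeffs, pcc

# Example usage (hypothetical file and column names):
# df = pd.read_csv("binding_data.csv")   # columns: compound_id, protein_id, affinity
# coeffs, pcc = compound_index_fit(df)
```

The same procedure applied to a protein index (grouping by protein instead of compound) yields the protein-based fits compared in Fig. 2.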
Previous studies have suggested, through the analysis of complex structures in the PDBbind dataset [10], that binding affinity can be predicted using only compound or protein structures, although this was not clearly demonstrated. Our analysis confirmed the bias that binding affinity can be predicted using only compound features. However, the ability to predict binding affinity using protein features varied across databases. When all the databases were combined, predictions using compound features remained accurate, whereas those using protein features were less accurate. This finding can be expressed as follows: compound-protein binding affinity ≅ f(rank(compound)), where f is a polynomial function and rank(compound) is the compound's rank when compounds are ordered by their mean binding affinity. From a machine learning perspective, using a label (binding affinity) to create a feature (compound index) introduces data leakage [54], making accurate performance evaluation impossible. While this data leakage prevents full trust in the current regression performance, our results uncovered a critical characteristic of the data: compound-protein binding affinity can be fitted from the ranking of compound binding affinities alone, indicating an inherent bias in the dataset structure.

Integrated database analysis

By integrating the databases, we analyzed inherent biases across the entire dataset and identified common bias patterns. This analysis revealed that binding affinity predictions based on the compound index fit well with a third-degree polynomial function, achieving a PCC of 0.921 (Fig. 3a). In contrast, the protein index was less accurate, with a PCC of 0.663 (Fig. 3b).

Fig. 3 Cubic regression of binding affinity using compound and protein indices. Scatter plots show binding affinity against A the compound rank index and B the protein rank index, with a third-degree polynomial (cubic) regression line (red) fitted. The PCCs for these regressions are displayed in red text. Data from all the databases were combined to calculate these PCC values.

Binding affinity prediction using structural features of low and high variation compounds

To further explore the observation that the compound index can predict binding affinity, we examined whether the binding affinities for each compound were concentrated within a specific range. We calculated the coefficient of variation (CV) for each compound and categorized compounds as high or low variation based on the mean CV of the approved drugs. Among the compounds that bind to multiple proteins, 81.9% (50,432 out of 61,603 compounds) were low variation compounds (Fig. 4a).

Fig. 4 Predicting binding affinity using ECFP4 and CV per compound. A Density plot of the CV per compound, with the mean CV of approved drugs marked by a red dashed line. The proportions of compounds classified as having low variation (81.9%) and high variation (18.1%) based on this CV threshold are shown in blue text. Regression plots for binding affinity prediction of B low variation compounds and C high variation compounds using ECFP4. The scatter plots compare the predicted binding affinity (X-axis) with the observed binding affinity (Y-axis), with the linear regression line in blue. The PCCs are displayed in blue text, with PCC = 0.851 for low variation compounds and PCC = 0.159 for high variation compounds.
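The per-compound CV split can be sketched as follows, again with hypothetical column names and assuming a precomputed set of approved-drug compound IDs; the exact preprocessing in the paper may differ:

```python
import pandas as pd

def split_by_cv(df: pd.DataFrame, approved_ids: set):
    # Coefficient of variation (std / mean) of binding affinity per compound,
    # restricted to compounds measured against more than one protein.
    stats = df.groupby("compound_id")["affinity"].agg(["mean", "std", "count"])
    stats = stats[stats["count"] > 1].copy()
    stats["cv"] = stats["std"] / stats["mean"]

    # Threshold: mean CV of the approved drugs present in the data.
    threshold = stats.loc[stats.index.isin(approved_ids), "cv"].mean()

    low = set(stats.index[stats["cv"] <= threshold])
    high = set(stats.index[stats["cv"] > threshold])
    return low, high, threshold
```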
For these low variation compounds, we investigated whether binding affinity could be predicted solely from compound features. We developed a regression model using ECFP4 [25] features and a standard MLP neural network. The model achieved a PCC of 0.851 on the test set, demonstrating that the structural features of the compounds alone could accurately predict the binding affinity of low variation compounds (Fig. 4b). This relationship can be expressed as follows: low variation compound-protein binding affinity ≅ f(ECFP4(compound)), where f is an artificial neural network. In contrast, for high variation compounds, which constitute 18.1% of the dataset, the binding affinity predictions were markedly less accurate, with a PCC of 0.159 (Fig. 4c). This indicates that for high variation compounds, the binding affinities are more dispersed across different proteins, making them difficult to predict using compound features alone.

This discovery revealed a paradox: for low variation compounds, binding affinity could be predicted using only the compound structures, without considering the protein targets. High variation compounds, however, require additional factors, such as protein features. This highlights a critical characteristic of the dataset: an inherent bias toward compounds whose binding affinities are concentrated within a narrow range.

Evaluating binding affinity prediction using combined compound, protein, and interaction features

We next evaluated whether binding affinity predictions for low CV compounds could be improved by incorporating protein and interaction features along with compound features, using a standard MLP model. Among the models using individual features, the compound model using ECFP4 achieved a PCC of 0.851, outperforming the protein (Seq) and interaction (IFP) [29] feature models, which had PCCs of approximately 0.7 (Fig. 5).

Fig. 5 Predicting binding affinity for low variation compounds using different feature types and their combinations. The bar plot presents PCCs for predicting the binding affinity of low variation compounds using various feature types: FP (ECFP4), Seq (protein sequence encoding), and IFP. The first three bars represent individual features: FP, Seq, and IFP. The following three bars show combinations of two features: FP + Seq, FP + IFP, and Seq + IFP. The rightmost bar represents the combination of all three features: FP + Seq + IFP. The height of each bar indicates the PCC value, with specific values annotated above each bar for clarity.

Combining pairs of features resulted in moderate improvements in prediction performance. The model combining compound and protein features achieved a PCC of 0.898, whereas the model combining compound and interaction features had a PCC of 0.87. The combination of protein and interaction features resulted in a PCC of 0.768. When all three features (compound, protein, and interaction) were combined, the model achieved a PCC of 0.9, comparable to the performance of the compound and protein combination.

This trend is consistent with previous studies analyzing binding affinity prediction based on complex structures [11]. Among the individual feature models, the compound feature model was the most effective. Combining two features generally improved prediction accuracy, but incorporating all three features did not yield significant gains over the model using compound and protein features. In conclusion, for low CV compounds, compound features played a decisive role in binding affinity prediction, with protein features providing a slight improvement.
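An ECFP4-only regressor of the kind described above can be sketched with RDKit and scikit-learn as below; the network size and training settings are illustrative and are not the architecture or hyperparameters used in this study:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def ecfp4(smiles: str, n_bits: int = 1024) -> np.ndarray:
    # ECFP4 corresponds to a Morgan fingerprint with radius 2.
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

def fit_compound_only_model(smiles_list, affinities):
    X = np.stack([ecfp4(s) for s in smiles_list])
    y = np.asarray(affinities, dtype=np.float32)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    # A plain feed-forward regressor; layer sizes are illustrative only.
    model = MLPRegressor(hidden_layer_sizes=(512, 128), max_iter=300, random_state=0)
    model.fit(X_tr, y_tr)

    pcc, _ = pearsonr(model.predict(X_te), y_te)
    return model, pcc
```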
A similar analysis was conducted for high CV compounds, with the results included in the supplementary material (Fig. S1); combining features also improved prediction performance for this group.

Feature importance in the combined feature model using SHAP analysis

To determine whether the predictions of the combined compound, protein, and interaction feature model depended mainly on compound features for low CV compounds, we analyzed feature importance using SHAP [32] values. Figure 6 shows the most important features based on the mean absolute SHAP values; the higher a feature appears on the plot, the greater its impact on the model predictions.

Fig. 6 Key features in the combined ECFP4, sequence, and IFP model based on SHAP values. This plot displays the top 20 most influential features in the combined model, ranked by their mean absolute SHAP values. Each dot represents an instance in the test set, positioned on the X-axis by its SHAP value. ECFP4 bits are color-coded: red for "On" (value = 1) and blue for "Off" (value = 0). Sequence values are indicated by a gradient color scale, with blue for lower values and red for higher values.

The results indicate that ECFP4 features dominate the top 20 important features for low CV compounds, suggesting that the model relies heavily on ECFP4 bits. Sequence features also appear in the top 20 but are less prevalent. The interaction fingerprint (IFP) features did not reach the top 20, indicating that they have less impact than ECFP4 and sequence features. The distribution of dots along the X-axis for each feature indicates how consistent or variable its influence is across samples. For instance, ECFP4_Bit_926 shows a wide spread of SHAP values, indicating both positive and negative impacts depending on the sample; however, the cluster of red dots on the positive side suggests a generally greater positive impact. The high ranking of ECFP4 bits, even though ECFP4 has the smallest feature dimension of the three feature types, suggests that for low CV compounds binding affinity can indeed be predicted using only compound features. In contrast, for high CV compounds, a similar SHAP analysis revealed that protein features play a more prominent role, as shown in the supplementary material (Fig. S2), highlighting the difference in feature importance between the low and high CV groups.

ECFP4-based UMAP analysis of structural differences between low CV and high CV compounds

To understand why so many compounds exhibit low CV, we hypothesized that low CV compounds possess distinct structural features leading to consistent binding affinities across various targets. To test this hypothesis, we performed UMAP [34] embedding of the ECFP4 fingerprints to visualize and compare the structural differences between low and high CV compounds. We calculated the ECFP4 fingerprint for each compound and then applied UMAP to reduce the dimensionality for visualization. The resulting plot (Fig. 7) shows the distribution of low CV (blue) and high CV (red) compounds based on their structural features.

Fig. 7 UMAP analysis of structural differences between low and high CV compounds. This UMAP plot visualizes the structural differences between low CV (blue dots) and high CV (red dots) compounds based on their ECFP4 features. The two UMAP dimensions, UMAP1 and UMAP2, are plotted on the X and Y axes, respectively.

The UMAP results show that high CV and low CV compounds do not form distinct clusters, suggesting that there is no significant structural differentiation between the two groups. This implies that structural features are not the primary factor contributing to the observed variation in binding affinity.
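The UMAP comparison can be sketched as follows, reusing the hypothetical ecfp4 helper from the earlier sketch; the UMAP parameters are illustrative defaults rather than those used for Fig. 7:

```python
import numpy as np
import matplotlib.pyplot as plt
import umap  # pip install umap-learn

def plot_cv_umap(smiles_by_id: dict, low_cv_ids: set, high_cv_ids: set):
    ids = sorted(low_cv_ids | high_cv_ids)
    X = np.stack([ecfp4(smiles_by_id[i]) for i in ids])  # ecfp4() from the sketch above

    # Two-dimensional UMAP embedding of the ECFP4 fingerprints.
    embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

    # Color low CV compounds blue and high CV compounds red, as in Fig. 7.
    colors = ["blue" if i in low_cv_ids else "red" for i in ids]
    plt.scatter(embedding[:, 0], embedding[:, 1], c=colors, s=4, alpha=0.5)
    plt.xlabel("UMAP1")
    plt.ylabel("UMAP2")
    plt.show()
```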
Comparing the similarity among target proteins for each compound in low and high affinity variation groups

We speculated that the consistent binding affinities observed for low CV compounds could be attributed to high similarity among their target proteins. To test this hypothesis, we calculated the amino acid sequence similarity [35, 36] among the target proteins of each compound and compared the low variation group with the high variation group.

The results showed that the sequence similarity among the target proteins of low variation compounds was significantly higher than that of high variation compounds (Fig. 8a). Additionally, for low variation compounds with an average sequence similarity below 0.5, we calculated the functional similarity based on Gene Ontology [38, 40]. The functional similarity among the target proteins of low variation compounds was also significantly higher than that of high variation compounds (Fig. 8b).

Fig. 8 Comparison of similarity among target proteins for low and high affinity variation groups. Boxplots comparing A sequence similarity and B functional similarity among target proteins for each compound in the low and high variation groups. Functional similarity (B) is calculated using only data with average sequence similarity below 0.5. Similarity scores are displayed, with the statistical significance of mean differences indicated by p values from t-tests.

These findings suggest that the consistent binding affinities observed for low CV compounds stem from the high sequence or functional similarity of their target proteins: while structural differences among compounds did not account for the variation, similarity among the target proteins was the key factor.

Evaluating the effect of similarity bias on the performance of binding affinity prediction models

Given the nature of these datasets, properly controlling the similarity between the training and test sets is essential for a fair evaluation of binding affinity prediction models. Randomly splitting the data often results in test sets containing proteins highly similar to those in the training set, leading to overoptimistic performance estimates.

To address this issue, we fixed the test sets and gradually lowered the average protein similarity between the training and test sets using an integrated similarity value that accounts for both sequence and functional similarities. As similar data were progressively removed, the size of the training set decreased accordingly. To distinguish the effect of decreasing training data from that of similarity reduction, we also conducted a control experiment by randomly subsampling the same number of data points from the training set at each similarity cutoff.

First, we evaluated the simple MLP model combining ECFP4, protein sequence encoding, and IFP. As the similarity cutoff was lowered from 1, where similarity was not considered, the PCC decreased from 0.867 to 0.328, indicating that the model performance was heavily influenced by similarity bias (Fig. 9a). As the similarity decreased, both the CI [48] and the classification performance based on a 1 μM threshold also declined significantly (Table 3).

Fig. 9 Effect of decreasing protein similarity between the training and test sets on binding affinity prediction performance. Line plots display test set PCCs for the custom-developed A simple MLP model and the state-of-the-art models B ColdDTA and C MMD-DTA, as the similarity cutoff between the training and test sets is adjusted. The red line represents the performance when samples above the similarity cutoff are excluded, whereas the black line represents the performance on control datasets generated by random sampling to match the number of samples at each similarity cutoff.

Table 3 Regression and classification performance of simple MLP with varying similarity cutoffs
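For reference, the CI and a 1 μM threshold classification of the kind reported in Tables 3-5 can be computed as in the sketch below. The threshold value of 6.0 assumes affinities expressed on a -log10(M) (pKd/pKi) scale, and simple accuracy is shown; the exact scale and classification metric used in the tables may differ:

```python
import numpy as np

def concordance_index(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # CI: over all pairs with distinct true affinities, the fraction whose
    # predicted ordering agrees with the true ordering (prediction ties count 0.5).
    num, den = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue
            den += 1
            d = (y_pred[i] - y_pred[j]) * (y_true[i] - y_true[j])
            num += 1.0 if d > 0 else (0.5 if d == 0 else 0.0)
    return num / den if den else float("nan")

def accuracy_at_1um(y_true: np.ndarray, y_pred: np.ndarray, threshold: float = 6.0) -> float:
    # 1 uM equals 6.0 on a -log10(M) (pKd/pKi) scale; adjust if another scale is used.
    return float(((y_true >= threshold) == (y_pred >= threshold)).mean())
```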
For ColdDTA, the regression PCC on the test sets decreased from 0.9 to approximately 0.3 as the average integrated similarity between the training and test sets decreased (Fig. 9b), indicating that its high prediction performance was due to the similarity bias inherent in the dataset. Its performance metrics, including the CI and classification based on a 1 μM threshold, declined similarly (Table 4).

Table 4 Regression and classification performance of ColdDTA with varying similarity cutoffs

MMD-DTA showed a similar trend, with some variability in the performance decline as the similarity between the training and test sets decreased (Fig. 9c, Table 5). These results demonstrate that current models rely heavily on the similarity between the training and test sets and fail to reliably predict binding affinity for targets that are not similar to those in the training set.

Table 5 Regression and classification performance of MMD-DTA with varying similarity cutoffs

These findings align with existing research on machine learning-based scoring functions for estimating binding affinity from complex structures, where a decrease in protein similarity between the training and test sets likewise led to performance degradation [42]. This consistency underscores the importance of considering protein similarity when evaluating binding affinity prediction models. It also highlights the necessity of training and evaluating models on datasets in which such biases are minimized, to ensure reliable and generalizable predictions.

Web service for providing bias-reduced datasets and its impact on binding affinity prediction

We developed a web-based platform named Binding Affinity Similarity Explorer (BASE), which provides datasets that can be used to develop more robust and generalizable binding affinity prediction models by addressing the similarity bias between training and test sets. BASE allows users to create customized training sets by excluding proteins similar to those in the test set, thereby reducing bias. Users can define similarity thresholds based on three types of similarity (protein sequence, Gene Ontology, and integrated similarity) relative to the test set. The training and test sets used in the evaluation of the MLP, ColdDTA, and MMD-DTA models (reported in Tables 3, 4, and 5) can be accessed and downloaded from the Data Browser tab on the BASE website. Specifically, these datasets can be found under the "integrated" similarity type, with options to select different similarity cutoffs (Fig. 10). In addition, we provide prediction results for each model under various similarity cutoffs through the Running Examples tab, allowing users to visualize how prediction performance changes as the similarity threshold is adjusted.
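Conceptually, the bias-reduced training sets provided by BASE correspond to a filtering step like the following sketch. The data layout, column names, and the use of the maximum similarity to any test protein as the filtering criterion are assumptions for illustration, not the exact procedure implemented in BASE:

```python
import pandas as pd

def filter_train_by_similarity(train: pd.DataFrame, test: pd.DataFrame,
                               sim: pd.DataFrame, cutoff: float, seed: int = 0):
    # sim: square protein-by-protein DataFrame of integrated similarities
    # (hypothetical layout); train and test each have a protein_id column.
    test_proteins = test["protein_id"].unique()

    # Highest similarity of each training pair's protein to any test protein.
    max_sim = sim.loc[train["protein_id"], test_proteins].max(axis=1).to_numpy()

    # Keep only training pairs whose protein stays at or below the cutoff.
    filtered = train[max_sim <= cutoff]

    # Size-matched control: same number of pairs, sampled at random
    # without any similarity filtering.
    control = train.sample(n=len(filtered), random_state=seed)
    return filtered, control
```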
Fig. 10 BASE web service Data Browser tab interface. This interface allows users to split training and test sets by protein similarity. Users can select similarity types and adjust the similarity cutoff, which updates the number of selected training samples displayed in blue on the line graph. The "Select Training Set" button shows the dataset information in table form, and datasets can be downloaded as CSV files using the "Download Train Set" and "Download Test Set" buttons. Clickable and selectable items are highlighted with red lines.

To validate the effectiveness of these bias-reduced training sets, we conducted a SHAP analysis of feature importance for a simple MLP model combining compound, protein, and interaction features. Using ECFP4 (1024 bits), sequence encoding (1200 dimensions), and IFP (1540 dimensions), we created a 3764-dimensional feature set. We then extracted the top 500 features by mean absolute SHAP value to examine the overall distribution of feature types. Initially, without considering similarity (similarity cutoff = 1), more than 75% of the top 500 features were protein features. As the similarity cutoff decreased to 0.5, the proportion of protein features fell below 50%, whereas the proportions of compound and interaction features increased from 22.2% and 2% to 37.4% and 12.8%, respectively (Fig. 11). This shift indicates that models trained on our proposed bias-reduced datasets begin to balance the importance of the different feature types, in particular increasing the significance of interaction features.

Fig. 11 Proportion of the top 500 features by type across different similarity cutoffs. This plot illustrates the distribution of feature types among the top 500 features ranked by mean absolute SHAP value: compound (ECFP4), interaction (IFP), and protein (sequence). The features are identified from models trained on datasets filtered at different similarity cutoffs and evaluated on a consistent test set. The X-axis represents the similarity cutoff values, whereas the Y-axis represents the percentage distribution of each feature type. The colors indicate the feature types: orange for compounds, green for interactions, and blue for proteins.

Although test performance decreased when these bias-reduced datasets were used, the models began to balance the importance of the different feature types. By reducing the reliance on protein similarity, the models were able to develop a more comprehensive understanding of the factors that contribute to binding affinity.
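The feature-type breakdown underlying Fig. 11 can be reproduced schematically from a matrix of SHAP values as below; the assumed concatenation order of the 3764-dimensional feature vector (ECFP4, then sequence encoding, then IFP) is an assumption for illustration:

```python
import numpy as np

def feature_type_proportions(shap_values: np.ndarray, top_k: int = 500) -> dict:
    # Assumed feature blocks: ECFP4 (1024), sequence encoding (1200), IFP (1540).
    blocks = {
        "compound": np.arange(0, 1024),
        "protein": np.arange(1024, 2224),
        "interaction": np.arange(2224, 3764),
    }

    # Rank features by mean absolute SHAP value and keep the top_k.
    mean_abs = np.abs(shap_values).mean(axis=0)
    top = np.argsort(mean_abs)[::-1][:top_k]

    # Percentage of the top_k features that falls into each feature block.
    return {name: 100.0 * np.isin(top, idx).mean() for name, idx in blocks.items()}
```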
