VEP performance across different human disease genes is highly heterogeneousTo investigate the performance of VEPs across different genes, we compiled a large set of human missense variants, based on the same strategy used in our previous study16. Pathogenic variants were obtained from ClinVar33, using those classified as pathogenic and likely pathogenic, while putatively benign variants observed in the human population were obtained from the gnomAD v2 dataset34, excluding any that were also present in the pathogenic set. We referred to the gnomAD variants as “putatively benign” because, although this set is likely to contain a small proportion of disease-causing variants, particularly from genes with recessive inheritance or incomplete clinical penetrance, the vast majority are expected to be non-pathogenic. This approach has been widely used in previous studies25,28,35. Moreover, we suggest that this is more reflective of the actual clinical utilisation of VEPs, whereby the challenge is in distinguishing pathogenic variants from rare, unclassified-but-benign variants, rather than known benign variants, which tend to be common and easy to identify36,37.We investigated 35 different VEPs using variant effect scores generated and compiled in our recent benchmarking study, split into “supervised” and “unsupervised” categories16. We started with the full set of 55 VEPs included in that study, many of which came from the dbNSFP 4.2 database38. We excluded nucleotide-level predictors and substitution matrices which tend to perform relatively poorly on missense variants16. We also only included VEPs from that study with predictions available for at least 75% of the total missense variants in our dataset. Finally, we also included predictions from the recent methods AlphaMissense17and ESM-1b39, due to their high-profile publications and expected general interest.To assess VEP performance on each gene, we utilised the area under the receiver operating characteristic curve (AUROC) metric40, which, in our experience, is likely the most widely used metric for assessing discrimination between pathogenic and benign variants. AUROC quantifies the trade-off between the true-positive rate (TPR) and the false-positive rate (FPR) across all possible classification thresholds to distinguish between two classes. A higher TPR and lower FPR indicate better classifier performance in discriminating between the two classes. Therefore, the curve of the classifier closer to the top-left corner of the graph indicates better classification performance. In addition to its widespread usage, we chose AUROC as the metric for assessing VEP because, in principle, it should be directly comparable between different genes with varying numbers of pathogenic and putatively benign variants. This is because the AUROC metric, by definition, is independent of class balance: regardless of the fraction of pathogenic missense variants within a gene, the null hypothesis value for AUROC will always be 0.541. This is also especially useful for evaluating VEPs given the fact that the pathogenic and putatively benign variant datasets are effectively independent of each other. For instance, if the putatively benign variant dataset increases in size (e.g. due to using a later version of gnomAD), the AUROC should remain very similar.In total, we obtained 963 human protein-coding genes with at least 10 missense variants in each group. While more genes could be included if we lowered this threshold, we were concerned that AUROC values would be less reliable for these genes. Thus, VEP heterogeneity would likely be increased simply due to the inclusion of genes with fewer variants. We also note that this is the same threshold we have used in previous studies16,27.Figure 1 illustrates the performance of the VEPs across all of the protein-coding genes in our dataset, sorted based on median AUROC. It is important to emphasise that this plot is not intended to provide a quantitative ranking of relative VEP performance, given that most of the included predictors are based on supervised machine learning and have likely been trained on variants in our datasets. Therefore, their apparent performance in discriminating between pathogenic and putatively benign variants will almost certainly be inflated by type 1 circularity14. However, the relative performance of unsupervised VEPs is consistent with previous work16, with EVE and ESM-1v ranking first and second, respectively.Fig. 1Heterogeneous performance of VEPs in identifying pathogenic missense variants across different human protein-coding genes.The distribution of AUROC values across 963 human protein-coding genes. The black dots refer to the median AUROC. The models were classified into either supervised or unsupervised based on the same classifications used previously16. Note that the relative performance of different supervised models is of limited reliability due to the issue of data circularity14.For purposes of this study, the most important observation from Fig. 1 is the tremendous heterogeneity of individual VEPs across different genes. All predictors demonstrate much stronger performance on some genes than others, suggesting that the reliability of a given VEP can vary greatly from one gene to another. While one potential approach to address this is by considering the outputs of multiple VEPs, as has generally been recommended19, we also observe that the performance of different VEPs across different genes can be highly correlated (Fig. 2). Thus, it is clear that VEP performance varies greatly across different genes, and those genes that have higher AUROC values for one VEP will also tend to have higher AUROC values for other VEPs.Fig. 2Correlation of performance of different VEPs across human protein-coding genes. This matrix shows the Pearson correlation between AUROC scores across 963 human protein-coding genes for different VEPs. VEPs with high correlations do not necessarily produce similar scores to each other but show strong correspondence between the genes for which they perform better or worse, as measured by AUROC. VEPs are ordered by median AUC, as in Fig. 1.Another common approach for assessing the performance of binary classifiers like VEPs involves calculating the area under the precision-recall curve (AUPRC). This strategy can be very useful when comparing different predictors across the same set of variants because AUPRC is sensitive to class balance and so may better reflect the real-world application of a predictor, in particular if the putatively benign variants far outnumber the pathogenic variants, as is often the case. However, for the same reason, AUPRC values are not comparable across genes with different pathogenic vs. benign class balances41. The null hypothesis AUPRC value is equal to the fraction variants that are pathogenic and thus can vary greatly from gene to gene. To illustrate this, in Figure S1 we plot the AUPRC in an analogous manner to AUROC in Fig. 1. We observe extreme heterogeneity, with values ranging from nearly 0 to 1 across all predictors, thus supporting the use of AUROC as the focus of this study.VEP performance on different genes as measured by AUROC is predictableGiven the demonstrated variability of VEP performance across different protein-coding genes, we wondered to what extent this could be explained by gene- and protein-level properties. To address this, we trained random forest (RF) models to predict the AUROC for all 35 VEPs in our study. RF is an ensemble tree-based learning method suitable for both regression and classification tasks42. RF stacks many shallow or weak decision trees called “stumps” or weak estimators which are trained utilising a feature called “bagging” in which the training dataset is shuffled and split randomly across different estimators. Through bagging, RF is more tolerant to overfitting training data since each estimator sees the dataset differently. In regression problems, the outputs of these weak estimators are averaged out to yield the final prediction. In addition, RF is capable of capturing non-linear relationships among independent variables and the response variable (i.e. AUROC) and is very easy to interpret.We compiled 99 features for all protein-coding genes in our dataset, covering a broad range of properties, such as evolutionary conservation, biological functions, protein structural properties, and observed variant properties in gnomAD (Table S1). We used these properties to train individual RF models for each VEP. Crucially, we used a non-redundant subset of 788 genes, filtered so that no proteins showed > 30% sequence identity to each other, to ensure that the performance of our models was not influenced by training and testing on homologous proteins. The models were trained using the optimal set of hyperparameters selected through Bayesian hyperparameter optimisation. To obtain robust evaluation statistics, we performed 100 repeated hold-out cross-validations leveraging its effectiveness in the context of a limited dataset size, randomly shuffling, and splitting the data into 80% training and 20% evaluation sets for evaluating the machine learning models.Figure 3 displays the Spearman correlation on the testing set and the coefficient of determination (R2) for each model (i.e., between predicted AUROC and real AUROC values for the test set). We observe that the AUROC can be predicted to varying degrees across all VEPs, with some models being predicted much better than others. Among the supervised models, there does appear to be some tendency for better performing VEPs to be better predicted. For example, the four VEPs for which AUROC can be best predicted, as measured by Spearman correlation, are ClinPred, AlphaMissense, BayesDel and MetaRNN, and these also have the highest AUROC values in Fig. 1. Interestingly, however, this does not appear to hold for the unsupervised models, with SIFT ranking highest for predictability, and EVE and ESM-1v ranking fairly low. One possible explanation for this is that the variational autoencoder and language model approaches of EVE and ESM-1v are able to take long-range residue interactions into consideration, which may be very hard to identify with any of the features used here.Fig. 3Performance of trained random forest models for the prediction of AUROC across different VEPs. The figure shows both the Spearman correlation (A) and the coefficient of determination (B) between the predicted and real AUROC values for the testing gene set. Error bars represent the 95% confidence intervals calculated for each trained random forest model based on 100 repeated hold-out cross-validations. We obtained the confidence intervals by computing the standard error and critical value using a t-distribution.Overall, despite considerable variability between methods, our results demonstrate that AUROC for distinguishing between pathogenic and putatively benign missense variants is predictable to some extent across all tested VEPs. To generate final AUROC predictions, we trained our RF models on all 963 proteins from our dataset and we generated predictions for all VEPs across the majority of human protein-coding genes.Multiple features contribute to the predictability of AUROCNext, we investigated which features contributed to the predictability of AUROC for each gene and could explain the heterogeneous performance of VEPs across different genes using Shapley Additive Explanations (SHAP). The SHAP method is a well-known, game-theoretic, model-agnostic framework in explainable AI that allows for the interpretation of any machine learning model, regardless of its complexity, to extract the effects of different features on the model’s prediction43. The method determines the additive contribution of each feature through many different subsets of features, and the final SHAP value for a given feature is the weighted average of its contribution to the model’s output across all possible combinations of features. We leveraged Tree SHAP algorithm which is a fast and exact method to estimate SHAP values for tree models and ensembles of trees under several different assumptions about possible features dependence. We utilised interventional feature perturbation which breaks the possible dependencies between features according to the rules dictated by casual inference43,44,45. The magnitude of the absolute SHAP values provides a direct measure of a feature’s effect or contribution to the model’s predictions.We identified the most important features from our RF models for each VEP. Figure 4A–D present the SHAP values of the most important features for four individual predictors known for their top performance: ESM-1v12, VARITY_R37, EVE46and AlphaMissense17, while Fig. 4D displays the top 30 features ranked by their mean importance across all 35 VEPs (full rankings in Table S2).Fig. 4Features most important for predicting VEP performance. The top 20 important features according to their absolute SHAP values for (A) ESM-1v, (B) VARITY_R, (C) EVE and (D) AlphaMissense. (E) The top 30 important features across all 35 different VEPs, sorted by their average rank. The features are colour-coded based on whether they have a positive or negative impact on the predicted AUROC (e.g. the ‘Multicellular Process’ GO term is associated with higher AUROC, while higher ddG_fold values are associated with lower AUROC).Interestingly, the most important features for VARITY_R, EVE and AlphaMissense, as well as from the combined analysis across all 35 predictors, is the Gene Ontology (GO) term “Multicellular Organismal Process” (GO:0032501). Similarly, the GO term “Developmental Process” (GO:0032502) ranks 3rd overall. Genes associated with these terms tend to have higher AUROC values, as indicated by their more positive SHAP values. One possible explanation for this is that these genes may be more likely to be associated with developmental disorders, with earlier onset and greater penetrance than mutations that cause other genetic disorders. Mutations causing such developmental disorders might be expected to experience stronger purifying selection, on average, than mutations causing other types of genetic disorders, some of which may be later onset or less phenotypically severe. This could make developmental disorder mutations easier to distinguish from the putatively benign gnomAD variants for essentially all VEPs, as they all rely heavily on evolutionary conservation, thus accounting for the higher AUROC values in these genes.Several of the most important features in our models derive directly from the properties of gnomAD missense variants. Although the inclusion of such gnomAD-derived features therefore potentially adds an element of circularity to our AUROC predictions, from a practical perspective, the human population variants will always be available for any given gene, and thus we feel they should be included to maximize our ability to predict AUROC. Most notably, the second overall most important feature across all VEPs, is ddG_fold, which is the mean ΔΔG value (i.e. predicted effect on protein stability) across all putatively benign missense variants from gnomAD, calculated with FoldX47. Our models show that AUROC tends to be higher for genes that have lower ddG_fold. A likely explanation for this is that, for proteins that have structurally less disruptive gnomAD variants, distinguishing them from pathogenic variants is a relatively easier task, in general. Thus, the importance of this feature may not reflect anything about the pathogenic variants, and instead simply reflect the fact that AUROC will tend to be higher for genes with milder “putatively benign” variants. Nevertheless, it is curious to note that ddG_fold, which is related specifically to protein stability, shows greater importance in our models than the vep_gnomad_mean property, which ranks 4th overall, and is the equivalent property calculated using scores from the VEP of interest (e.g., for the ESM-1v AUROC model, vep_gnomad_mean represents the mean ESM-1v score across gnomAD missense variants). Another feature directly derived from gnomAD missense variants, gnomAD_distance, also ranks highly in our models (6th overall). The is the numerator from our recently introduced Extent of Disease Clustering (EDC) metric reflecting the clustering of missense variants in three-dimensional space28, calculated using gnomAD missense variants rather than pathogenic variants. Essentially, proteins with a high gnomAD_distance value will tend to have a low density of gnomAD variants. This suggests that protein-coding genes that are less tolerant of missense variants, particularly in structured regions, will tend to have higher AUROC values across VEPs.The efx_rawfeature is notable for ranking 1st for ESM-1v, but only 20th overall across all VEPs. Importantly, this feature was defined as the median of the ratio of ESM-1v score and the FoldX-calculated ΔΔG for all possible missense substitutions in the full protein, and was introduced in a recent study where it emerged as one of the powerful features for distinguishing between dominant disease genes associated with dominant-negative vs. loss-of-function mechanisms48. The original purpose of this was as an attempt to distinguish between proteins where damaging mutations (as defined by ESM-1v, which should be largely independent of mechanism) were structurally damaging (higher ΔΔG) vs. structurally mild (lower ΔΔG). The fact that it derived from ESM-1v values can explain its high ranking for ESM-1v specifically. Related to this, we found that the pLOF feature introduced in that paper, which can distinguish between dominant disease genes associated with loss-of-function vs. non-loss-of-function (i.e. dominant-negative and gain-of-function) effects, ranks 12th overall, consistent with the previous observation that VEPs tend to perform worse on non-loss-of-function missense variants28,Features related to tolerance of complete loss-of-function variants (i.e., nonsense mutations that result in premature stop codons) were also important in our model. These include the recently introduced Shetconstraint metric (7th overall), which quantifies selective constraints on genes and helps prioritizing essential and disease-causing genes49, n_lof (9th overall), which reflects the number of distinct nonsense variants that have been observed in gnomAD normalised by the protein length, and exac_oe_lof (10th overall), which quantifies the ratio of observed to expected nonsense variants for each gene in the ExAC database. Similar to the missense metrics, the importance of these features shows that protein-coding genes that are less tolerant of protein null variants in the human population tend to be associated with better VEP performance.The presence of intrinsic disorder also appears to play an important role in determining VEP performance. The plddt_disorder feature, which ranks 5th overall, is derived from AlphaFold2 models, and represents the percentage of residues within a protein predicted to be disordered, based on having a predicted local distance difference test (pLDDT) score of less than 5050. pLDDT is a measure at the residue level scaled from 0 to 100 that quantifies AlphaFold2 confidence in the predicted local structure, based on the local distance difference test51. Generally, pLDDT inversely correlates very well with the flexibility of proteins structures such that AlphaFold2 assigns low pLDDT scores for regions that are highly flexible and lack definitive 3D structures such as intrinsically disordered proteins and regions and linkers between ordered regions50,52,53. In addition, mean_plddt, which represents the mean of the pLDDT across all residues, ranks 8th overall. For both features, the SHAP analysis indicates that proteins with more intrinsic disorder will tend to have higher AUROC values. Moreover, the 2nd overall ddG_fold feature is also likely to be closely related to intrinsic disorder, since mutations in the disordered regions of AlphaFold structures will inevitably tend to have very small ΔΔG values due to their lack of intramolecular contacts.Intrinsically disordered regions inflate the apparent performance of VEPsThe above analysis of features important for predicting AUROC leads to a somewhat counterintuitive result. We observed that proteins with greater constraint, i.e., those that are more conserved and less tolerant of missense and nonsense variants, tend to have higher AUROC values. It is also well known that intrinsically disordered regions tend to be less conserved, in general, than ordered regions, although highly conserved disordered regions do exist54. Thus, we might expect that proteins with more intrinsically disordered regions will tend to have lower AUROC values. However, we observe the opposite, with an apparent tendency for variants in proteins with more intrinsic disorder to be better predicted by VEPs.To address this further, we performed two related analyses. First, we split the 963 human protein-coding genes into two groups, calculated based on the median percentage of intrinsically disordered residues. We then compared VEP performance between these two groups (Fig. 5A). Across all VEPs, the proteins with higher disorder content have higher mean AUROC values, with a mean difference of 0.015 across all VEPs. Next, we calculated AUROC for all proteins with the variants at intrinsically disordered residues excluded and compared it to AUROC considered all variants (Fig. 5B). Similarly, across all VEPs, the median AUROC is reduced when intrinsically disordered variants are excluded. This suggests that it is not just that more disordered proteins tend to have higher AUROC values, but that the inclusion of variants at disordered residues in AUROC calculations leads to higher values.Fig. 5Influence of intrinsic disorder on AUROC of VEPs. (A) The distribution of AUROC values of 35 VEPs across genes with different intrinsically disordered content. The ‘High’ group contains genes with a percentage of predicted intrinsically disordered residues greater than the median across the human proteome (12.7%), while the ‘Low’ group contains genes with less than or equal to the median. (B) The distribution of AUROC values calculated with all variants and with variants in disordered regions excluded. Intrinsically disordered residues were defined as those having pLDDT < 0.5. Boxes represent the interquartile range (IQR), while whiskers show the range of data falling within 1.5 times the IQR. VEPs were sorted based on the difference of the median of AUC between high and Low disorder groups.The origin of this phenomenon appears to be related to the fact that intrinsically disordered regions are highly enriched in putatively benign missense variants, while pathogenic variants are much more likely to occur in ordered regions. Moreover, since intrinsically disordered regions tend to be much less evolutionarily conserved than ordered regions, and evolutionary conservation plays a key role in how nearly all VEPs work, variants in intrinsically disordered regions will tend to be “easy” for VEPs to classify as benign due to their low conservation, relative to benign variants in ordered regions. Thus, a greater fraction of benign variants will be correctly classified leading to a higher true negative rate.To illustrate, we consider the example of the transcription factor Pax6, which is mutated in aniridia and other genetic eye disease and contains a mix of both ordered and intrinsically disordered regions. The vast majority of known pathogenic missense variants occur in the paired or homeodomain DNA-binding domains as shown in Fig. 6A, B, which are folded and have experimentally determined structures available55. In contrast, most of the putatively benign gnomAD variants occur in the intrinsically disordered regions. If we use EVE to distinguish between pathogenic and putatively benign variants across the entire protein, we get an AUROC of 0.89. In contrast, if we exclude the intrinsically disordered regions, and thus only attempt to discriminate between pathogenic and putatively benign variants within the DNA-binding domains, the apparent performance is greatly reduced, with an AUROC of 0.75. This suggests that much of the apparent high predictive performance of VEPs for Pax6 comes from being able to predict that variants in the structured DNA-binding domains are more damaging than those in the disordered regions, which is a relatively simple task for most VEPs, because the DNA-binding domains are highly conserved. The EVE performance plots as measured by true-positive rate (sensitivity) and true-negative rate (specificity) for Pax6 are shown in Fig. 6C, D, considering all variants, and excluding disordered variants. We can see that, while the true positive rate is largely unaffected, the true negative rate is increased across thresholds when disordered regions are included, thus accounting for the higher AUROC. However, from a clinical perspective, the primary problem would be in distinguishing pathogenic from benign variants within the DNA binding domains where pathogenic mutations are known to occur, and so this higher AUROC value is not actually reflective of greater clinical utility. This suggests that researchers and clinicians should consider performance specifically in relevant domains or regions expected to be the sites pathogenic variants, rather than considering performance across an entire protein that includes poorly conserved disordered regions unlikely to be associated with pathogenicity.Fig. 6Influence of intrinsically disordered regions on VEP performance in Pax6. Structural location of pathogenic (red) and putatively benign (blue) missense variants on the AlphaFold2 predicted structure of Pax6, showing (A) those variants occurring in ordered regions (pLDDT > 0.5) and (B) those variants occurring in disordered regions. (C) True positive rate (sensitivity) of EVE for predicting pathogenic missense mutations across different thresholds when considering all variants, and when variants at disordered positions are excluded. (D) True negative rate (specificity) of predicting putatively benign missense variants across different thresholds with and without disordered variants included.