Decoding Missense Variants by Incorporating Phase Separation via Machine Learning

Collection of phase separation-related missense variantsTo investigate relationships between missense mutations and protein phase separation (PS) properties, we reviewed missense variants with altered PS propensity that were documented in PhaSepDB47 and LLPSDB48,49 databases. To minimize the noise effect caused by multiple mutations within a single sequence, we narrowed our selection to mutation records with a limited number of mutations in the sequence that alter the normal PS threshold (annotated as ‘Impact’ mutations). Examples include P22L in Ape1, which solidifies semi-liquid Ape1 Droplets50, and S48E in TDP-43, which disrupts PS51. We limited our analysis to variants influencing proteins’ spontaneous PS, excluding partner-dependent PS.Our compilation yielded a list of 307 experimentally validated ‘Impact’ mutation records from 70 proteins (Supplementary Fig. 1a and Supplementary Data 1), including 79 that strengthened the PS properties (annotated as ‘Strengthen’) and 228 that weakened or disabled them (annotated as ‘Weaken/Disable’). The PScore35 and PhaSePred43 scores indicate that the proteins from which these missense variants originate are predominantly proteins undergoing PS spontaneously, as the PS propensity scores predicted for these proteins are significantly higher compared to those of the human proteome (Fig. 2a).Fig. 2: Analyses of mutations that impact phase separation (PS).a Comparison of the PS propensity of proteins corresponding to collected mutations (70 proteins) with that of the human proteome. (PScore35, Left; PhaSePred-SaPS43, Right; ****P < 0.0001, two-sided Mann–Whitney U test, p = 4.6e-11 and 3.3e-14, respectively; the boxplot components within each violin, from top to bottom are maxima, upper quartile, median, lower quartile, and minima.). b The proportion of ‘Impact’ mutations (Left) located in IDRs and Domains, compared with the total proportion of IDRs and Domains (Right). c The top 30 high-frequency mutations among collected ‘Impact’ mutations. d Distribution of amino acid (AA) distances from each mutation site to the nearest domain boundary. Distances of ‘Impact’ mutations and random ‘Background’ positions were compared within Domains (Left) and within IDRs (Right) (The number of data points were 139, 1000, 202, and 1000, respectively; ****P < 0.0001, two-sided Mann–Whitney U test, p = 4.4e-40 and 1.4e-30, respectively; the boxplot components within each violin plot from top to bottom are maxima, upper quartile, median, lower quartile, and minima). e Distribution of eight pi-contact prediction values (PPVs) for mutation sites. Values of ‘Impact’ mutations (in red) and ‘Background’ mutations (in gray) were compared. The dot in each violin represents the average of values. (NS not significant, **P < 0.01, ****P < 0.0001, two-sample Kolmogorov–Smirnov test; P-values are 0.0029, 0.140, 5.9e-11, 1.2e-7, 0.106, 3.3e-8, 4.5e-6, and 4.1e-14, respectively). f Statistical comparison of the changes of AA property index before and after mutation between collected ‘Strengthen’ (n = 79, orange) and ‘Weaken/Disable’ groups (n = 228, blue) under two-sample Kolmogorov–Smirnov D test (WT wild-type AA, MT mutant AA). The direction of the D statistic was set as positive when the mean value of the ‘Strengthen’ group was higher and as negative when that of the ‘Weaken’ group was higher. Source data are provided as a Source Data file.Properties of variants impacting phase separationWe observed that mutations impacting PS (annotated as ‘Impact’ mutations) are predominantly located in intrinsically disordered regions (IDRs) rather than structured domain areas (Domains) (Fig. 2b). Existing variant effect prediction models primarily rely on evolutionary features28,30,31,52 and structural features32,53,54,55. However, these features are inadequate for representing variants in IDRs and their effects on PS. We found that current advanced pathogenicity prediction models, such as AlphaMissense32 and EVE31, exhibit higher uncertainty of predictions for variants in IDRs than in Domains (Supplementary Fig. 1d). Moreover, they both fail to distinguish ‘Impact’ mutations from random background mutations (Supplementary Fig. 1c). This underscores the urgent need for developing specialized models to predict the effects of mutations on PS.Among these ‘Impact’ mutations, those involving serine, tyrosine, arginine, lysine, and glutamine are the most prevalent (Fig. 2c and Supplementary Fig. 5a). These residues play key roles in PS. Specifically, tyrosine and arginine residues are important PS drivers, with cation-pi interactions between them serving as crucial forces for PS37. Although lysine has a lower ability to form pi-contacts compared to arginine, its interaction with nucleotide is essential for the PS of specific proteins such as tau and DDX3X56,57,58. Additionally, glutamine and serine are vital for cross-beta sheet interaction, which can enhance the propensity of PS37.Interestingly, our analysis revealed that ‘Impact’ mutations tend to occur near domain boundaries. Specifically, the amino acid (AA) distance from each mutation site to its closest domain boundary was calculated. We noted that, whether within Domains or IDRs, ‘Impact’ mutations are significantly closer to these boundaries compared to random mutation sites (P ≤ 0.0001, Fig. 2d and Supplementary Fig. 5b). Furthermore, we observed that pathogenic mutants predicted by AlphaMissense32 also demonstrated shorter distances to domain boundaries compared to those predicted as benign in both Domains and IDRs (P ≤ 0.0001, Supplementary Fig. 1e).In addition, we assessed pi-contact values at mutation sites using the prediction function provided by PScore35, as pi-pi interactions are vital in determining PS. For 6 out of the 8 predicted pi-contact values, ‘Impact’ mutations have higher predicted values than random mutation sites (Fig. 2e). This suggests that mutations affecting PS are likely to occur at sites with high pi-contact frequencies.Next, we compared the changes in AA properties before and after mutation between the ‘Strengthen’ and ‘Weaken/Disable’ groups. We observed that missense mutations that strengthen PS propensity are prone to have higher AA mass (P = 0.0097), hydrophobicity (P = 0.0024), and decreased polarity (P < 0.0001) compared to ‘Weaken/Disable’ mutations (Fig. 2f and Supplementary Fig. 1f). Notably, the increase in hydrophobicity supports the previous discovery that hydrophobicity serves as an important PS driver38,40.Together, these findings show the distinct properties of missense mutations impacting PS and the varying characteristics between mutations that either strengthen or weaken PS. Based on these properties, we have developed tools to predict the impact of mutations on PS, which will be discussed in the following section.PSMutPred for predicting the impact of missense mutation on phase separation tendencyTo predict the impact of missense mutations on protein PS properties, we developed PSMutPred composed of two machine learning (ML) approaches. The first approach termed the ‘Impact Prediction’ (IP) task, trained ML models to predict missense mutations that impact PS. In the second approach termed the ‘Strengthen/Weaken Prediction’ (SP) task, ML models were trained to predict the direction of the shifts in the normal cellular PS threshold induced by ‘Impact’ mutations (Methods). The process is depicted in Fig. 1, indicated by the green section.To encode such changes caused by missense mutations for quantification and model learning, we considered the physicochemical properties of both the wild-type AA and mutant AA at each mutation site, as well as the properties of AAs within IDRs. We applied pi-pi contact frequency35 to encode the mutation site’s underlying significance for PS. Additionally, features such as IUPred59 score and mutation distance to domain boundary were used to encode the position of these mutations. Each mutation sample was converted into a 39-dimension feature vector (“Methods” section). Neither MSA features nor structural features widely adopted in existing mutation-related machine learning prediction28,29,30,31,52,53,60 were considered due to the nature of IDRs.Due to the limited availability of variants that have no PS effect, for each of the 70 proteins, we randomly generated 500 single AA variants, resulting in a total of 35,000 random ‘Background’ mutations, which were further used as negative samples (i.e., ‘Background’ mutations) for analyses (“Methods” section). The dataset of missense mutations was divided into a cross-validation dataset (246 ‘Impact’ samples and 23,500 ‘Background’ samples from 47 proteins) and an independent test set (61 ‘Impact’ samples and 11,500 ‘Background’ samples from 23 proteins) (Methods). We trained prediction models including Logistic Regression (LR), Random Forest (RF), and Support Vector Regression (SVR) for both tasks to explore the discriminative power of predicting missense mutations’ effect on PS.Performance evaluation of PSMutPredAlthough there is currently no computational algorithm for missense mutations on predicting the effect on PS, we attempted to determine if existing PS prediction methods are capable of discerning alterations in PS propensity caused by missense mutations. Five high-performing PS methods were selected, including DeePhase39, PSAP61, PScore35, catGRANULE38, and FuzDrop62. We found that the prediction score differences of ‘Impact’ mutations were higher than those of ‘Background’ mutations, by comparing the absolute differences of prediction scores (Fig. 3a). However, the area under the curve of the receiving operating characteristics (AUROCs) were unsatisfactory (Fig. 3b). Except for FuzDrop62, none of the methods could differentiate between ‘Strengthen’ mutations and ‘Weaken’ mutations (Supplementary Fig. 2c, d), as their prediction scores did not accurately reflect the increase or decrease of PS propensity caused by mutations.Fig. 3: Evaluation of methods’ performance on predicting missense mutations that impact natural phase separation (PS) (‘Impact’ mutation).Methods with abbreviations include DeePha (DeePhase), FuzDro (FuzDrop), and catGRA (catGRANULE). a Discriminative power evaluation of representative PS prediction methods for ‘Impact’ mutations against random ‘Background’ mutations, comparing absolute score changes pre- and post-mutation (P-values computed by a two-sided Mann–Whitney U test, left for ‘Impact’ mutations, n = 307 and right for ‘Background’ mutations n = 35,000; the boxplot components within each violin plot, from top to bottom are maxima, upper quartile, median, lower quartile, and minima.). b Performance evaluation of representative PS prediction methods on discerning ‘Impact’ mutations against random ‘Background’ mutations (IP task). AUROC is based on the absolute score changes. c–h Model performance evaluation in identifying ‘Impact’ mutations. For LOSO, 50 replicates of subset sampling from the background dataset were used to evaluate performance, and the average AUROC and the area under the curve of the precision-recall curve (PRC) (AUPR) were computed and visualized. For LOSO AUPR, data are presented as mean values ± SD (Standard Deviation), and the scatter points represent the distribution of background dataset sampling repeats. c, d Model performance in identifying ‘Impact’ mutations evaluated using leave-one-source-out (LOSO, Left) and an independent test set (Right), measured by AUROC (c) and AUPR (d). e, f A parallel evaluation similar to (c and d) but the ‘Background’ mutations were generated following the same IDRs: Domains ratio as the collected ‘Impact’ samples (weighted sampling). (g, h) A parallel evaluation similar to (c, d) but the ‘Background’ mutations were generated by aligning the frequency of different mutations with their frequency in the impact dataset (AA weighted sampling). We assigned weights to each type of mutation based on the number of occurrences in the impact dataset, with a minimum weight of 1 to ensure all mutation types are considered. Source data are provided as a Source Data file.To test whether PSMutPred-IP can identify mutations that impact PS, we implemented leave-one-source-out cross-validation (LOSO CV) (Method). LOSO is used to evaluate the models’ predictive capabilities for variants from unseen proteins, ensuring their generalizability. AUROCs and AUPRs on LOSO (Fig. 3c, d) showed the accuracy of our models, especially IP-SVR and IP-RF. Evaluation results on the independent test set showed that our models have stable prediction performance when facing mutations from different PS protein categories (Fig. 3c, d).To ensure that the superior performance of the algorithm was not caused by the distribution bias of ‘Impact’ mutations, which tend to be located in IDRs (Fig. 2b, d), we conducted an additional LOSO CV and an independent validation for the IP task. Here we generated ‘Background’ samples maintaining the same IDRs: Domains ratio as observed in the ‘Impact’ samples. LOSO CV results and independent validation results (Fig. 3e, f) were as promising as those from the previous dataset (Fig. 3c, d). We also created another ‘Background’ dataset by aligning the AA substitution frequencies with those in the impact dataset. This was done to test whether the predictive power was due to the high ratio of specific AA types in ‘Impact’ samples (Fig. 3g, h).Moreover, a small proportion of ‘Impact’ mutations are multi-point mutations (Supplementary Fig. 1b), the proposed models still exhibited efficient predictive power and outperformed representative PS models when analyzing the results without considering multi-point mutations (Supplementary Fig. 2a, b). We also assess the performance of PSMutPred-IP by only including naturally occurring mutations in the database, and the model allowed efficient predictive power on these mutations (Supplementary Fig. 5d, e).We next investigated the performance of PSMutPred-SP models. The significant discriminative power to distinguish between ‘Strengthen’ samples and ‘Weaken’ mutation samples from unseen proteins under LOSO CV (SP-LR, P < 0.0001) (Supplementary Fig. 2e, f) and AUROCs on the independent test set (Supplementary Fig. 2g) indicated that SP-LR, SP-RF models can identify the direction of the shifts in PS caused by missense mutations.Feature importance indices were calculated for random forest models of both tasks and grouped into feature types to discover potentially key features (Supplementary Fig. 2h). In both tasks, the pi-contact frequency of the mutation site ranked first, suggesting that the pi-pi interaction at the mutation’s specific location significantly influences the effect of mutations on protein phase separation. Beyond Pi-contact, no single feature was identified as a dominating one, suggesting the non-linear nature among features and a multi-factor causal relationship between features and the PS outcome.Overall, PSMutPred models not only predict missense mutations that impact PS but also predict the direction of the PS-threshold shift. Our final model generates predictions made by the IP-SVR, IP-RF, IP-LR, SP-LR, and SP-RF models as well as their corresponding rank scores (Methods). These models can be employed as effective tools for assessing the tendency of missense mutations to affect PS, enhancing the interpretation of disease variants’ pathogenicity.Experimental validation on the PS-related mutations identified by PSMutPredAberrant phase separation (PS) or aggregation processes have been implicated in the pathogenesis of various diseases, including neurodegenerative disorders, autism7, and hearing loss10,11,12. In this study, we employed PSMutPred to predict the PS impact (‘Impact prediction’ and ‘Strengthen/Weaken prediction’) of missense variants from genes associated with PS-related diseases. Specifically, we selected EPS8, known for its association with deafness, and analyzed the PS impact of its ‘Uncertain’ missense mutations from the ClinVar63,64 database to assess the accuracy of our prediction model.Epidermal growth factor receptor pathway substrate 8 (EPS8) is a multifunctional protein involved in cell mitosis and differentiation65,66,67,68, in capping proteins through side-binding, in bundling of actin filaments69,70,71 as well as in the elongation of actin in hair cell stereocilia65. Prior research has shown that EPS8 localizes to the tips of stereocilia and contributes to the formation of PS-mediated condensates at the stereocilia tip complex11,72. These findings suggest that EPS8 has the capacity for self-phase separation and to interact with other molecules.To validate our predictions, we selected missense mutations, including R265C, D586G, and K676R, from the 8 candidate mutations. These candidates were predicted by PSMutPred-IP to impact PS, either by strengthening or weakening it, with rank scores above 0.5 across all models (IP-SVR, IP-RF, and IP-LR). Among these candidates, D586G had the highest scores for both SP-LR and SP-RF, suggesting a stronger propensity to enhance PS, while K676R scored the lowest for both metrics implying an impairment of PS capacity. As negative controls, we selected R702W and E728V from the 15 candidate mutations (all with IP rank scores below 0.5 for IP-LR, IP-SVR, and IP-RF; at least one had a score below 0.1).For experimental validation, we overexpressed the wild-type and mutants of mouse Eps8 (which is highly conserved with human EPS8, as shown in Supplementary Fig. 3d) fused with a GFP tag in HEK293 cells. Observations made using Olympus fluorescence microscopy highlighted distinct changes in puncta formation quantity to evaluate PS capacity. Specifically, R265C and D585G exhibited a notable increase in the number of puncta, while K675R showed a significant reduction (Fig. 4a, b and Supplementary Fig. 3a, b). Notably, D586G (equivalent to D585G in mice) demonstrated enhanced PS capacity, aligning with its high SP-RF and SP-LR scores. In contrast, K676R (equivalent to K675R in mice) showed diminished PS ability, supported by its low SP-RF and SP-LR scores. To confirm that the observed puncta are indeed a result of liquid-liquid phase separation (LLPS), we also conducted a fluorescence recovery experiment after a photobleaching (FRAP) experiment to validate the dynamic and rapid formation of droplets observed of both wild-type and mutant (D585G) of mouse EPS8 in HEK293 cells (Supplementary Fig. 4h–j).Fig. 4: Experimental validation of Eps8 missense mutations predicted by PSMutPred to impact PS.a Representative images of overexpressed GFP-Eps8 and its mutants in HEK293 cells (scale bars: 10 μm; n = 10 randomly picked cells). WT denotes wild type. b Quantification of puncta within the wild type and mutants of Eps8 in HEK293 cells (n = 10 randomly picked cells; ****P < 0.0001 by two-tailed Student’s t test, p = 8.1e-9 and 4.0e-10, respectively). Error bars represent SD, and center lines represent mean values. c Ribbon diagram representation of mouse EPS8 structure predicted by AlphaFold2, showing both front (left) and back (right) views. d–f Detailed regions involving missense mutations with their neighboring residues (d, f), and interaction analysis (e). The mutations are shown with the stick mode in red while hydrogen bonds are shown as blue dashed lines. Sequence alignments within critical residues are shown in bold. Source data are provided as a Source Data file.Eps8 comprises six domains, encompassing both well-structured regions and intrinsically disordered regions (IDRs) (Supplementary Fig. 3c). D585 in mice (equivalent to D586 in humans) is situated in the SH3 domain. It forms stable hydrogen bonds with the sidechains or backbones of K536 and K538 in mice (equivalent to K537 and K539 in humans), maintaining structural stability through either a β-sheet-loop conformation in mice or β-sheet-β-sheet conformation in humans (Fig. 4d, e and Supplementary Fig. 3e, f). Additionally, D585 and its neighboring residues, as well as interaction residues, exhibit high conservation. However, the substitution of glycine for aspartate disrupts the formation of these hydrogen bonds, leading to a destabilized structure. This change might enhance PS by increasing flexibility and modifying the β-sheet-loop structure. Conversely, K675 in mice (equivalent to K676 in humans) is located in an IDR at the C-terminus of EPS8 (Fig. 4c and Supplementary Fig. 3d). Sequence alignment underscores the high conservation of K675 and its adjacent residues (Fig. 4f and Supplementary Fig. 3g). Lysine’s characteristic long and hydrophobic side chain increases the likelihood of extending and ‘capturing’ polar residues from self-proteins or other proteins. This substitution may alter these interactions, potentially impairing PS.In summary, our study demonstrates that the changes in PS propensity, as predicted by our model, accurately correspond to a distinct number of puncta formed in live cells. This serves to underscore the utility and accuracy of PSMutPred.Phase separation effect prediction of disease variants by PSMutPredPhase separation turns out to be a general mechanism for protein condensate assembly forming as membrane-less organelles in various physiological processes9,73,74,75,76,77,78,79,80,81,82. Mutations that change phase separation (PS) are likely causes of disease13,22,23,83. To exam this theory, we divided missense variants corresponding to 8611 human genes from ClinVar63,64 (522,016 variants in total, downloaded in 2022.12) into PS-prone and low-PS-prone propensity groups for between-group analyses (Methods).By comparing the PS-prone group defined by 83 PS ClinVar proteins with the low-PS-prone group (other proteins), we found that in the PS-prone group, the PSMutPred-IP scores of variants (n = 1451) were skewed to pathogenicity compared to those from low-PS-prone group (n = 86,291) (Fig. 5a). To make the sizes of the two groups more comparable and improve the efficiency of the comparison, we also defined the PS-prone group by collecting PS proteins predicted by PScore35 and a meta-predictor PhaSePred43 (1276 proteins, 30,889 variants) and grouped variants from other proteins as low-PS-prone group (56,853 variants). The between-group analysis showed consistent results (Fig. 5b). However, when conducting the between-group analysis using representative PS prediction methods trained on phase-separation proteins, the outcomes for DeePhase39 and catGRANULE38 displayed inconsistencies across the analyses (Supplementary Fig. 6a, b). For the other three methods, which include PSAP61, PScore35, and FuzDrop62, the PS-prone group did not demonstrate significantly higher Pearson correlation coefficients than the low-PS-prone group (Supplementary Fig. 6a, b).Fig. 5: Evaluation of PSMutPred scores across ClinVar64 variants.a–d Comparison of variants’ Pearson correlation between groups. Groups include a PS-prone group (83 known PS proteins, 1451 variants) and a low-PS-prone group (8528 proteins, 84,840 variants) defined by PS proteins, and a predicted PS-prone group (1276 proteins, 30,889 variants) and a predicted low-PS-prone group (7335 proteins, 56,853 variants). (two-tailed P-values computed by sci-kit learn pearsonr package; *P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001; NS = no significance). a Comparison between the PS-prone group and the low-PS-prone group. P-values are 1.7e-8, 5.8e-8, 2.6e-4, 3.4e-5, 2.4e-14, and 1.2e-275 respectively. b Comparison between the predicted PS-prone group and the predicted low-PS-prone group. P-values are 9.9e-82, 0.01, 4.9e-53, 0.02, 3.6e-198, and 7.7e-223 respectively. c Comparison between variants located in IDRs (n = 15,427) and Domains (n = 15,462) within the predicted PS-prone group. P-values are 6.4e-61, 2.8e-13, 1.7e-23, 6.0e-37, 1.5e-149, and 1.4e-107 respectively. d Comparison between variants from neurodegenerative disease (ND) related proteins (19 proteins, n = 252) and variants from other proteins (non-ND) (within the predicted PS-prone group). P-values are 0.88, 3.1e-8, 0.005, and 2.6e-95 respectively. e AUROC scores of PSMutPred-IP models on pathogenicity prediction of IDR missense variants from the PS-prone group (n = 489 variants). f A parallel evaluation of (e) but focuses on the predicted PS-prone group (n = 8188). g Comparison of the proportion values defined by different PSMutPred-IP models, including IP-RF (top), IP-LR (middle), and IP-SVR (bottom). Comparison of the PS-prone group and the low-PS-prone group on the left (PS proteins), and comparison between the predicted PS-prone group and the predicted low-PS-prone group (Predicted-PS proteins). Differences are based on 2-sample Kolmogorov’s D statistic, with positive values indicating higher proportions in the PS-prone group and negative values indicating higher proportions in another. Source data are provided as a Source Data file.IDRs and structured domains are both crucial for phase separation6. When analyzing the PS-prone group defined by algorithms, we found that PSMutPred-IP, along with other sequence-based PS metrics (except PSAP), showed a significant positive correlation with the pathogenicity of variants in both IDRs (n = 15,427) and domains (n = 15,462) (Fig. 5c and supplementary Fig. 6c).Additionally, we isolated variants likely in IDRs (unmapped by PfamScan84,85, with IUPred359 score > 0.5) from the PS-prone group and directly used PhaSePred-IP scores to predict pathogenicity ((Likely) pathogenic as 1 and (Likely) benign as 0). Separate analyses were conducted for the PS-prone group as defined by PS proteins and by algorithms (Fig. 5e, f). The resulting AUROC scores indicated that PSMutPred-IP can identify disease variants that lead to PS alterations.Studies indicate that neurodegenerative lesions may be associated with excessive PS25,86. Within the predicted PS-prone group, 19 proteins were identified to be highly associated with neurodegenerative disease (Methods). SP prediction scores for the variants (n = 252) of these proteins showed different patterns compared to other variants (n = 19,266). Specifically, among these 252 variants, those predicted to strengthen PS were more inclined towards pathogenicity compared to those predicted to weaken PS. This observation was reflected by the positive Pearson coefficient values for both SP-LR and SP-RF models, contrasting with the negative coefficient values for variants from other PS proteins (Fig. 5d). catGRANULE38 and FuzDrop62 also demonstrate a higher Pearson correlation coefficient than the overall pattern (Supplementary Fig. 6d).Additionally, we observed that mutations within IDRs of PS proteins are more likely to affect PS than those in proteins less likely to undergo PS. This conclusion was based on evaluating the proportion of variants of ‘Uncertain Significance’ (VUSs) predicted to impact PS for each protein (see Methods). Using Kolmogorov’s D statistic to compare these proportions, we observed that the PS-prone group had a higher incidence of PS-affecting variants than the low-PS-prone group, as shown in Fig. 5g. We found that only for PSAP61 and PScore35, PS-prone proteins have a higher proportion of missense mutations predicted to ‘Impact’ PS (Supplementary Fig. 6e).In summary, our comprehensive analyses revealed a notable clustering of pathogenic missense mutations with an impact on PS propensity in proteins with inherently higher PS propensity. Additionally, our findings indicate that compared to mutations that might weaken PS, gain-of-PS mutations tend to aggregate specifically in disease mutations associated with neurodegenerative-related genes. These results not only provide valuable insights into the relationship between PS and disease but also underscore the reliability and validity of PSMutPred.Introducing a feature for the pathogenicity prediction of disease variantsCurrent variant interpretation methods heavily rely on evolutionary features28,30,31,52,87 generated by multiple sequence alignment (MSA). As IDRs, especially those with poor evolutionary conservation2, challenge the effectiveness of traditional features7, using two representative pathogenicity prediction models, including EVE31 and ESM1b46,88, we aim to test whether PS-related features can address the IDR gaps left by evolutionary features in pathogenicity prediction (Fig. 1, orange section). EVE predicts pathogenic variants by thoroughly leveraging MSA information, demonstrating that this feature alone can predict the impact of most known disease mutations, proving the dominant role of the evolutionary feature in this field. ESM1b utilizes large language models (LLM) to learn protein information across species, modeling the space of known protein sequences selected throughout evolution, and can thus be considered an advanced representation of evolutionary features.We first selected EVE and applied a straightforward approach, that combined the unsupervised EVE score with a simple feature group, including PSMutPred scores as the variant feature (Methods). We then trained three models including RF, SVR, and LR, using ClinVar64 significances as labels. Their performances were evaluated using both blocked 3-fold cross-validation and an independent test set (Supplementary Data 3) (Methods). We observed that all three combined models demonstrated improved pathogenicity prediction, with RF showing the best performance in terms of AUROC and AUPR scores (Supplementary Fig. 4a, b, and Supplementary Table 1). We subsequently selected the RF model as the combined model for further analysis, based on validation on the independent test set.Given that IDRs typically exhibit poorer evolutionary conservation than Domains7, it is unsurprising that we found EVE to be less effective in predicting IDR variants compared to Domain variants (Fig. 6a, b and Supplementary Table 1). As expected, the combined model led to a more pronounced improvement in identifying IDR disease variants (n = 5656) than those within Domains (n = 9738). Specifically, the combined model showed a 4.3% improvement in AUROC and a 7.6% improvement in AUPR for IDR variants, compared to a 2.6% AUROC improvement and only a 1.7% AUPR rise for Domain variants (Fig. 6a, b and Supplementary Table 1). The Mann–Whitney test further indicated a significant improvement in the prediction of both pathogenic variants as well as benign variants in IDRs (Fig. 6c, d). Additionally, we consider that IDRs include a small subset of potentially conditionally folded IDRs with high evolutionary conservation4, characterized by high AlphaFold23-predicted confidence scores (pLDDT scores). For the 5656 IDR mutations in the test set, we mapped their pLDDT scores and categorized them into a high pLDDT group (pLDDT ≥ 70) and a low pLDDT group (pLDDT < 50). We found a more pronounced improvement in AUPR for IDR variants with low pLDDT scores compared to those with high pLDDT scores (9.8% compared to 3.9% AUPR improvement, Fig. 6e). This indicates that current PS-related features can supplement the evolutionary feature’s weakness in IDRs, especially for low conservation IDRs.Fig. 6: Analysis of phase separation-related feature contributions to pathogenicity prediction.a–e Pathogenicity prediction performance evaluation of the model combining EVE with PS-related features. a AUROC (Left) and AUPR (Right) evaluations on the independent test set (n = 15,394). The purple line represents the model trained with both EVE and PS features; the green line represents the EVE score alone. b AUROC (Left) and AUPR (Right) evaluations specifically on variants within IDRs from the data set analyzed in (a) (n = 5656). c, d The divergence of predicted scores distributions between the standalone EVE (green) and the combined model (purple), quantified using a two-sided Mann–Whitney U test on the independent test set (****P < 0.0001; P-values are 2.4e-27; 1.7e-15, 9.6e-59; and 7.2e-293 respectively, the boxplot components within each violin plot, from top to bottom are maxima, upper quartile, median, lower quartile, and minima.). c Score distributions for pathogenic-prone variants (pathogenic and likely pathogenic, n = 2044, left graph) and benign-prone variants (benign and likely benign, n = 3612, right graph) with a focus on variants located in IDRs. d A parallel evaluation of (c) but focusing on variants located in Domains (6665 pathogenic or likely pathogenic and 3073 benign or likely benign). e Evaluation of IDRs variants with high AlphaFold2 pLDDT scores (pLDDT ≥ 70, n = 2763) and low pLDDT scores (pLDDT < 50, n = 2407). f–i Pathogenicity prediction performance evaluation of the model combining ESM1b with PS-related features. f Evaluation of the model trained with ESM1b and PS features using 5-fold cross-validation under the ClinVar dataset (n = 140,321). g Evaluation of IDRs variants with high AlphaFold2 pLDDT scores (pLDDT ≥ 70, n = 36,032) and low pLDDT scores (pLDDT < 50, n = 25,755). h, i Pathogenicity prediction for 1,015,769 ClinVar VUSs by combining PS features with ESM1b scores. Source data are provided as a Source Data file.To investigate the role of PSMutPred in pathogenicity prediction, we examined the effect of including PSMutPred scores within various feature combinations, and their performance on the test set was quantified by AUROC and AUPR (Supplementary Fig. 4c). The comparative analysis highlighted a significant enhancement in the prediction accuracy when PSMutPred scores were incorporated. Next, we analyzed ClinVar mutations predicted by EVE with opposite outcomes (pathogenic/likely pathogenic mutations with EVE scores < 0.5 and benign/likely benign mutations with EVE scores ≥ 0.5) to evaluate the PSMutPred’s discriminative power on these mutations. We focused on mutations within disordered regions of potential phase-separating proteins (n = 600, Supplementary Fig. 4d). PSMutPred scores for false negatives of EVE are significantly higher than those for false positives of EVE, with P-values of 1.3e-4, 9.7e-5, and 9e-3 for IP-RF, IP-SVR and IP-LR, respectively, as evaluated by the Mann–Whitney test. This shows that PSMutPred can capture information of mutants where evolutionary features failed.Subsequently, similar to the process of appending features to EVE, we combined PS-related features with the ESM1b score, which can be considered as an advanced representation of evolutionary features. We mapped ESM1b scores to 140,320 ClinVar variants (Supplementary Data 3) and tested the combined RF model using a blocked 5-fold cross-validation (Methods). The combined model also achieved improved prediction accuracy (Fig. 6f), especially for variants located in IDRs with low pLDDT scores (Fig. 6g). This consistent result indicates that PS-related features can address the weaknesses of evolutionary features in predicting IDR variants, particularly those in IDRs with low conservation. Using the PS features, combined with ESM1b scores, we predicted pathogenicity scores for 1,015,769 ClinVar VUSs (Supplementary Data 4) (“Methods” section). Among them, 527,524 are IDR variants (Fig. 6h), 9.3% of them were predicted pathogenic, and 78.4% were predicted benign (Fig. 6i).We chose EVE, ESM1b due to their unsupervised nature, which can offer an unbiased baseline, making testing with our features straightforward. These findings reveal that the PS-related features including variants’ impact on PS serve as a valuable encoding for IDR mutations and can be integrated into pathogenicity prediction models in the future to provide a better interpretation of pathogenicity variants.

Hot Topics

Related Articles