Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations

Designing a graph-transformer framework for structure-based protein engineering

In prior work, we experimentally showed that representations learned by self-supervised deep learning models on masked microenvironments (MutCompute51,52 and MutComputeX53) can identify amino acid residues that are incongruent with their surrounding chemistry. These models can be used to "zero-shot" predict gain-of-function point mutations51,52,53,54,55, including in protein active sites of computational structures53. Self-supervised models, however, generate mutational designs that are not biased towards a particular phenotype and do not update predictions based on experimental data19.

The MutCompute framework uses a voxelized molecular representation51,53. For protein structures, voxelization is a suboptimal molecular representation: most voxels consist of empty space, and rotational invariance is not encoded. Furthermore, the MutCompute frameworks use convolution-based architectures, which lag behind modern attention-based architectures in representation learning and predictive power.

To develop a more powerful and generalizable framework for downstream tasks, we first built MutComputeXGT, a graph-transformer version of MutComputeX (Fig. 1a). Each atom is represented as a node, with atomic element, partial charge56, and SASA57 values as features and pairwise atomic distances labeling the edges. Our graph-transformer architecture converts the pairwise distances into continuous and categorical attention biases that provide a structure-based inductive bias for the attention mechanism. To generate likelihoods for the masked amino acid, we average the final-layer hidden representations of all atomic tokens within 8 Å of the masked Cα. The decision to restrict pooling to atoms in the first contact shell of the masked amino acid is based on insights from systematic variation of the microenvironment volume when training self-supervised 3DCNNs58. With a similar number of parameters and the same train-test splits, MutComputeXGT demonstrates greater representation learning capacity than MutComputeX, achieving a wild-type accuracy of 92.98% ± 0.26% compared to ~85% (ref. 53).
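The two architectural ideas above can be summarized in a short, hedged sketch. This is illustrative PyTorch pseudocode, not the released MutComputeXGT implementation: the module and argument names (DistanceBiasedAttention, n_dist_bins, the 1 Å bin width, the form of `classifier`) are our assumptions; only the distance-to-attention-bias conversion and the 8 Å Cα-shell pooling follow the description above.

```python
import torch
import torch.nn as nn

class DistanceBiasedAttention(nn.Module):
    """One attention layer whose logits are biased by pairwise atomic distances."""
    def __init__(self, dim: int, n_heads: int, n_dist_bins: int = 16):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        self.cont_bias = nn.Linear(1, n_heads)               # continuous bias from raw distance
        self.cat_bias = nn.Embedding(n_dist_bins, n_heads)   # categorical bias from binned distance
        self.n_dist_bins = n_dist_bins

    def forward(self, x, dist):
        # x: (B, N, dim) atom tokens; dist: (B, N, N) pairwise distances in Angstroms
        B, N, dim = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        bins = torch.clamp(dist.long(), max=self.n_dist_bins - 1)       # assumed 1 A bins
        bias = self.cont_bias(dist.unsqueeze(-1)) + self.cat_bias(bins)  # (B, N, N, heads)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + bias.permute(0, 3, 1, 2)
        ctx = (logits.softmax(-1) @ v).transpose(1, 2).reshape(B, N, dim)
        return self.out(ctx)

def masked_aa_likelihoods(final_hidden, dist_to_masked_ca, classifier, cutoff=8.0):
    """Mean-pool final-layer atom tokens within `cutoff` A of the masked C-alpha,
    then decode the pooled vector into a 20-way amino acid distribution."""
    shell = (dist_to_masked_ca <= cutoff).unsqueeze(-1)       # (B, N, 1) first contact shell
    pooled = (final_hidden * shell).sum(1) / shell.sum(1).clamp(min=1)
    return classifier(pooled).softmax(-1)                     # (B, 20) likelihoods
```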
Fig. 1: Overview of the Stability Oracle framework. a Self-supervised pre-training graph-transformer architecture (MutComputeXGT). b Fine-tuning of the pre-trained graph-transformer backbone for stability regression (Stability Oracle). In the regression head, we represent a mutation with "FromAA" and "ToAA" CLS tokens, which are the structural amino acid embeddings for the corresponding amino acids. c, d demonstrate how Stability Oracle combines structural amino acid embeddings and one masked microenvironment to generate thermodynamic permutation (TP) augmentation mutation inputs. Here, ΔΔG measurements at PDB:5UCE W43 (yellow transparent spheres) for mutations to both LEU and ARG enable the generation of the TP mutations c from LEU to ARG and d from ARG to LEU by simply swapping the order of the structural amino acid embeddings provided to the regression head. A diagram further describing TP is provided in Supplementary Fig. 2.

The Stability Oracle architecture makes use of both the feature extractor and the classification head of MutComputeXGT for supervised fine-tuning (Fig. 1b). Previous structure-based stability predictors30,31,32,34,59,60 require two structures (either experimental or computational) to explicitly model the wild-type and mutant amino acids. This second (mutant) structure is typically obtained using computational techniques such as AlphaFold or Rosetta. This approach has two drawbacks: (1) computational methods become expensive at inference time (as we describe below), and (2) it is difficult to evaluate the quality of computationally derived mutant structures. In contrast, Stability Oracle does not rely on a second structure. More specifically, structural features from the local chemistry surrounding a particular residue (the masked microenvironment) are extracted from a single initial structure, and a mutation is represented as a pair of "from" and "to" amino acid embedding vectors. To model the ΔΔG of a specific mutation, the microenvironment of the initial structure is used to contextualize the "from" and "to" amino acid embeddings in the regression head (as illustrated in Fig. 1b). This architectural design allows the framework to implicitly learn how "from" and "to" amino acids interact with the local chemistry, rather than relying on a computational structure prediction tool to provide chemical interactions. For a typical 300 amino acid protein, prior work would generate 5700 computational mutant structures (from Rosetta34 or AlphaFold22) in order to predict the ΔΔG of every possible single-point mutation during inference. Stability Oracle, on the other hand, requires only one structure to predict the ΔΔG for all 19 amino acid substitutions at every residue (~50 ms/residue). Runtime performance metrics on proteins of varying lengths are provided in Supplementary Table 1.

The "from" and "to" amino acid embeddings are derived from the weights of the final layer of the MutComputeXGT classifier. This design decision is based on the insight that the weights of these 20 neurons represent the similarity of a microenvironment's features to each of the 20 amino acids prior to being normalized into a likelihood distribution. They are thus structure-based contextualized embeddings of the 20 amino acids, pre-trained with self-supervision on a 50% sequence similarity representation of the Protein Data Bank (PDB)61; we refer to them as Structural Amino Acid Embeddings.

We highlight several design decisions of the regression head used in Stability Oracle. Of note is the use of a Siamese62 attention architecture that treats the mutation embeddings as two classification (CLS) tokens (Fig. 1b). CLS tokens are commonly used in the natural language processing (NLP) community to capture global context for downstream tasks63. Since atoms and amino acids are chemical entities at different scales, we designed the regression head so that a particular microenvironment contextualizes the "From" and "To" amino acid-level CLS tokens. Once contextualized for a given microenvironment, the two amino acid CLS tokens are subtracted from each other to produce a mutation hidden representation, which is then decoded into a ΔΔG prediction. This design enforces the state-function property of Gibbs free energy64, providing the proper inductive bias for thermodynamic reversibility and "self-mutations" (where ΔΔG = 0 kcal/mol).
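A hedged sketch of this regression head follows. It is our reading of Fig. 1b, not the released code: the module names, the single shared cross-attention layer, and the bias-free linear decoder are assumptions, chosen so that swapping the From/To tokens exactly negates the output and self-mutations decode exactly to 0.

```python
import torch
import torch.nn as nn

class SiameseDDGHead(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        # Shared (Siamese) cross-attention: each CLS token queries the same
        # microenvironment atom tokens with the same weights.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.decode = nn.Linear(dim, 1, bias=False)   # bias-free so decode(0) == 0

    def contextualize(self, cls_tok, env):
        # cls_tok: (B, 1, dim) structural amino acid embedding; env: (B, N, dim)
        out, _ = self.cross_attn(cls_tok, env, env)
        return out.squeeze(1)

    def forward(self, from_emb, to_emb, env):
        # from_emb/to_emb could be rows of the pre-trained classifier weight
        # matrix, e.g. classifier.weight[from_idx] (our assumption, see text).
        h_from = self.contextualize(from_emb.unsqueeze(1), env)
        h_to = self.contextualize(to_emb.unsqueeze(1), env)
        return self.decode(h_to - h_from).squeeze(-1)  # predicted ddG (kcal/mol)

# State-function sanity checks implied by the text: head(a, a, env) == 0 for
# self-mutations, and head(a, b, env) == -head(b, a, env) (reversibility).
```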
Training Stability Oracle to generalize across protein chemistry

The Stability Oracle framework was designed to generalize across all 380 mutation types at all positions within a protein structure. The development of such a model has historically been limited by data scarcity, bias, and leakage. To address these issues, we curated training and test datasets and developed a data augmentation technique called thermodynamic permutations (TP).

It is well known that a major issue with prior work is the inclusion of similar proteins in both the training and test sets ("data leakage")42, resulting in poor evaluation of generalization13,14,33,42. It has been demonstrated that train-test splits at the mutation, residue, or protein level result in overfitting to the validation set, and that strict sequence clustering is required to properly evaluate generalization13,14,33,42. Thus, we created new train-test splits based on a 30% sequence similarity threshold computed by MMSeqs248. First, we built the T2837 test set, which we then used to remove any homologous proteins from the remaining experimental data to produce the C2878 training set. The same procedure was used to construct the cDNA117K training set from the single-mutant subset of the recently published cDNA-display proteolysis Dataset #1 (ref. 50) that had experimental structures available (Fig. 2; a sketch of the homology-removal step is given below).

Fig. 2: Training and test set generation pipeline. Homologous proteins were identified using MMSeqs248 with a sequence similarity threshold of 30%. Q1744, O2567, and FP (FireProtDB83) represent different datasets. See detailed explanations in the main text.

Even with the expanded T2837 test set, we are still unable to assess generalization performance on 14% of the 380 mutation types, since they are not represented in T2837. Additionally, T2837 is heavily biased towards mutations to alanine (Fig. 3a, bottom row), further hindering our ability to evaluate the generalization of our model. The community has traditionally relied on the data augmentation technique of thermodynamic reversibility (TR)46 to generate datasets with expanded mutation type coverage (Supplementary Fig. 2). However, ~3% of mutation types in C2878 + TR and T2837 + TR still lack data (see Supplementary Fig. 3). More importantly, a major drawback of TR augmentation is that all stabilizing mutations it generates are to the wild-type amino acid, as shown in Fig. 3b. These mutations provide no predictive power for identifying non-wild-type stabilizing mutations, which is the main goal of thermodynamic stability prediction in the context of protein engineering. To improve the predictive power of deep learning frameworks for stabilizing proteins, additional data for stabilizing mutations not "to" the wild-type amino acid is required.
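The homology-removal step above (Fig. 2) can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the file names and the post-hoc identity filter are our assumptions; `mmseqs easy-search` and its default tabular output (query, target, fractional identity, ...) are standard MMseqs2.

```python
import subprocess

def leaky_train_ids(train_fasta: str, test_fasta: str,
                    hits_tsv: str = "hits.m8", min_id: float = 0.30) -> set:
    """Return training protein ids with >= min_id sequence identity to any
    test protein; these would be dropped from the training set."""
    subprocess.run(
        ["mmseqs", "easy-search", train_fasta, test_fasta, hits_tsv, "tmp"],
        check=True,
    )
    leaky = set()
    with open(hits_tsv) as fh:
        for line in fh:
            query, _target, fident = line.split("\t")[:3]
            if float(fident) >= min_id:   # fident: fractional identity (0-1)
                leaky.add(query)
    return leaky
```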
Fig. 3: Overview of the mutation type and ΔΔG distributions for the three proposed datasets and the impact of applying thermodynamic augmentations to them. a Heatmap representation of the mutation type distribution in C2878, cDNA117K, and T2837. For C2878, the original, TP, and original + TP + TR datasets consist of 2878, 18,912, and 24,668 mutations that sample 86.8%, 100%, and 100% of the mutation types, respectively (first row). For cDNA117K, the original, TP, and original + TP + TR datasets consist of 116,641, 2,018,710, and 2,251,992 mutations, respectively, and each dataset samples 100% of the mutation types (second row). For T2837, the original, TP, and original + TP + TR datasets consist of 2837, 7720, and 13,394 mutations that sample 85.5%, 100%, and 100% of the mutation types, respectively (third row). b Comparison of the ΔΔG distributions of the original training and test sets with their TP and TR augmentations. All three experimental datasets are biased towards destabilizing mutations (orig). TR augmentation provides additional data biased towards stabilizing mutations, and TP augmentation provides additional data that is evenly distributed between stabilizing and destabilizing mutations.

To address these issues and improve Stability Oracle's ability to generalize, we introduce thermodynamic permutations (TP), a data augmentation technique. TP is based on the state-function property of Gibbs free energy, which enables the generation of thermodynamically valid point mutations at residues where multiple amino acids have been experimentally characterized. With TP, we can generate an additional 2.02M, 18.9K, and 7.7K point mutations that sample all 380 mutation types for cDNA117K, C2878, and T2837, respectively. Additionally, TP mitigates several sampling biases in all three datasets (Fig. 3a, middle column). First, it provides mutation data for the 13.2% and 14.5% of mutation types absent from C2878 and T2837, respectively. TP-generated data for C2878 and T2837 sample all 380 mutation types, providing the first training and test sets with experimental ΔΔG measurements for all mutation types (the cDNA-display proteolysis dataset does not directly measure ΔΔG but instead derives ΔΔG values from the next-generation sequencing data of multiplexed proteolytic experiments).

Figure 3a illustrates the improvement in sampling bias as a softening of red (oversampled) and blue (undersampled) toward white (balanced sampling). In the C2878 and T2837 datasets, this is most apparent for the "to" alanine bias. In cDNA117K, there is an oversampling bias for mutations "from" alanine, glutamate, leucine, and lysine and an undersampling bias for mutations "from" cysteine, histidine, methionine, and tryptophan. TP completely balances the cDNA117K mutation type distribution, with each mutation type making up approximately 0.26% of the dataset (100%/380), depicted in Fig. 3a (middle column, middle row) as uniformly white. Thus, TP augmentation of cDNA117K provides the first large-scale ΔΔG dataset (>1M) that evenly samples all 380 mutation types across 100 protein domains. In contrast to TR, TP does not include stabilizing mutations to the wild-type amino acid and yields a balanced distribution (stabilizing vs. destabilizing) of ΔΔG measurements (Fig. 3b).
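How TP works in practice can be made concrete with a short sketch (our illustration; the record layout is an assumption). Because Gibbs free energy is a state function, two measurements at the same site, ΔΔG(wt→X) and ΔΔG(wt→Y), imply ΔΔG(X→Y) = ΔΔG(wt→Y) − ΔΔG(wt→X); a site with n characterized mutants therefore yields n(n−1) permuted mutations, none of which has the wild type as its "to" amino acid.

```python
from collections import defaultdict
from itertools import permutations

def thermodynamic_permutations(records):
    """records: iterable of (protein_id, position, wt_aa, mut_aa, ddg) with
    ddg = ddG(wt -> mut) in kcal/mol. Returns TP-augmented mutations."""
    by_site = defaultdict(dict)
    for prot, pos, wt, mut, ddg in records:
        by_site[(prot, pos, wt)][mut] = ddg
    augmented = []
    for (prot, pos, wt), measured in by_site.items():
        # Permute over characterized mutant amino acids only; mutations back
        # to the wild type would be TR, not TP.
        for (aa_i, ddg_i), (aa_j, ddg_j) in permutations(measured.items(), 2):
            augmented.append((prot, pos, aa_i, aa_j, ddg_j - ddg_i))
    return augmented

# TR, for comparison, is the single reversed pair: ddG(mut -> wt) = -ddG(wt -> mut).
```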
To develop the Stability Oracle framework, we compared training on cDNA117K and/or C2878, with and without TP augmentation, using Structural Amino Acid Embeddings vs. one-hot encodings, and evaluated performance on all test sets. We observed that Structural Amino Acid Embeddings significantly improve performance over the naive one-hot encoding (Fig. 4a). UMAP visualization of the mutation hidden representation for T2837 obtained with the Structural Amino Acid Embeddings reveals that the "ToAA" CLS token drives the organization of the latent space and recovers known biochemical relationships between the 20 amino acids (Fig. 5a). We observe (1) clustering of hydrophobes (LEU, VAL, ILE, MET), aromatics (PHE, TYR, TRP), and short polars (SER, THR and ASP, ASN) (right panel); (2) isolation of the unique amino acids (GLY, CYS, PRO) (right panel); and (3) the unique situation of mutating away from GLY and adding a chiral side chain (left panel). For a residue-specific case study of the 380 mutation types, see Supplementary Fig. 7. As for the training sets, fine-tuning the self-supervised representations on cDNA117K + TP + TR provided the best overall performance on regression and classification metrics across the test sets (Fig. 4b, c). While this might have been expected given its sheer size and mutation-type balance compared to C2878 + TP + TR, it is interesting to note that the proteolytic stability of single-domain natural proteins is in fact generally an excellent surrogate for thermodynamic stability (as was pointed out in the original publication50). From this data, the impact of TP on model generalization was unclear. To further examine how TP-augmented datasets affect generalization, we evaluated predictions for mutation types in T2837 + TP that were absent from C2878 + TR but present in C2878 + TP + TR, namely the 12 mutation types with no experimental data (see Supplementary Fig. 3). For these mutation types, TP improves generalization: recall improves from 0.28 to 0.4 and precision improves from 0.47 to 0.67 (Fig. 6). We artificially expanded the set of mutation types missing data from 12 to 68 and observed similar, but attenuated, improvements in both precision and recall (Fig. 6).

Fig. 4: Regression and classification performance of different models on T2837 and T2837 + TP. a Trained on cDNA117K, we compare performance with "Structural AA Embedding" and "One-Hot Encoding" inputs. b Comparison of Stability Oracle's performance when trained on the C2878 and cDNA117K training sets. c Comparison of Stability Oracle's performance when trained on C2878 with and without a pre-trained backbone and Structural Amino Acid Embeddings, and with and without TP augmentation. d Trained on T2837 + TP, Stability Oracle's performance when tested on experimental or AlphaFold structures of T2837. All reported correlation coefficients have p-values < 0.001. Source data are provided as a Source Data file.

Fig. 5: Evaluation of Stability Oracle from the FromAA and ToAA perspectives. a UMAP visualization of the 128-dimensional mutation hidden representation for T2837. Left and right panels are colored by the "FromAA" and "ToAA" of a mutation, respectively. b The experimental distribution of Stability Oracle's stabilizing predictions (ΔΔG < −0.5 kcal/mol) on the T2837 + TP test set for different "from" and "to" amino acid types. Here, stabilizing, neutral, and destabilizing mutations are defined by ΔΔG < −0.5 kcal/mol, ∣ΔΔG∣ ≤ 0.5 kcal/mol, and ΔΔG > 0.5 kcal/mol, respectively. Source data are provided as a Source Data file.

Fig. 6: Evaluation of TP on mutation types lacking experimental data in the C2878 training set. We report accuracy, recall, and precision (with 0 kcal/mol as the threshold) on two subsets of T2837 + TP to demonstrate the effectiveness of TP. On the left, we test the model on the 12 mutation types lacking experimental data in C2878 + TR (the missing mutation types can be found in Supplementary Fig. 3); this analysis covers only 54 mutations within T2837 + TP. On the right, we examine the impact of TP by artificially expanding the missing mutation types from 12 to 68, removing mutation types from C2878 that had fewer than 8 training instances available. This filtered version of C2878 allowed us to evaluate the performance of TP on 663 mutations within T2837 + TP. Source data are provided as a Source Data file.

To avoid inflating Stability Oracle's classification performance, we focus our evaluations on T2837 + TP (10,557 mutations) and exclude all TR mutations, since these are heavily biased towards stabilizing mutations to the wild-type amino acid (see Supplementary Fig. 4). Here, Stability Oracle demonstrated a recall of 0.69, a precision of 0.70, and an AUROC of 0.83 (Fig. 4b).
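For reference, the classification numbers above follow a standard protocol; a minimal sketch under our conventions (stabilizing is the positive class, thresholded at 0 kcal/mol unless stated otherwise, and AUROC scored on the negated prediction so that larger scores mean more stabilizing):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def stability_classification_metrics(ddg_true, ddg_pred, threshold=0.0):
    ddg_true, ddg_pred = np.asarray(ddg_true), np.asarray(ddg_pred)
    y_true = ddg_true < threshold          # experimentally stabilizing
    y_pred = ddg_pred < threshold          # predicted stabilizing
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, -ddg_pred),  # rank by "how stabilizing"
    }
```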
Surprisingly, further fine-tuning with C2878 + TP did not improve performance on T2837 or T2837 + TP. Our analysis reveals that every protein in C2878 is homologous (>30% sequence similarity) to at least one protein in cDNA117K, so C2878 does not expand the protein space available for training. This observation provides a rationale for the lack of improvement upon further fine-tuning on C2878. However, C2878 fine-tuning does improve performance on the interface subset of T2837: the Pearson correlation improves from 0.30 to 0.35, and the Spearman correlation improves from 0.29 to 0.35 for interface microenvironments. This improvement is expected, since the cDNA dataset consists of monomeric single-domain structures that lack interfaces with other proteins, ligands, or nucleotides. However, meaningful improvements are limited by the scarcity of protein-protein (127 mutations), protein-ligand (94 mutations), and protein-nucleotide (9 mutations) data in C2878.

Since experimental structures are often unavailable, we examined Stability Oracle's ability to generalize to structures generated by AlphaFold2 (ref. 22) with the WT, "From", and "To" amino acid present. We used ColabFold to generate template-free predicted structures for each protein in T2837 (ref. 65). ColabFold failed to fold one protein, and two structures were removed because TM-align66 reported US-scores < 0.5 (refs. 66,67). This resulted in the removal of 50 mutations from T2837. When evaluating T2837 and T2837 + TP using the AlphaFold WT structure, we observed no changes in classification metrics and slightly lower performance on regression metrics on T2837 + TP (Fig. 4d). Next, we evaluated the impact of using the "From" and "To" AlphaFold structures on the T2837 TP-only dataset (7720 mutations, 100% mutation type coverage) and observed a 2-4% drop in classification and regression metrics (Supplementary Table 4). Overall, these results demonstrate the ability of Stability Oracle to generalize to AlphaFold scaffolds when an experimental structure is unavailable.

We conducted several comparisons against the literature. First, we report Pearson correlation coefficients (PCCs) (both forward and reverse) on T2837 and all the common test sets. For the common test sets, we compare against several community predictors in Fig. 7 and provide all classification and regression metrics on literature test sets in Supplementary Table 5a. Notably, Stability Oracle outperforms other predictors in the literature even with their documented data leakage issues14. To date, the most accurate and exhaustive thermodynamic stability dataset in the literature is the Gβ1 dataset45. We evaluated Stability Oracle's performance on Gβ1 (ref. 45) and, to the best of our knowledge, achieve SOTA on all 935 mutations (Pearson = 0.75, AUROC = 0.84) and on the 835-mutation quantitative subset (Pearson = 0.67, AUROC = 0.81) (full results are provided in Supplementary Table 6). Finally, we evaluated Stability Oracle's structural sensitivity with a case study on p53, an issue previously documented for structure-based stability predictors68. We evaluated three p53 structures (PDB: 2OCJ, 3Q05, 2AC0) that differ in protein length (94-312, 94-326, 94-293), resolution (2.05, 2.40, 1.80 Å), and biological assembly (homodimer with no DNA, homotetramer complexed with a DNA helix, homotetramer complexed with two DNA helices), respectively, visualized in Supplementary Fig. 1a.
This case study demonstrates that Stability Oracle generalizes amid significant structural variation of p53, achieving Pearson = 0.75 ± 0.02, Spearman = 0.76 ± 0.05, precision = 0.55 ± 0.07, and AUROC = 0.83 ± 0.02 (full results are provided in Supplementary Fig. 1b).

Fig. 7: Pearson correlation coefficients of Stability Oracle and Prostata-IFML across several test sets. We compare against a handful of computational stability predictors from the community (values obtained from the literature and also provided in Supplementary Table 11; refs. 36,38,40,85,86). Source data are provided as a Source Data file.

Evaluating Stability Oracle's ability to identify stabilizing mutations

For computational stability predictors to accelerate protein engineering, it is critical that their predictions correctly identify stabilizing mutations. However, it is well documented that SOTA stability predictors correctly predict stabilizing mutations at only a ~20% success rate, and that most of their stabilizing predictions are experimentally neutral or destabilizing13,33. While molecular dynamics-based methods, such as free energy perturbation (FEP), have demonstrated a 50% success rate at identifying stabilizing mutations, their computational demands prevent them from scaling to whole-protein applications such as computational deep mutational scans (DMS)33,69. Thus, there is a strong need for a method that can match the performance of FEP while being computationally inexpensive.

To evaluate Stability Oracle's ability to identify stabilizing mutations, we filtered its predictions on T2837 and T2837 + TP at different ΔΔG thresholds and assessed the distribution of experimentally stabilizing (ΔΔG < −0.5 kcal/mol), neutral (∣ΔΔG∣ ≤ 0.5 kcal/mol), and destabilizing (ΔΔG > 0.5 kcal/mol) mutations. The 0.5 kcal/mol cutoff was chosen based on the average experimental error70. With the ΔΔG < −0.5 kcal/mol prediction threshold, 1770 mutations passed the filter, with an experimental distribution of 74.0% stabilizing, 17.8% neutral, and 8.2% destabilizing; 48.1% of all stabilizing mutations were correctly identified. A systematic analysis of prediction thresholds is provided in Fig. 8 and Supplementary Table 8a. The success rate in predicting stabilizing mutations (74%) appears to surpass what is typically observed with FEP methods (~50%)33,69, at orders of magnitude less computational cost (Supplementary Table 1). We further examined Stability Oracle's ability to identify stabilizing mutations by amino acid (Fig. 5b). Here, we observe that Stability Oracle correctly predicts stabilizing mutations across most amino acids, whether mutating "from" or "to". However, several amino acids lack sufficient "from" or "to" stabilizing predictions to draw meaningful conclusions. This data scarcity is even more apparent when looking at the 380 "from"-"to" pairs (see Supplementary Fig. 5), highlighting how data scarcity still hinders proper model evaluation.

Fig. 8: Classification comparison of Stability Oracle (SO), Prostata-IFML (PRO), and RaSP at different ΔΔG thresholds. In the first column, comparing on T2837, Stability Oracle has the highest stabilization fraction and the lowest destabilization fraction with similar recall. In the second column, comparing on T2837 + TP, Stability Oracle still has the highest stabilization fraction and the lowest destabilization fraction, but Prostata-IFML has better recall. Source data are provided as a Source Data file.
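A hedged sketch of the threshold analysis above and in Fig. 8 (our function and variable names; the ±0.5 kcal/mol experimental bins follow the text):

```python
import numpy as np

def stabilizing_prediction_profile(ddg_true, ddg_pred, pred_cutoff=-0.5):
    """Filter predicted-stabilizing mutations at `pred_cutoff` and summarize
    the experimental labels of the filtered set."""
    ddg_true, ddg_pred = np.asarray(ddg_true), np.asarray(ddg_pred)
    picked = ddg_true[ddg_pred < pred_cutoff]
    if picked.size == 0:
        return {"n_filtered": 0}
    return {
        "n_filtered": picked.size,
        "stabilizing": float(np.mean(picked < -0.5)),
        "neutral": float(np.mean(np.abs(picked) <= 0.5)),
        "destabilizing": float(np.mean(picked > 0.5)),
        # fraction of all experimentally stabilizing mutations recovered
        "recovered": float((picked < -0.5).sum() / max((ddg_true < -0.5).sum(), 1)),
    }
```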
It has been pointed out by the community that experimentally characterized surface stabilizing mutations are biased towards hydrophobic amino acids33. An analysis by Broom et al. of the ProTherm database71 indicates that surface stabilizing mutations typically increase side-chain hydrophobicity (ΔΔGsolvation), with a median change of 0.8 kcal/mol. This hydrophobicity bias is equivalent to an alanine-to-valine mutation on the protein surface33. We examined whether this underlying hydrophobicity bias persisted within our training pipeline by computing the precision and recall of polar and hydrophobic mutations as a function of the relative solvent accessibility (RSA) of the wild-type residue. Our precision and recall results across RSA bins of T2837 and T2837 + TP indicate that the cDNA117K + TP training set does not produce models biased towards predicting hydrophobic amino acids on protein surfaces (Fig. 9).

Fig. 9: Comparison of Stability Oracle's precision and recall between hydrophobic and hydrophilic amino acids as a function of relative solvent accessibility (RSA). The results demonstrate no bias between polar and hydrophobic residues throughout a protein structure on either T2837 or T2837 + TP: a significance test of the polar-hydrophobic difference over the whole dataset (not per RSA bin, due to lack of data) was insignificant. Here, polar amino acids consist of S, T, N, Q, D, E, R, K, H, and Y, and hydrophobic amino acids consist of L, M, I, V, F, W, and A. Source data are provided as a Source Data file.

Comparing sequence and structure fine-tuned stability predictors

Over the last three years, self-supervised protein large language models (pLLMs or "sequence models") have had a tremendous impact on the protein community25,72,73,74,75,76,77,78,79. Understanding sequence- vs. structure-based prediction models continues to be an active area of research in protein engineering and design19,80. We evaluated Stability Oracle against two computational stability deep learning frameworks: Prostata and RaSP. Prostata26 is a sequence-based framework that fine-tunes embeddings from ESM2 (a SOTA protein language model)25, ensembling five distinct regression heads, on common training and test sets. However, Prostata was trained on proteins homologous (at a 75% sequence similarity cutoff) to those in SSym and S669, resulting in inflated performance on T2837 and its subset test sets (a breakdown of the performance and data leakage is provided in Supplementary Table 7). To address this data leakage and conduct a fair comparison, we fine-tuned ESM2's representations using the same training and test sets as Stability Oracle and only the outer-product regression head architecture. We call this version of ESM2's representations fine-tuned on thermodynamic stability Prostata-IFML. We also compare against the RaSP framework81: a structure-based 3DCNN model that follows a similar training pipeline. Briefly, RaSP is pre-trained with self-supervision on 18 Å masked microenvironments sampled from 2315 structures clustered at 30% sequence similarity and then fine-tuned on 35 DMS datasets computationally generated by the Rosetta Cartesian-ΔΔG protocol81. In our analysis, we modified the RaSP Colab notebook to generate DMS predictions for every protein in T2837 and every mutation in T2837 + TP.

Stability Oracle outperforms or matches Prostata-IFML on every metric (Fig. 10), even though Stability Oracle has 548 times fewer parameters (~1.2M vs. ~658M) and was pre-trained on 2000 times fewer proteins (~23K vs. ~46M) at the same sequence similarity (50%). As for RaSP, Stability Oracle significantly outperforms it on nearly every classification and regression metric on both T2837 and T2837 + TP; performance is comparable only for the Pearson correlation on T2837 and precision on T2837 + TP.

Fig. 10: Comparison of the regression and classification performance of Stability Oracle, Prostata-IFML, and RaSP on T2837 and T2837 + TP. We refer readers to Supplementary Section A for detailed results. Source data are provided as a Source Data file.

In terms of identifying stabilizing mutations, Stability Oracle also achieves the best performance (Fig. 8). At each prediction threshold, Stability Oracle had the highest proportion of correctly identified stabilizing mutations and the lowest proportion of destabilizing mutations (we exclude the −1.5 kcal/mol threshold for T2837 due to data scarcity). Precision is typically inversely proportional to recall, and we observe this tradeoff between Stability Oracle and Prostata-IFML on T2837 + TP, where Prostata-IFML has better recall. We suspect this difference is due to their losses: Stability Oracle uses a Huber loss, whereas Prostata uses mean squared error. Nonetheless, both Stability Oracle and Prostata-IFML are superior to RaSP at both correctly identifying stabilizing mutations (precision) and identifying more of the stabilizing mutations (recall). A detailed comparison is provided in Supplementary Table 8. In parallel to this work, ThermoMPNN, a deep learning framework that fine-tunes ProteinMPNN82 representations, also on the megascale cDNA proteolysis dataset, was developed. Using the publicly available checkpoint, we found that Stability Oracle outperforms ThermoMPNN on SSym, S669, myoglobin, and p53 across multiple regression and classification metrics (Supplementary Table 10).

Finally, we compared all three frameworks' ability to predict self-mutations, where the "from" and "to" amino acids are the same and the ΔΔG is 0 kcal/mol. Similar to the forward vs. reverse experiments, which assess the thermodynamic robustness of predictors, self-mutations evaluate generalization to trivial examples that were not present in the training set but are inherent to thermodynamics. For wild-type self-mutations on T2837, Stability Oracle, Prostata-IFML, and RaSP achieve RMSEs of 0.0033, 0.0018, and 0.8370 kcal/mol, respectively. This demonstrates that Stability Oracle and Prostata-IFML implicitly learn to capture self-mutations. RaSP, however, is unable to generalize to self-mutations, and this drop in performance is also observed for the TR augmentation of T2837 (Supplementary Table 9c).
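The self-mutation check is straightforward to reproduce in outline. A minimal sketch, with `predict_ddg` standing in for any of the three predictors (a placeholder, not a real API):

```python
import numpy as np

def self_mutation_rmse(predict_ddg, sites):
    """sites: iterable of (structure, position, wt_aa); the thermodynamic
    ground truth for every from==to mutation is exactly 0 kcal/mol."""
    preds = np.array([predict_ddg(s, pos, aa, aa) for s, pos, aa in sites])
    return float(np.sqrt(np.mean(preds ** 2)))
```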
