An end-to-end framework for the prediction of protein structure and fitness from single sequence

Network architecture of SPIRED

The SPIRED model mainly consists of four Folding Units (Fig. 1a, Algorithm 1 of Supplementary Information 5.1). When predicting the protein structure, SPIRED only requires the amino acid sequence of the target protein, which is encoded into a high-dimensional embedding (1D information) by the ESM-2 (ref. 11) language model. The sequence embedding is then fed into the Folding Units, in each of which the 1D and 2D information is mutually updated and multiple sets of Cα atom coordinates are predicted. Mainstream methods such as AlphaFold2 and ESMFold employ the 1D information to predict the atom coordinates in the global coordinate system (i.e. the laboratory coordinate system in which a protein structure is determined experimentally). In contrast, each Folding Unit of SPIRED uses the 2D information to predict a total of L (i.e. the number of residues) sets of relative coordinates for the Cα atoms, each taking the local coordinate system of an individual residue (i.e. the coordinate system in which the Cα atom lies at the origin, the C atom on the x-axis and the N atom in the xy plane) as the reference frame. Since both outputs and labels (i.e. relative Cα coordinates in individual local frames) are roto-translationally invariant, our design avoids the equivariant operations that usually increase computational complexity. The multiple sets of Cα coordinates predicted by the last Folding Unit, along with pLDDT and main-chain torsion angles (also represented as 2D matrices), are passed to GDFold2 (ref. 31), an in-house folding algorithm based on gradient descent optimization, for main-chain adjustment and side-chain packing, resulting in the full atomic coordinates of the protein.

The network structures of the first three Folding Units are essentially the same. Here, we take Folding Unit1 (Fig. 1b, Algorithm 2 of Supplementary Information 5.1) as an example to illustrate the basic architecture. For Folding Unit1, the 1D feature is the sequence embedding provided by ESM-2 and the 2D feature is initialized as a zero-valued tensor; for the other Folding Units, the input 1D and 2D features are generated by the preceding Folding Unit. Within a Folding Unit, the 1D and 2D features are first updated by the Triangular Self-Attention module7,11,59. The new 1D feature is directly passed on to the next Folding Unit, whereas the updated 2D feature goes through Instance/Row/Column Normalization operations60 (I/R/CN in Fig. 1b, Algorithm 4 of Supplementary Information 5.1), followed by the coordinate prediction module Pred-XYZ (Algorithm 5 of Supplementary Information 5.1). The first Pred-XYZ module predicts the absolute Cα coordinates and generates a new 2D feature that is passed on to the second Pred-XYZ module, which predicts additional corrections to the Cα coordinates. The two Pred-XYZ modules share parameters. The pairwise distances between Cα atoms are then calculated from the predicted coordinates (Algorithm 6 of Supplementary Information 5.1).
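To make this step concrete, the sketch below computes per-frame pairwise Cα distances with PyTorch, assuming the multiple coordinate sets are stacked into an (L, L, 3) tensor whose k-th row holds all residues expressed in the local frame of residue k; the tensor layout and function name are illustrative and not taken from Algorithm 6.

```python
import torch

def pairwise_ca_distances(xyz: torch.Tensor) -> torch.Tensor:
    """Per-frame pairwise C-alpha distances.

    xyz: (L, L, 3) tensor where xyz[k, i] is the predicted position of
    residue i's C-alpha expressed in the local frame of residue k
    (a hypothetical layout chosen for this illustration).
    Returns an (L, L, L) tensor whose [k, i, j] entry is |xyz[k, i] - xyz[k, j]|.
    """
    return torch.cdist(xyz, xyz)  # cdist batches over the L reference frames

# A single (L, L) distance map can then be obtained by averaging over frames:
# dist_2d = pairwise_ca_distances(xyz).mean(dim=0)
```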
The distance matrix, along with the 2D feature, is then passed to ConvBlock (Algorithm 7 of Supplementary Information 5.1), resulting in a new 2D feature that enters the next Folding Unit.

The network architecture of Folding Unit4 (Algorithm 3 of Supplementary Information 5.1) is slightly more complex than that of the other Folding Units, as it engages six Pred-XYZ modules for coordinate prediction and updates. The first four and the last two Pred-XYZ modules are constrained by two slightly different versions of the RD Loss (see Algorithms 11 and 12), respectively, and are thus designed to have two separate sets of shared weights. The coordinates updated by the last Pred-XYZ module in Folding Unit4 serve as the final Cα coordinates. In addition, the 2D feature generated by Folding Unit4 is utilized to predict the Cβ distance distribution, the dihedral and planar angles quantifying inter-residue orientation (Algorithm 8 of Supplementary Information 5.1), and the main-chain torsion angles (Algorithm 9 of Supplementary Information 5.1). Incidentally, each Folding Unit has the capacity to predict pLDDT (Algorithm 10 of Supplementary Information 5.1), and we consider the pLDDT values output by Folding Unit4 as the representative ones. Finally, since the sequential arrangement of multiple Folding Units yields benefits for structure refinement akin to the recurrent expansion achieved by recycling (Supplementary Fig. S1, see Supplementary Information 1.1 for details), recycling is disabled by default (i.e. Cycle = 1) to accelerate inference, but can be optionally activated (e.g., Cycle = 4) in SPIRED.

Relative displacement loss in SPIRED

During the training of SPIRED, the RD Loss (Fig. 1c, Algorithms 11 and 12 of Supplementary Information 5.2) is utilized to constrain the Cα coordinates predicted by each Folding Unit. The RD Loss is designed to play the constraining role of the FAPE Loss7 in a computationally less intensive manner: it circumvents the laborious coordinate alignment and the costly prediction of rotation matrices, and instead evaluates the average prediction accuracy of the relative displacement vectors between each pair of Cα atoms in the multiple local reference frames. Before calculating the RD Loss, a local coordinate system is established for each individual residue, where Cα is set as the origin and the basis vectors are determined from the positions of the Cα, C and N atoms, following the AlphaFold2 (ref. 7) definition. SPIRED predicts the Cα coordinates of all residues in each local reference frame. As shown in Fig. 1c, in the local coordinate system of residue k, the relative displacement between a pair of residues i and j is evaluated for the predicted structure (\({\overrightarrow{\tilde{x}}}_{ij}={\overrightarrow{\tilde{x}}}_{kj}-{\overrightarrow{\tilde{x}}}_{ki}\)) and the ground truth (\({\overrightarrow{x}}_{ij}={\overrightarrow{x}}_{kj}-{\overrightarrow{x}}_{ki}\)), respectively. The RD Loss is then computed as the difference between the predicted and ground-truth vectors, averaged over all residue pairs and over all reference frames.
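The following PyTorch sketch illustrates this computation under the same (L, L, 3) coordinate layout as above; the clamping and weighting details that distinguish Algorithms 11 and 12 are omitted.

```python
import torch

def rd_loss(pred: torch.Tensor, true: torch.Tensor) -> torch.Tensor:
    """Simplified RD Loss.

    pred, true: (L, L, 3) tensors of C-alpha coordinates, where row k
    holds all residues expressed in the local frame of residue k.
    """
    # displacement of residue j relative to residue i within each frame k:
    # disp[k, i, j] = x[k, j] - x[k, i]
    pred_disp = pred[:, None, :, :] - pred[:, :, None, :]  # (L, L, L, 3)
    true_disp = true[:, None, :, :] - true[:, :, None, :]
    # average Euclidean error over all frames k and residue pairs (i, j);
    # production code would likely loop over frames to bound memory
    return (pred_disp - true_disp).norm(dim=-1).mean()
```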
In contrast, almost all mainstream structure prediction models (e.g., AlphaFold2, ESMFold and OmegaFold) use the FAPE Loss, which requires predicting quaternions to realize rotations and laboriously aligning the predicted and true coordinates. Although the offsets between predicted and true coordinates are also evaluated in the FAPE Loss, the inter-residue relative displacement \({\overrightarrow{\tilde{x}}}_{ij}\) is not specifically considered. The RD Loss therefore brings two advantages for training the structure prediction model. First, it avoids predicting rotation matrices and only requires the prediction of relative positions between residues, thereby alleviating the difficulty of training SPIRED. Second, it focuses on the relative displacement between residues, a metric that correlates more strongly with inter-residue vibrations than with global translation and rotation.

Besides the RD Loss, the inter-residue distance and angle distribution losses are computed based on the Cβ distance distogram as well as the dihedral and planar angles of all residue pairs following the trRosetta61 definition, and are utilized as auxiliary losses for the training of SPIRED. In addition, the Cα distance loss (Algorithm 13 of Supplementary Information 5.2), pLDDT loss (Algorithm 14 of Supplementary Information 5.2) and Cα clash loss (Algorithm 15 of Supplementary Information 5.2) are also computed as auxiliary losses. Details about the implementation and combination of these losses are described in Supplementary Information 4.1.

Network architecture of SPIRED-Fitness and SPIRED-Stab

The SPIRED-Fitness model engages ESM-2 and SPIRED as the extractors of 1D and 2D information, respectively (Fig. 1d). The downstream Fitness Module is mainly composed of the Geometric Encoder, which adopts the Graph Attention Network (GAT) architecture (Algorithms 16, 17 and 18 of Supplementary Information 5.3) to iteratively update the node and edge features provided by ESM-2 and SPIRED. Specifically, the node feature is initialized by the sequence embedding of ESM-2 (650M), whereas the edge feature includes the multiple sets of Cα coordinates and the pLDDT values predicted by SPIRED. The updated node and edge features are then fed into MLP (i.e. multi-layer perceptron) layers for the prediction of fitness changes caused by single and double mutations, respectively. Notably, in the prediction of single mutational effects, the fitness landscape is generated from the 1D MLP output in combination with the ESM-1v42 logits (i.e. the logits before the Softmax operation in the last output layer of the ESM-1v model), following the procedure of our prior work on GeoFitness v1 (ref. 25) (see Supplementary Information 3.3 for a brief introduction). As for the prediction of double mutational effects, the fitness scores of all possible mutations for each residue pair are predicted directly from the individual terms of the 2D MLP output.

Since the SPIRED-Fitness model can be sufficiently optimized on the abundant DMS data to learn general mutational effects, reutilizing SPIRED-Fitness modules in SPIRED-Stab effectively overcomes the challenge of the limited amount of data for protein stability prediction. A similar idea has been validated in our prior work on GeoDDG/GeoDTm v1 (ref. 25) (see Supplementary Information 3.3 for a brief introduction). Specifically, the majority of the SPIRED-Fitness model (ESM-2, SPIRED and the Geometric Encoder, as enclosed by a dashed box in Fig. 1d) is directly implanted into SPIRED-Stab with the same network architecture and parameters, followed by MLP layers for the prediction of stability scores (Algorithms 19 and 20 of Supplementary Information 5.3). Notably, SPIRED-Stab retains shared weights for the two input channels, i.e. the wild-type and mutant sequences, and the difference between their prediction scores is then scaled to predict the absolute values of ΔΔG and ΔTm, a design similar to our prior GeoDDG/GeoDTm v1 models that intrinsically guarantees the antisymmetry of the prediction results.
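A toy PyTorch sketch of this antisymmetric design is given below; `score` stands in for the shared ESM-2/SPIRED/Geometric Encoder/MLP stack, and the parameter name and sign convention are illustrative rather than taken from Algorithm 19.

```python
import torch
import torch.nn as nn

class AntisymmetricStabHead(nn.Module):
    """Toy version of the shared-weight, antisymmetric design."""

    def __init__(self, dim: int):
        super().__init__()
        # stand-in for the shared ESM-2/SPIRED/Geometric Encoder/MLP stack
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.ddg_coef = nn.Parameter(torch.ones(1))  # illustrative scaling parameter

    def forward(self, wt_feat: torch.Tensor, mut_feat: torch.Tensor) -> torch.Tensor:
        # The same weights score both channels, so swapping the wild-type
        # and mutant inputs exactly negates the output.
        return self.ddg_coef * (self.score(mut_feat) - self.score(wt_feat))
```

Because both channels pass through identical weights, exchanging the wild-type and mutant inputs flips the sign of the prediction, so antisymmetry holds by construction rather than by regularization.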
Training set for SPIRED structure prediction

First, we collected the protein structures available until March 2022 from the PDB29 database, filtering out structure files with >5 polypeptide chains or with resolution >5 Å. Then, we split the remaining structures into single chains and retained chains with lengths between 40 and 1,200 residues. Next, we clustered these chains using MMseqs2 (ref. 62) easy-cluster with a sequence identity threshold of 100% and kept only the representative chain of each cluster, which finally resulted in 113,609 chains. We also utilized domain structures from the CATH35 database (v4.2, S35) as supplementary training data, which contained 24,183 domains with lengths ranging from 63 to 600 residues.

Training process for SPIRED structure prediction

As shown in Supplementary Table S12, the training process of SPIRED is mainly divided into four stages, during which the learning difficulty is continually increased (e.g., by including hard protein samples or enlarging the cropping size), allowing the model to grasp the protein sequence-structure relationship gradually. Technical details of the four stages are as follows.

In the first stage, we clustered 101,915 polypeptide chains (released before May 2020) at 30% sequence identity using MMseqs2, which resulted in 24,179 clusters. We trained SPIRED for ~10,000 update steps on the clustered PDB chains, with one chain iteratively chosen from every cluster in each epoch. During this process, the learning rate was linearly warmed up from 10−6 to 10−3 over the first 1,000 updates, held at the peak value of 10−3 for the next 6,500 updates, and declined to 5 × 10−4 over the final 2,500 updates.

In the second stage, we selected an "easy subset" with length <400 residues and resolution <3 Å from the whole training set, and trained SPIRED on these ~63,000 chains for ~8,000 updates. The learning rate declined from 5 × 10−4 to 10−4 in this stage.

In the third stage, we used the whole training set, containing 113,609 PDB chains (released before March 2022) and 24,183 CATH domains, to train SPIRED for ~23,000 updates, with the learning rate annealed from 10−4 to 5 × 10−5. The cropping size was kept at 256 in the first three stages.

In the fourth stage, we trained SPIRED for 18,000 updates with the cropping size expanded to 350, and for the next 12,000 updates with the cropping size at 420. The learning rate was annealed from 5 × 10−5 to 10−5 during this stage.

The batch size was fixed to 64 and the Adam optimizer was used throughout the training process of SPIRED.
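The stage-1 schedule can be written down directly as a function of the update step; a minimal sketch is shown below, assuming a linear shape for the final decline (the text does not specify the decay curve). It could be used, e.g., with torch.optim.lr_scheduler.LambdaLR on an optimizer whose base learning rate is set to 1.0.

```python
def stage1_lr(step: int) -> float:
    """Stage-1 learning rate as a function of the update step:
    linear warm-up 1e-6 -> 1e-3 over the first 1,000 updates,
    a plateau at 1e-3 for the next 6,500 updates, then a decline
    to 5e-4 over the final 2,500 updates (decay shape assumed linear)."""
    if step < 1_000:                            # warm-up
        return 1e-6 + (1e-3 - 1e-6) * step / 1_000
    if step < 7_500:                            # plateau
        return 1e-3
    frac = min(step - 7_500, 2_500) / 2_500     # final decline
    return 1e-3 + (5e-4 - 1e-3) * frac
```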
Test sets for protein structure prediction

We used two test sets to evaluate the performance of structure prediction methods. The first test set was constructed from CAMEO26 targets (August 2022 ~ August 2023), consisting of 680 protein chains with lengths ranging from 50 to 1,126 residues (Supplementary Data 6). The second test set was composed of 45 protein domains released on the CASP15 (ref. 28) official website (Supplementary Data 7).

We used two structure classification databases, the SCOPe33 database (v2.08, S95, September 2021) and the CATH35 database (v4.2, S35, July 2017), to evaluate the structure prediction power on different types of protein backbone folds or topologies. We selected domains from SCOPe with lengths ranging from 50 to 800 residues, resulting in 1,231 folds and 34,021 domains in total. Similarly, 1,223 topologies and 24,183 domains were collected from CATH.

Training and test sets for protein fitness prediction

We utilized DMS data from three different sources to train and test the fitness prediction models.

cDNA proteolysis dataset37. Tsuboyama et al. constructed a library in which mutated proteins were covalently linked to their cDNA. These proteins were subjected to proteolysis, and the cDNA fragments attached to the uncleaved proteins could be detected through sequencing, allowing the quantification of intact proteins at different protease concentrations. Because mutated proteins with lower folding stability are more susceptible to proteolytic cleavage, protein ΔG values can be estimated from the cleavage rate data using Bayesian inference. This experimental method enables large-scale analysis of the impact of mutations on protein stability, allowing the folding stability of 900,000 protein domains to be examined within a week. From the data provided in the article, we selected 412 proteins with lengths ranging from 32 to 72 residues to compose a dataset for protein fitness prediction. Of these, 153 proteins have data for both single and double mutations, while the rest have data only for single mutations.

MaveDB38,39 is a database that contains fitness data of mutated proteins obtained from DMS experiments and massively parallel reporter assays, covering enzymatic activity, binding affinity, etc. We selected 51 proteins from MaveDB for the training and testing of our models.

The DeepSequence dataset40 collects fitness data of mutated proteins from DMS experiments. After filtering out entries redundant with the MaveDB database, we retained 22 proteins from this dataset for the subsequent fitness training and testing.

Details of the combined MaveDB/DeepSequence datasets can be found in Supplementary Data 8. The data from the three aforementioned sources collectively constitute a dataset of 485 proteins, comprising ~693,000 single mutations and ~265,000 double mutations. For each protein, all fitness data were randomly assigned to training, validation and testing with a ratio of 7:1:2.

Training process for SPIRED-Fitness

The training of SPIRED-Fitness is mainly divided into two stages.

In the first stage, the SPIRED parameters were frozen and only the parameters of the Fitness Module were updated, for ~400 epochs. The learning rate was initially set to 10−3 and was adjusted by the ReduceLROnPlateau learning rate scheduler (factor = 0.5, patience = 10). The Fitness Module achieving the best fitness prediction performance on the validation set was used for continued training in the next stage (Fitness Module hyper-parameters: node_dim = 32, pair_dim = 32, N_head = 8, N_block = 2; see Algorithms 17 and 18 of Supplementary Information 5.3).
When calculating the loss of this training stage, single mutations and double mutations are combined into a comprehensive mutation set, and the Soft Spearman Loss48 (see Supplementary Information 4.2 for details) between the predicted fitness scores and the ground-truth values is computed within this set (Eq. (1)):
$$\mathrm{Fitness\_Loss}=\mathrm{Soft\_Spearman\_Loss}(\{\mathrm{single\_mutation}\}\cup \{\mathrm{double\_mutation}\})$$
(1)
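Reference 48 gives the exact formulation of the Soft Spearman Loss; as a rough illustration of the idea, the sketch below builds differentiable ranks from pairwise sigmoid comparisons and maximizes the Pearson correlation of those soft ranks. This is one common construction and not necessarily identical to the implementation used here.

```python
import torch

def soft_rank(x: torch.Tensor, temp: float = 0.1) -> torch.Tensor:
    """Differentiable ranks from pairwise sigmoid comparisons."""
    diff = x[:, None] - x[None, :]             # (n, n) pairwise differences
    return torch.sigmoid(diff / temp).sum(dim=1)

def soft_spearman_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """1 - Pearson correlation of the soft ranks; minimizing this
    maximizes a differentiable surrogate of Spearman's rho."""
    rp, rt = soft_rank(pred), soft_rank(target)
    rp, rt = rp - rp.mean(), rt - rt.mean()
    corr = (rp * rt).sum() / (rp.norm() * rt.norm() + 1e-8)
    return 1.0 - corr
```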
In the second stage, both the SPIRED module and the Fitness Module were allowed to update their parameters, using training data from two sources: structural data and fitness data. The structural data were initially taken from the training set of the fourth training stage of SPIRED (Supplementary Table S12), filtered to lDDT >0.5 (~133,000 protein chains), and were then randomly shuffled and divided into 133 subsets of 1,000 samples each. The fitness data were the DMS data used in the first stage. Each epoch of the training process covered one subset of structural samples plus nearly all fitness samples (from 482 proteins, after excluding 3 large proteins with length >800 residues), i.e. 1,482 proteins in total. After training over all structural samples for 133 epochs, SPIRED-Fitness was fine-tuned on CPU for the three large proteins previously excluded from the fitness samples. The learning rate for the SPIRED module was fixed at 10−5, while that for the Fitness Module was initialized to 10−4 and then manually adjusted to 10−5. The loss for this stage is the Union Loss defined in Eq. (2): the Structure Loss alone is applied to the structural samples, while the joint loss of structure and fitness is applied to the fitness samples. The Structure Loss takes the same form as in the SPIRED model training (see Supplementary Information 4.1 for details), but is scaled by a weight of 0.05:
$$\mathrm{Union\_Loss}=\begin{cases}0.05\times \mathrm{Struct\_Loss} & (\text{Structure data})\\ 0.05\times \mathrm{Struct\_Loss}+\mathrm{Fitness\_Loss} & (\text{Fitness data})\end{cases}$$
(2)
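Eq. (2) translates directly into a small helper; in the sketch below, a sample that carries DMS labels contributes the fitness term on top of the down-weighted structure term.

```python
def union_loss(struct_loss, fitness_loss=None):
    """Eq. (2): every sample contributes 0.05 x Struct_Loss; samples
    that carry DMS labels add the fitness term on top."""
    total = 0.05 * struct_loss
    if fitness_loss is not None:   # fitness (DMS-labeled) sample
        total = total + fitness_loss
    return total
```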
Training and test sets for protein stability prediction

The datasets utilized for training and testing SPIRED-Stab are described in detail here.

The Dual Task Dataset is a dataset constructed in this work for the training of SPIRED-Stab. We collected single, double and triple or higher-order mutation data with ΔΔG and/or ΔTm labels from two protein stability databases, ProThermDB63 and ThermoMutDB64, and cautiously cleaned each piece of data to generate the dataset for the ΔΔG/ΔTm dual-task training of SPIRED-Stab. The final dataset contains 8,458 single mutations, 966 double mutations and 619 triple or higher-order mutations (i.e. ≥3 mutated positions), among which 5,331 entries have only the ΔΔG label, 2,560 entries have only the ΔTm label, and 2,152 entries have both ΔΔG and ΔTm labels.

S669 (ref. 51) is a widely used test set for assessing the accuracy of ΔΔG prediction. This dataset consists of 669 single-point mutations derived from 94 proteins selected from ThermoMutDB (v1.3). These proteins share <25% sequence similarity with the proteins in the S2648 and VariBench databases, which have been extensively used as training data in many previous studies.

S461 (ref. 52), a subset of the S669 dataset with errors manually corrected, contains 461 single-point mutations. The S461 dataset is used as an auxiliary benchmark to evaluate ΔΔG prediction.

S557 is a subset of the S571 dataset constructed in our previous work25 to specifically address the ΔTm evaluation problem. Since pH values are no longer considered, redundant entries were removed from the original dataset. The resulting dataset contains 557 single mutations and is used as an objective benchmark to evaluate ΔTm prediction.

Training process for SPIRED-Stab

The training of SPIRED-Stab can be divided into three stages. In all training stages, we adopted the Adam optimizer, and the learning rate was halved if the validation loss did not decrease for five consecutive epochs.

In the first stage, the model parameters of SPIRED-Fitness (except for the MLP module) were used as the starting point of SPIRED-Stab. The training dataset was the cDNA proteolysis dataset described above, and the Soft Spearman Loss was used to evaluate the Spearman correlation coefficient between the predicted and experimental ΔΔG values. The initial learning rate of this stage was 10−3, and all parameters except the final ΔΔG_coef and ΔTm_coef parameters (Algorithm 19) were optimized.

Since the ΔΔG values in the cDNA proteolysis dataset are derived from Bayesian inference, it is necessary to further train the model on ΔΔG/ΔTm data with experimentally measured values. In the second stage, SPIRED-Stab was therefore trained on our collected and curated ΔΔG/ΔTm dataset, namely the Dual Task Dataset, with the Soft Spearman Loss employed to optimize the ranking correlation. In this stage, the MLP layer for ΔΔG prediction was optimized with an initial learning rate of 5 × 10−4, and the corresponding value for ΔTm prediction was 5 × 10−3 (Fig. 1e).

In the third stage, the numerical difference between the predicted and experimentally determined ΔΔG/ΔTm values was minimized using the Mean Squared Error (MSE) Loss. During training in this stage, the majority of the parameters of SPIRED-Stab were frozen, and only the final ΔΔG_coef and ΔTm_coef parameters were updated, with an initial learning rate of 10−2, aiming to match the predicted values to the actual ΔΔG/ΔTm distributions without perturbing the learned ranking of mutational effects.
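A minimal sketch of this calibration stage is given below, assuming the two scaling parameters are exposed as model attributes; the names `ddg_coef`/`dtm_coef` and the data-loader interface are hypothetical.

```python
import torch
import torch.nn.functional as F

def calibrate_scaling_coefs(model, loader, epochs: int = 10):
    """Stage-3 sketch: freeze all weights except the two final scaling
    parameters (exposed here as model.ddg_coef / model.dtm_coef, an
    assumed interface) and fit them with an MSE loss at lr = 1e-2."""
    for p in model.parameters():
        p.requires_grad_(False)
    coefs = [model.ddg_coef, model.dtm_coef]
    for c in coefs:
        c.requires_grad_(True)
    opt = torch.optim.Adam(coefs, lr=1e-2)
    for _ in range(epochs):
        for wt, mut, ddg, dtm in loader:        # hypothetical batch layout
            pred_ddg, pred_dtm = model(wt, mut)
            loss = F.mse_loss(pred_ddg, ddg) + F.mse_loss(pred_dtm, dtm)
            opt.zero_grad()
            loss.backward()
            opt.step()
```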
Evaluation metrics

In this study, we utilize TM-score and lDDT to assess the similarity between predicted and true protein structures. In addition, we mainly employ the Spearman correlation coefficient to assess the predictive power of the tested models on fitness, ΔΔG and ΔTm values, i.e. the correlation between the predicted scores and the experimental fitness/ΔΔG/ΔTm values over different mutations.

TM-score32 (Template Modeling score) is a metric used to assess the topological similarity between protein structures. The structure of interest (i.e. the target) is aligned to a reference structure (i.e. the template), and the score is computed from the distances of the aligned residue pairs, maximized over all superpositions. According to Eq. (3), TM-score ranges from 0 to 1, with a value of 1 indicating a perfect match between structures. TM-score is more sensitive to the global topology than to local structural differences: a value below 0.17 indicates a lack of relationship between the protein structures, whereas a value greater than 0.5 indicates that they share the same topology.
$$\text{TM-score}=\max\left[\frac{1}{L_N}\sum_{i=1}^{L_r}\frac{1}{1+(d_i/d_0)^2}\right],$$
(3)
where di represents the distance between the ith aligned residue pair, d0 is a normalization scale, LN denotes the original length of the protein, and Lr represents the number of aligned residues.

lDDT65 (local Distance Difference Test) is a superposition-free score that quantifies the difference in local inter-residue distances between the predicted and reference structures. First, the distances between all pairs of atoms in the reference structure are computed, excluding pairs beyond the inclusion radius R0 and pairs of atoms within the same residue. The distances between the corresponding atom pairs are then computed in the predicted structure, and the absolute difference for each atom pair is calculated. The fractions of atom pairs with differences below four thresholds (0.5, 1, 2, and 4 Å) are then averaged to produce the lDDT score. In this study, we only calculate the lDDT score for Cα atoms (lDDT-Cα), with R0 = 15 Å.
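A simplified NumPy re-implementation of lDDT-Cα following the description above is sketched below; the reference lDDT software averages scores per atom and handles further details that are omitted here.

```python
import numpy as np

def lddt_ca(pred: np.ndarray, ref: np.ndarray, r0: float = 15.0) -> float:
    """Simplified lDDT on C-alpha coordinates of shape (L, 3)."""
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    d_pred = np.linalg.norm(pred[:, None] - pred[None, :], axis=-1)
    L = len(ref)
    # inclusion radius applied in the reference structure; no self-pairs
    mask = (d_ref < r0) & ~np.eye(L, dtype=bool)
    diff = np.abs(d_ref - d_pred)[mask]
    # fraction of preserved distances at each tolerance, then averaged
    return float(np.mean([(diff < t).mean() for t in (0.5, 1.0, 2.0, 4.0)]))
```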
The Pearson correlation coefficient (r) quantifies the strength of the linear relationship between two variables X and Y. As shown in Eq. (4), it is computed as the ratio of the covariance of the two variables to the product of their standard deviations. The coefficient ranges from −1 to 1, where −1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.
$$r_{X,Y}=\frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y},$$
(4)
where cov denotes the covariance and σ stands for the standard deviation (i.e. the square root of the variance).

The Spearman correlation coefficient (ρ) is commonly used to describe the strength of a monotonic relationship between two variables. As shown in Eq. (5), it is calculated by applying the Pearson formula to the ranked values of the pair of variables (X, Y), which makes it more robust to outliers in the data.
$$\rho_{X,Y}=r_{\mathrm{R}(X),\mathrm{R}(Y)}=\frac{\mathrm{cov}(\mathrm{R}(X),\mathrm{R}(Y))}{\sigma_{\mathrm{R}(X)}\,\sigma_{\mathrm{R}(Y)}},$$
(5)
where R denotes the ranking operation on the variables.

The Kendall correlation coefficient (τ) is another non-parametric metric for the correlation between the ranks of two variables, and can be interpreted in terms of the probabilities of observing concordant and discordant pairs (Eq. (6)). The Kendall correlation coefficient is more robust than the Spearman correlation coefficient, while usually being smaller in magnitude:
$$\tau=\frac{n_c-n_d}{n_c+n_d},$$
(6)
where nc denotes the number of concordant pairs and nd the number of discordant pairs.
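All three correlation coefficients are available in SciPy, which offers a convenient way to reproduce Eqs. (4)-(6) on toy data; note that scipy.stats.kendalltau defaults to the tau-b variant, which additionally corrects for ties relative to Eq. (6).

```python
from scipy import stats

# Pearson, Spearman and Kendall (Eqs. (4)-(6)) on toy data.
pred = [0.1, 0.4, 0.35, 0.8]
label = [0.0, 0.5, 0.30, 0.9]
r, _ = stats.pearsonr(pred, label)
rho, _ = stats.spearmanr(pred, label)
tau, _ = stats.kendalltau(pred, label)  # tau-b by default (tie-corrected)
```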
Top K precision measures the fraction of the truly top-K mutations among the predicted top-K mutations (Eq. (7)), and serves as a reference for the success rate in real-world protein engineering:
$$\text{Top K precision}=\frac{\sum_{i}^{n} I_{1\le \mathrm{rank}(\hat{Y}_i)\le K}\, I_{1\le \mathrm{rank}(Y_i)\le K}}{\sum_{i}^{n} I_{1\le \mathrm{rank}(\hat{Y}_i)\le K}},$$
(7)
where \(\mathrm{rank}(\hat{Y}_i)\) and \(\mathrm{rank}(Y_i)\) denote the rank (in descending order) of the predicted value and of the label, respectively, and I is the indicator function.

NDCG (Normalized Discounted Cumulative Gain) is a metric used in ProteinGym43 for evaluating fitness prediction methods. Suppose that the top K scores provided by a predictor are sorted in descending order as \(\hat{Y}_1\ge \hat{Y}_2\ge \cdots \ge \hat{Y}_K\). DCG (Discounted Cumulative Gain) sums the corresponding true labels, discounting each term according to its predicted rank:
$$\mathrm{DCG}=\sum_{i}^{K}\frac{Y_i}{\log_2(i+1)},$$
(8)
where Yi is the true label of the ith-ranked variant among the top K predictions. NDCG normalizes the DCG of a predicted ranking by the ideal DCG, which is calculated in the same way but with a perfect ranking based on the true labels. This metric rewards models that place variants of higher fitness in earlier positions.

Top 10% recall is a metric adopted by ProteinGym43 for fitness prediction evaluation, reporting the proportion of truly top-10% variants among the top-10% predictions. Its definition is identical to Eq. (7), except that K corresponds to a fixed ratio of 10%.
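Both ranking metrics can be sketched in a few lines of NumPy, as below; the exact tie-breaking and label preprocessing used by ProteinGym are omitted.

```python
import numpy as np

def top_k_precision(pred: np.ndarray, label: np.ndarray, k: int) -> float:
    """Eq. (7): fraction of the predicted top-k that are truly top-k
    (descending order; ties broken arbitrarily by argsort)."""
    pred_top = set(np.argsort(-pred)[:k])
    true_top = set(np.argsort(-label)[:k])
    return len(pred_top & true_top) / k

def ndcg_at_k(pred: np.ndarray, label: np.ndarray, k: int) -> float:
    """Eq. (8) normalized by the ideal DCG (assumes non-negative labels)."""
    order = np.argsort(-pred)[:k]                  # predicted top-k, best first
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = float((label[order] * discounts).sum())
    idcg = float((np.sort(label)[::-1][:k] * discounts).sum())
    return dcg / idcg
```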
AUC, the area under the receiver operating characteristic (ROC) curve, is a metric adopted by ProteinGym43 for measuring the binary classification performance of models. The value of AUC ranges from 0.5 to 1, where 0.5 and 1 correspond to random and perfect classification, respectively.

MCC (Matthews correlation coefficient) is a metric adopted by ProteinGym43 for evaluating the performance of binary and multiclass classification, taking both true and false positives/negatives into account. The value of MCC ranges from −1 to 1, where −1 represents a completely inverted classification, 0 a random prediction, and 1 a perfect classification. For binary classification, MCC is computed as:
$$\mathrm{MCC}=\frac{\mathrm{TP}\times \mathrm{TN}-\mathrm{FP}\times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}},$$
(9)
where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives and false negatives, respectively.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
