Prediction of mutation-induced protein stability changes based on the geometric representations learned by a self-supervised method

The architecture of mutDDG-SSM

The proposed mutDDG-SSM consists of two parts: the protein structural feature extractor and the mutation-induced protein stability change predictor, as shown in Fig. 1. The GAT model, which excels at extracting geometric features, was used as the feature extractor. A self-supervised learning scheme was designed to train the GAT model to extract representations of the intrinsic inter-residue interactions within a protein structure. The geometric representations extracted by the pre-trained GAT model were input to the second part of mutDDG-SSM to predict the changes in protein stability caused by residue mutations. The XGBoost model was applied as the protein stability change predictor owing to its good performance in avoiding overfitting.

Fig. 1 Overall view of the mutDDG-SSM framework. a The self-supervised GAT encoder that extracts the geometric representations encoded in the protein structure. b The XGBoost predictor that evaluates the changes in protein stability upon residue mutation using the features learned by the GAT encoder

Design of tasks for the self-supervised learning scheme

To learn the intrinsic features encoding the atomic interactions of a residue with its surrounding neighbors in protein structures, a self-supervised learning scheme was designed. A large-scale set of non-redundant, high-resolution protein structures was used for the self-supervised learning. For a given protein structure, the side chain of a randomly selected residue was perturbed, and the task for the self-supervised training was to predict the original conformation of the perturbed residue within the protein. In this way, the essential geometric features underlying the atomic interactions of the residue with its surroundings in the protein structure were captured.

Specifically, in residue perturbation, the values of the torsion angles of the perturbed residue were sampled according to the distributions and probabilities provided by the Dunbrack backbone-dependent rotamer library [39]. The Dunbrack rotamer library, derived from the analysis of large-scale sets of native protein structures, provides the probabilities of discrete side-chain torsion angles of a residue conditioned on the backbone dihedral angle values. Based on the Dunbrack rotamer library, we randomly altered the residue conformation within the protein structure to adopt a new conformation. Given the perturbed conformation, the task of the self-supervised learning is to predict the original conformation of the residue within the protein. For glycine and alanine, whose side chains consist only of a hydrogen atom or a methyl group, respectively, the side-chain conformation was left unchanged.
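As an illustration, the sampling step might look like the following minimal sketch. The rotamer_table contents, the perturb_side_chain helper, and all numeric values are invented placeholders, not real Dunbrack library entries; a real pipeline would load the library itself.

```python
import numpy as np

# Hypothetical rotamer table: for a (residue_type, phi_bin, psi_bin) key,
# a list of (probability, chi_angles) rotamer states. Values below are
# invented placeholders; a real implementation would load them from the
# Dunbrack backbone-dependent rotamer library.
rotamer_table = {
    ("LEU", -60, -40): [
        (0.60, (-65.0, 175.0)),
        (0.30, (-175.0, 65.0)),
        (0.10, (65.0, 175.0)),
    ],
}

def perturb_side_chain(res_type, phi, psi, rng):
    """Sample a new side-chain rotamer for one residue, weighted by the
    backbone-dependent rotamer probabilities. Gly/Ala have no rotatable
    side-chain torsions and are left unchanged."""
    if res_type in ("GLY", "ALA"):
        return None  # nothing to perturb
    # Bin the backbone dihedrals to the nearest grid point of the table.
    key = (res_type, round(phi / 10) * 10, round(psi / 10) * 10)
    states = rotamer_table[key]
    probs = np.array([p for p, _ in states])
    idx = rng.choice(len(states), p=probs / probs.sum())
    return states[idx][1]  # chi angles of the sampled rotamer

rng = np.random.default_rng(0)
new_chis = perturb_side_chain("LEU", -63.2, -41.7, rng)
# new_chis would then be applied to the structure; the self-supervised
# target is each perturbed atom's displacement from its native position.
```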
Representation extraction by the self-supervised learning scheme with a graph attention network model

The GAT model was used for the self-supervised learning to capture the geometric features encoding the inter-residue interactions within the protein structure. The GAT model describes protein structures as graphs, in which atoms are represented by nodes and the interactions between atoms are represented as edges. The attention mechanism of the model enables the features of each node to be updated from its neighborhood's features with different weights. The GAT model has been widely applied in bioinformatics studies, such as protein–protein binding prediction [40], protein–ligand interaction prediction [41], and protein function prediction [42].

Specifically, for a given protein structure with a perturbed residue, a graph was first built, in which the atoms in the protein were represented by nodes and the interactions between atoms were simplified as edges [29]. Only the heavy atoms in the structure, i.e., carbon, nitrogen, oxygen, and sulfur, were considered for the graph construction. In addition, to reduce the computational complexity, only the atoms within a radius of 12.0 Å of the center of the perturbed residue were taken into account when building the graph. If a residue had only a portion of its atoms falling within the 12.0 Å range, all the atoms belonging to that residue were retained in the graph. If the distance between two nodes was less than a threshold (3.0 Å was adopted), an edge was assumed to exist between them (a graph-construction sketch follows the list below). Each node in the graph was attached with attributes describing the properties of the atom. In this study, 36-dimensional attributes were considered for each node, which include:

(1)

The atom type, i.e., carbon, nitrogen, oxygen, or sulfur, represented by one-hot encoding;

(2)

The type of the amino acid to which the atom belongs, represented by a 20-dimensional one-hot encoding;

(3)

The DSSP secondary structure type [43] of the residue containing the atom, i.e., alpha-helix, isolated beta-bridge, strand, 3₁₀-helix, pi-helix, turn, bend, or none, represented by one-hot encoding;

(4)

A one-hot encoding indicating whether or not the atom was perturbed;

(5)

A binary attribute representing whether the solvent accessible surface area (SASA) [44] of the atom is greater than 0 or not (denoted by 1 or 0, respectively);

(6)

A binary feature indicating whether the atom is a Cα atom or not, represented by 1 or 0, respectively.
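A minimal sketch of this graph construction is given below, assuming the heavy-atom coordinates and the 36-dimensional attributes above have already been computed; the function and argument names are illustrative, and the bookkeeping that retains whole residues with partial 12.0 Å overlap is simplified to a plain distance cutoff.

```python
import numpy as np
import torch
from torch_geometric.data import Data

def build_graph(coords, features, center, radius=12.0, edge_cutoff=3.0):
    """Build the atom graph for one perturbed residue.

    coords   : (N, 3) array of heavy-atom coordinates (C, N, O, S only)
    features : (N, 36) array of the per-atom attributes listed above
    center   : (3,) geometric center of the perturbed residue
    """
    # Keep only atoms within 12.0 A of the perturbed residue's center.
    # (The paper additionally keeps all atoms of any residue that has at
    # least one atom inside the sphere; that step is omitted here.)
    keep = np.linalg.norm(coords - center, axis=1) <= radius
    coords, features = coords[keep], features[keep]

    # Connect two atoms by an edge when they are closer than 3.0 A.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    src, dst = np.nonzero((dists < edge_cutoff) & (dists > 0.0))

    return Data(
        x=torch.as_tensor(features, dtype=torch.float32),
        edge_index=torch.as_tensor(np.stack([src, dst]), dtype=torch.long),
    )
```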

The node features and edges constructed from the protein structure upon residue perturbation were fed into the GAT model. GAT updates the node features using a self-attention mechanism. The normalized attention coefficients were calculated according to the formula described in reference [29]. Then, based on the normalized attention coefficients, the features of each node were updated by a weighted summation of the features of its neighboring nodes. In this study, multi-head attention was used in each self-attention layer, in which the output feature representation was the concatenation of the outputs of eight independent self-attention operations. To better extract the representations encoding the inter-residue interactions within the protein structure, four eight-head attention layers were stacked, and the output representations from the third and fourth layers were concatenated as the final geometric representations.

Based on the geometric representations extracted by the GAT model, a three-layer perceptron network was employed to predict the original conformation of the perturbed residue within the protein. Considering that directly predicting the absolute atomic positions is difficult, our model instead predicts the deviation of each atom between its perturbed and original positions, expressed by
$$\Delta {d}_{i}=MLP\left({\overrightarrow{g}}_{i}\right)$$
(1)
where \(\Delta {d}_{i}\) is the predicted deviation of the perturbed atom \(i\) from its native position, \({\overrightarrow{g}}_{i}\) is the geometric representation of atom \(i\), and \(MLP\) stands for the multi-layer perceptron network. The real deviation of the perturbed atom \(i\), denoted as \({\Delta d}_{i}^{real}\), was computed as the root mean square deviation (RMSD) of the atomic coordinates between the perturbed and original conformations. The loss function of the GAT model for the self-supervised learning was then defined as the mean square error between the predicted and real deviations over the \(N_{p}\) atoms of the perturbed residue, given by
$$L=\frac{1}{{N}_{p}}\sum_{i=1}^{{N}_{p}}{\left(\Delta {d}_{i}- {\Delta d}_{i}^{real}\right)}^{2}$$
(2)
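For concreteness, a minimal PyTorch Geometric sketch of the encoder and this self-supervised objective is given below. It assumes PyG's GATConv as the attention layer; the hidden width and MLP sizes are illustrative, while the four eight-head layers, the concatenation of the third and fourth layer outputs, the three-layer perceptron head of Eq. (1), and the loss of Eq. (2) follow the text.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class SelfSupervisedGAT(nn.Module):
    """Four stacked eight-head GAT layers; the concatenated outputs of the
    third and fourth layers form the geometric representation (g_i of
    Eq. 1), and a three-layer perceptron predicts each atom's deviation."""

    def __init__(self, in_dim=36, hidden=64, heads=8):
        super().__init__()
        dims = [in_dim] + [hidden * heads] * 3   # input widths of the 4 layers
        self.layers = nn.ModuleList(GATConv(d, hidden, heads=heads) for d in dims)
        self.mlp = nn.Sequential(                # three-layer perceptron head
            nn.Linear(2 * hidden * heads, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x, edge_index):
        outs = []
        for layer in self.layers:
            x = torch.relu(layer(x, edge_index))
            outs.append(x)
        g = torch.cat([outs[2], outs[3]], dim=-1)  # layers 3 and 4
        return self.mlp(g).squeeze(-1), g          # predicted deviation, g_i

# One training step: MSE over the perturbed residue's atoms (Eq. 2),
# with the Adam settings reported later in the text (lr = 0.001).
model = SelfSupervisedGAT()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(data, perturbed_mask, real_dev):
    pred, _ = model(data.x, data.edge_index)
    loss = ((pred[perturbed_mask] - real_dev) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```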
The GAT model was implemented in the framework of PyTorch [45] and PyTorch Geometric [46]. In summary, based on the self-supervised learning scheme with the GAT model, geometric representations encoding the inter-residue interactions were extracted from large-scale non-redundant protein structures.

Prediction of mutation-induced protein stability changes by eXtreme Gradient Boosting model

Based on the geometric representations extracted by the self-supervised learning scheme, the changes in protein stability upon residue mutation were predicted using an XGBoost model, which performs well in avoiding overfitting compared to other machine-learning models.

The geometric representations of both the original protein and its mutant were first generated using the trained self-supervised GAT model. As given by Eq. (3), the geometric representations of each atom were taken from the third and fourth attention layers of the GAT model. The representations of the atoms in the mutated residue, and those of the other atoms within a radius of 12.0 Å of the center of the mutated residue, are denoted by
$$\left\{ {\vec{h}_{i}^{\prime\left( L \right)} ,i \in A_{mo} } \right\},\ \left\{ {\vec{h}_{j}^{\prime\left( L \right)} ,j \in A_{no} } \right\},\ \left\{ {\vec{h}_{k}^{\prime\left( L \right)} ,k \in A_{mm} } \right\},\ \left\{ {\vec{h}_{l}^{\prime\left( L \right)} ,l \in A_{nm} } \right\}$$
(3)
where \(L = 3\) or \(4\) stands for the representations derived from the third or fourth layer of the GAT model, respectively; \({A}_{mo}\), \({A}_{no}\), \({A}_{mm}\), and \({A}_{nm}\) denote the atoms belonging to the mutated residue in the original protein, the non-mutated residues within a 12.0 Å radius in the original protein, the mutated residue in the mutant protein, and the non-mutated residues within a 12.0 Å radius in the mutant protein, respectively. Then, both the maximum and mean values over the atoms in the mutated residue were computed to represent the geometric features of the mutated residue. The maximum and mean values over the non-mutated atoms were calculated to represent the geometric features of the environment around the mutated residue. The differences between the maximum and mean values of the mutated and non-mutated atoms were also computed to represent the distinct geometric features of the mutated residue in comparison with its environment. All these geometric representations, for both the original protein and its mutant, were concatenated and standardized by removing the mean and scaling to unit variance, which can be expressed as
$$\begin{aligned} \vec{F}_{mo}^{\left( L \right)} = & \left\{ {\vec{h}_{i}^{\prime\left( L \right)} ,i \in A_{mo} } \right\}_{max} \parallel \left\{ {\vec{h}_{i}^{\prime\left( L \right)} ,i \in A_{mo} } \right\}_{mean} \\ \vec{F}_{no}^{\left( L \right)} = & \left\{ {\vec{h}_{j}^{\prime\left( L \right)} ,j \in A_{no} } \right\}_{max} \parallel \left\{ {\vec{h}_{j}^{\prime\left( L \right)} ,j \in A_{no} } \right\}_{mean} \\ \vec{F}_{mo-no}^{\left( L \right)} = & \vec{F}_{mo}^{\left( L \right)} - \vec{F}_{no}^{\left( L \right)} \\ \vec{F}_{mm}^{\left( L \right)} = & \left\{ {\vec{h}_{k}^{\prime\left( L \right)} ,k \in A_{mm} } \right\}_{max} \parallel \left\{ {\vec{h}_{k}^{\prime\left( L \right)} ,k \in A_{mm} } \right\}_{mean} \\ \vec{F}_{nm}^{\left( L \right)} = & \left\{ {\vec{h}_{l}^{\prime\left( L \right)} ,l \in A_{nm} } \right\}_{max} \parallel \left\{ {\vec{h}_{l}^{\prime\left( L \right)} ,l \in A_{nm} } \right\}_{mean} \\ \vec{F}_{mm-nm}^{\left( L \right)} = & \vec{F}_{mm}^{\left( L \right)} - \vec{F}_{nm}^{\left( L \right)} \\ \vec{F}^{\left( L \right)} = & std\left( {\vec{F}_{mo}^{\left( L \right)} \parallel \vec{F}_{no}^{\left( L \right)} \parallel \vec{F}_{mo-no}^{\left( L \right)} \parallel \vec{F}_{mm}^{\left( L \right)} \parallel \vec{F}_{nm}^{\left( L \right)} \parallel \vec{F}_{mm-nm}^{\left( L \right)} } \right) \\ \end{aligned}$$
(4)
where \(L = 3\) or \(4\) stands for the representations derived from the third or fourth layer of the GAT model, respectively; \(std\left(\cdots \right)\) denotes the standardization operation for each feature over the whole training dataset; the subscripts 'max' and 'mean' denote the max-pooling and mean-pooling operations over the atoms in the corresponding atom sets, respectively; and \(\parallel\) represents the concatenation of the geometric representations. The geometric representations from the third and fourth layers of the GAT model were then concatenated and fed into the XGBoost model, which outputs the change in protein stability \(\Delta \Delta G\) caused by the residue mutation, i.e.,
$$\Delta \Delta G=XGBoost\left({\overrightarrow{F}}^{\left(3\right)}\parallel {\overrightarrow{F}}^{\left(4\right)}\right)$$
(5)
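A sketch of this pooling and prediction stage is given below, assuming the per-atom representations have already been extracted by the trained encoder; the array names, the placeholder data, and the representation width are illustrative, while the hyperparameter values are taken from the paper's search grid.

```python
import numpy as np
import xgboost as xgb

def pool(reps):
    """Max- and mean-pool per-atom representations and concatenate (Eq. 4)."""
    return np.concatenate([reps.max(axis=0), reps.mean(axis=0)])

def layer_features(h_mo, h_no, h_mm, h_nm):
    """Eq. (4) for one GAT layer (L = 3 or 4). h_* are (n_atoms, d) arrays:
    mutated residue (m*) and its 12.0-A environment (n*) in the original
    (*o) and mutant (*m) structures."""
    f_mo, f_no = pool(h_mo), pool(h_no)
    f_mm, f_nm = pool(h_mm), pool(h_nm)
    return np.concatenate([f_mo, f_no, f_mo - f_no,
                           f_mm, f_nm, f_mm - f_nm])

# Layer-3 and layer-4 feature vectors are concatenated, standardized over
# the training set, and fed to XGBoost (Eq. 5). Placeholder data below.
d = 32                                            # illustrative width
rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 2 * 6 * 2 * d))
y_train = rng.standard_normal(100)
reg = xgb.XGBRegressor(n_estimators=10000, max_depth=6, subsample=0.7,
                       colsample_bytree=0.56, learning_rate=0.05)
reg.fit(X_train, y_train)
ddG_pred = reg.predict(X_train[:1])               # predicted ddG
```

In the full procedure described below, ten such regressors trained on different nine-of-ten splits of the training set are averaged to give the final \(\Delta \Delta G\).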
In our study, the XGBoost model was trained and tested using independent protein datasets with available experimental mutation-induced stability change data.

The training techniques of mutDDG-SSM

The proposed mutDDG-SSM framework is composed of two separate components: the self-supervised GAT encoder, which extracts the geometric representations of the protein structure, and the XGBoost predictor, which predicts the changes in protein stability upon residue mutation from the features learned by the GAT encoder. The two parts were trained separately on different datasets. The GAT model was trained on a large-scale unlabeled high-resolution protein structure dataset via the self-supervised learning scheme. The XGBoost model was trained on a labeled dataset of experimentally measured mutation-induced stability changes.

The GAT model was trained using a batch gradient descent approach with the Adam optimizer; the batch size was 128 and the learning rate was set to 0.001. For the training of the XGBoost model, the training set was divided into ten subsets and ten separate models were optimized, each trained on nine subsets with the remaining subset used as the validation set. The average of the outputs of the ten optimized models was taken as the final output. The hyperparameters of the XGBoost model were chosen by grid search from \(\text{n}\_\text{estimators}\in \left\{10000, 20000, 30000\right\}\), \(\text{max}\_\text{depth}\in \left\{5, 6, 7\right\}\), \(\text{subsample}\in \left\{0.6, 0.7, 0.8\right\}\), \(\text{colsample}\_\text{bytree}\in \left\{0.55, 0.56, 0.57\right\}\) and \(\text{learning}\_\text{rate}\in \left\{0.02, 0.05, 0.1\right\}\). The hyperparameter combination yielding the best model performance was selected.

Dataset preparation

To train the geometric feature extractor, i.e., the GAT model, via the self-supervised learning approach, a large-scale training dataset was constructed from the Protein Data Bank (PDB) using the PISCES [47] server. PISCES provides a tool to screen non-redundant, high-resolution protein structures from the PDB based on sequence identity and structural quality. Considering that the quality of the protein structures in the training dataset influences the performance of the model [48], only the protein structures meeting the following criteria were selected and included in the dataset (a filtering sketch follows the list):

(1)

The structure was obtained by X-ray crystallography with an R-value less than 0.25;

(2)

The resolution of the structure was below 2.0 Å;

(3)

The protein length was within the range of 40–500 amino acids, with no chain breaks or missing residues;

(4)

Sequence identity among the proteins in the dataset was below 25%.
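For illustration, the criteria above amount to a simple filter over PDB metadata; a minimal sketch follows, assuming a pandas DataFrame of PISCES-style entries (the column names and rows are hypothetical placeholders).

```python
import pandas as pd

# Hypothetical metadata table; PISCES output provides equivalent fields.
entries = pd.DataFrame({
    "pdb_id":     ["1ABC", "2XYZ"],
    "method":     ["XRAY", "NMR"],
    "r_value":    [0.18, None],
    "resolution": [1.6, None],
    "length":     [250, 80],
    "has_breaks": [False, False],
})

selected = entries[
    (entries["method"] == "XRAY")
    & (entries["r_value"] < 0.25)
    & (entries["resolution"] < 2.0)
    & entries["length"].between(40, 500)
    & ~entries["has_breaks"]
]
# The < 25% sequence identity criterion is enforced by PISCES's own
# redundancy screening and is not re-checked here.
print(selected["pdb_id"].tolist())  # -> ['1ABC']
```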

A total of 5893 protein structures were collected. After adding missing atoms using PDBFixer (https://github.com/openmm/pdbfixer) [49], these protein structures were included in the dataset for the training and validation of the GAT model. From the collected dataset, 5238 protein structures were randomly partitioned into the training set, which was used to train the protein geometric feature extractor, namely the GAT model, with the self-supervised learning scheme discussed above. The remaining 655 protein structures were taken as the validation set. The PDB accession codes, along with the organism and structural class, of the protein structures in the training and validation sets are listed in Supplementary Table 1. During the training and validation of the GAT model, residue perturbation was performed 2000 times for each protein structure; therefore, 10,476,000 and 1,310,000 data points were actually used for the training and validation of the GAT model, respectively.

The mutation-induced \(\Delta \Delta G\) predictor, namely the XGBoost model, was trained on the widely adopted Q3421 dataset. The prediction performance of the model was tested on several commonly used datasets, including Ssym, S350, S611, S276 and S669, and on two protein cases, myoglobin and p53, to explore the accuracy and generalizability of our model.

Q3421 contains 3421 single-point mutations with experimentally measured \(\Delta \Delta G\) values from 150 proteins. This dataset contains 14 proteins that are also included in the Ssym test dataset. These 14 proteins, along with the related mutation data, were therefore removed from Q3421, yielding a dataset of 3213 mutations called Q3213. Furthermore, to balance the stabilizing and destabilizing mutations in the dataset, an inverse mutation with the opposite \(\Delta \Delta G\) value was created for each direct mutation. The structures of the proteins with inverse mutations were constructed using Rosetta [50]. In this way, the Q3213 dataset was augmented to a balanced dataset containing 6426 mutations.

For the test datasets, Ssym is composed of 684 mutations, including 342 direct mutations with available experimental \(\Delta \Delta G\) values as well as the corresponding inverse mutations. S350 consists of 350 experimental mutations from 67 different proteins. S611 is an extension of S350 that includes 611 direct and inverse mutations. S276 contains 276 experimental \(\Delta \Delta G\) values from 37 different proteins, and S669 consists of 669 experimental data points from 94 proteins. These datasets were used to test the performance of the model trained on Q3213. After removing the mutations shared with the training set Q3213, the number of mutations in S350 decreased to 203 (named S203) and that of S611 was reduced to 347 (called S347), with 203 direct and 144 inverse mutations. For S276 and S669, the numbers of experimental values were reduced to 254 and 615 (called S254 and S615, respectively) after removing the mutations shared with Q3213. S254 and S615 were then augmented to include the corresponding inverse mutations of the experimental data (a sketch of this augmentation follows). The PDB accession codes and mutation information of the protein structures in S203, S347, S254 and S615 are listed in Supplementary Tables 2, 3, 4 and 5, respectively.
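The sign-flipping augmentation reduces to a few lines; the sketch below uses a pandas table with illustrative column names and placeholder values, not experimental data (in the paper's pipeline the inverse-mutant structures were additionally built with Rosetta).

```python
import pandas as pd

# Toy table of direct mutations (placeholder values, not real data).
direct = pd.DataFrame({
    "pdb_id":   ["XXXX", "YYYY"],
    "mutation": ["A10G", "V25L"],    # wild-type, position, mutant residue
    "ddG":      [-1.0, 0.5],
})

def invert(mut):
    """Swap wild-type and mutant amino acids: A10G -> G10A."""
    return mut[-1] + mut[1:-1] + mut[0]

# The inverse mutation is assigned the opposite ddG sign.
inverse = direct.assign(mutation=direct["mutation"].map(invert),
                        ddG=-direct["ddG"])
balanced = pd.concat([direct, inverse], ignore_index=True)
```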
Myoglobin and p53, which were used as case studies, contain 134 and 42 experimental mutations, respectively. The mutations in myoglobin and p53 are listed in Supplementary Tables 6 and 7. The corresponding inverse mutations were also constructed and included in the datasets for these two proteins.

Performance measures

In this study, the performance of the mutDDG-SSM model was evaluated and compared with previously developed methods using the Pearson correlation coefficient (PCC), root mean square error (RMSE), and mean absolute error (MAE) between the predicted \(\Delta \Delta G\) values and the experimental data. The accuracy (ACC), sensitivity (SEN), specificity (SPE) and Matthews correlation coefficient (MCC) were applied to evaluate the performance of the model in classifying stabilizing and destabilizing mutations. In addition, to evaluate the bias of the model in predicting the \(\Delta \Delta G\) values of direct and inverse mutations, \({r}^{d-i}\) and \(\langle \delta \rangle\) were calculated by
$${r}^{d-i}=PCC\left({\Delta \Delta G}_{direct}, {\Delta \Delta G}_{inverse}\right)$$
(6)
$$\langle \delta \rangle = \frac{1}{n}\sum_{i=1}^{n}\left({\Delta \Delta G}_{direct}+{\Delta \Delta G}_{inverse}\right)$$
(7)
where \({\Delta \Delta G}_{direct}\) and \({\Delta \Delta G}_{inverse}\) represent the predicted values for the direct and inverse mutations, respectively, and \(n\) is the number of data points.
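The two bias measures of Eqs. (6)–(7) reduce to a few lines of numpy/scipy, as in the following sketch over illustrative arrays.

```python
import numpy as np
from scipy.stats import pearsonr

# Predicted ddG values for matched direct/inverse pairs (illustrative).
ddG_direct  = np.array([1.2, -0.8, 2.5, -1.1])
ddG_inverse = np.array([-1.0, 0.9, -2.2, 1.0])

r_di = pearsonr(ddG_direct, ddG_inverse)[0]   # Eq. (6)
delta = np.mean(ddG_direct + ddG_inverse)     # Eq. (7)
# An unbiased predictor yields r_di near -1 and <delta> near 0, since the
# inverse mutation should have exactly the opposite ddG.
print(f"r^(d-i) = {r_di:.3f}, <delta> = {delta:.3f}")
```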
