Graph masked self-distillation learning for prediction of mutation impact on protein–protein interactions

Definition of ΔΔG
Given the 3D structure of a complex and the wild-type and mutant residues, ΔΔG is defined as:$$\Delta \Delta G=\Delta {G}_{{mutant}}-\Delta {G}_{{wild}-{type}}$$
(1)
where \(\Delta {G}_{{mutant}}\) and \(\Delta {G}_{{wild}-{type}}\) represent the protein folding free energy of the mutant and wild-type complexes, respectively. A positive ΔΔG indicates a lower binding affinity between the two proteins, while a negative ΔΔG indicates a higher binding affinity.

Datasets

We combined PDBbind61 and 3D complex62 to generate the dataset for pre-training in the graph masked self-distillation learning module. Since protein function is fundamentally determined by its unique 3D structure, and unlike common approaches that remove redundancy by sequence similarity or do not consider data overlap at all, we removed redundancy based on structural similarity using the ECOD (Evolutionary Classification of Domains) classifier. This ensures diversity of structures rather than sequences and makes the dataset more suitable for structural geometric representation learning. The ECOD classification consists of four levels: X (possible homology), H (homology), T (topology), and F (family). Specifically, we retrieved all four levels for all chains of each complex from the ECOD classifier (file: “ecod.develop285.domains.txt”, version: “develop285”). Complexes were considered similar only if they contained identical classification domains, and only one complex was randomly selected from each set with the same classification domains. Finally, 12,000 unlabeled complexes were collected for pre-training (Supplementary Data 2).

AB-bind, SKEMPI, and SKEMPI 2.0 are three common and popular databases for the prediction of ΔΔG, containing experimentally measured ΔΔG values. It has been reported that SKEMPI 2.0 includes the data from AB-bind and SKEMPI36; however, we found that some entries in AB-bind and SKEMPI are still absent from SKEMPI 2.0. Therefore, the point mutation samples for model training, validation, and benchmark testing of ΔΔG in this study were collected from the combination of these three databases to ensure the use of sufficient labeled data. Specifically, the AB-bind and SKEMPI 1.0 databases consist of 645 and 2897 single-point mutations, respectively, while SKEMPI 2.0, as an updated version of SKEMPI, incorporates new mutations reported in the literature after the release of SKEMPI, including AB-bind, PROXiMATE63, and dbMPIKT64, resulting in a total of 6655 single-point mutations. We then used the ECOD classifier, as above, to evaluate the structural similarity of protein complexes. Specifically, when partitioning the dataset, similar complexes were assigned to only one of the training, validation, or test sets, guaranteeing that the data in these sets remain distinct from one another. Finally, after removing similar samples and those that could not be found in true PDB files, our dataset contains 4310 single-point mutations. This dataset was further divided into training, validation, and benchmark test subsets at a ratio of around 3:1:1 (Supplementary Data 1). It is worth noting that we required that no structurally similar complexes exist between any two of the sets to avoid data leakage, thereby allowing a fair performance evaluation. Additionally, the use of datasets with low similarity compels the model to learn implicit structural representations of the complexes more effectively. Taken together, these strategies enhance the generalization and robustness of our model.
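
As an illustration, the ECOD-based redundancy removal used for both the pre-training set and the dataset partitioning could be implemented as a simple grouping step. In the sketch below, `get_ecod_domains` is a hypothetical helper that parses “ecod.develop285.domains.txt” and returns the set of X/H/T/F identifiers over all chains of a complex; this is not the exact script used in this work.

```python
import random
from collections import defaultdict

def deduplicate_by_ecod(complexes, get_ecod_domains, seed=0):
    """Group complexes that share identical ECOD classification domains
    (X, H, T, F over all chains) and keep one random representative per group.

    `complexes` is a list of PDB identifiers; `get_ecod_domains(pdb_id)` is a
    user-supplied function returning a frozenset of ECOD domain identifiers
    (hypothetical helper)."""
    groups = defaultdict(list)
    for pdb_id in complexes:
        # Complexes with identical domain sets are treated as structurally similar.
        groups[get_ecod_domains(pdb_id)].append(pdb_id)

    rng = random.Random(seed)
    # Keep a single randomly selected complex from each similarity group.
    return [rng.choice(members) for members in groups.values()]
```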
We next evaluated the ability of PIANO to highlight pathogenic variants and crucial proteins, as well as to distinguish de novo variants in neurodevelopmental disease cases and controls, using the datasets sourced from Li et al.65. Computing the PIANO score for a mutation requires a pair of interacting proteins, the whole 3D complex structure, and the wild-type and mutant residues in the target protein; however, the source data only provide a single UniProt66 accession number for each mutation, lacking information about the corresponding interacting partner. To overcome this limitation, for each UniProt accession number in the source data, we first retrieved all human binary PPIs from HINT67 to find all PPIs involving that UniProt entry. SIFTS68 was then used to obtain the one-to-one mapping between PDB chains and UniProt accession numbers. If multiple PDB chains were mapped to a single UniProt accession number, we selected only the PDB structure with the most amino acids resolved. Finally, we collected 509 pathogenic, 287 benign, and 3806 VUS mutation sites that have a corresponding interacting partner and a true 3D complex PDB structure available for assessing the ability of PIANO to highlight pathogenic variants (Supplementary Data 4). Using these mutation data, we further retrieved the groups of proteins with different essentiality (olfactory, non-essential, recessive, dominant, essential, and haploinsufficient) in the source data to analyze PIANO’s capability of highlighting crucial proteins. We collected 0, 185, 1537, 1609, 14, and 1125 mutation sites for the olfactory, non-essential, recessive, dominant, essential, and haploinsufficient groups, respectively (Supplementary Data 4). The olfactory and essential groups were removed because of their very limited data size.

With the same information available for PIANO prediction, we also collected 643 de novo missense mutations in neurodevelopmental disorder probands (cases) and 129 de novo missense mutations in unaffected siblings of autism spectrum disorder probands (controls) to evaluate the ability of PIANO to distinguish de novo variants in neurodevelopmental disease cases and controls (Supplementary Data 5). As these data lack mutant residues, we considered the remaining 19 amino acid types distinct from the wild-type as potential mutant residues, and the maximum absolute ΔΔG among these 19 possible mutants was used to evaluate the impact of each de novo missense mutation.

Graph construction and feature extraction

For a given complex, we constructed graphs at both the residue and atom levels. In a residue graph, residues are nodes and their interactions are edges. In an atom graph, atoms are nodes and their interactions are edges, and only four atom types were considered: C, O, N, and S. Specifically, we selected the 64 residues closest to the mutation site as the nodes of the residue graph; this cutoff was determined by iteratively rebuilding the model with different numbers of residues (Supplementary Table 4). We also included residues within 10 Å of the interfaces as part of the residue graph nodes. A residue whose accessible surface area changes by more than 1.0 Å² between the complex and the isolated single chain is regarded as an interface site69. For each node in a residue graph, edges were assigned between that node and its 16 nearest nodes.
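
The residue-graph edge construction described here, together with the three edge features introduced in the next paragraph (Euclidean distance, polar angle, and azimuth angle), can be sketched as follows. This is a minimal illustration under the stated k = 16 cutoff; the spherical-angle convention (angles measured in the global coordinate frame) is an assumption, and the same featurization would apply to atom graphs built with the 3 Å distance cutoff.

```python
import numpy as np

def knn_edges(coords, k=16):
    """Connect each node to its k nearest neighbours (residue graph: k = 16)."""
    diff = coords[:, None, :] - coords[None, :, :]           # (M, M, 3) pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                           # exclude self-loops
    neighbours = np.argsort(dist, axis=1)[:, :k]
    return [(i, j) for i in range(len(coords)) for j in neighbours[i]]

def edge_features(coords, edges):
    """3D edge feature: Euclidean distance, polar angle, azimuth angle."""
    feats = []
    for i, j in edges:
        dx, dy, dz = coords[j] - coords[i]
        r = np.sqrt(dx**2 + dy**2 + dz**2)
        polar = np.arccos(np.clip(dz / (r + 1e-8), -1.0, 1.0))  # angle from the z-axis
        azimuth = np.arctan2(dy, dx)                             # angle in the x-y plane
        feats.append([r, polar, azimuth])
    return np.asarray(feats)
```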
In the pre-training phase, the mutation site was randomly selected. For an atom graph, all atoms corresponding to the residues in the associated residue graph were selected, and an edge was assumed between two atoms if their distance is less than 3 Å. In both residue and atom graphs, the edge feature vector contains three entries: the Euclidean distance, the polar angle, and the azimuth angle. For residue graphs, the edge features were calculated using the coordinates of the corresponding Cα atoms.

We employed a wide range of structural and sequence information to represent the nodes so that more comprehensive information could be considered. In total, we constructed a 108D feature vector for each residue and an 8D feature vector for each atom. These features are listed in detail below (a sketch assembling the full residue feature vector follows the list):

Features for residue nodes

1.

Residue type: This is a 40D one-hot vector representing the wild-type and mutant residues. In one-hot encoding, the categorical data are represented as binary vectors, where each vector contains a single value of 1 that signifies the presence of a particular category, while all other entries are set to 0.

2.

Position information: This 2D binary vector indicates whether the residue is the mutated residue and whether it is located in the same chain as the mutated residue.

3.

Residue property: This 8D feature vector includes whether a residue is a polar residue and seven other residue properties70: steric parameter, polarizability, volume, hydrophobicity, isoelectric point, helix probability and sheet probability.

4.

Residue depth: This 1D feature describes the average distance of the atoms of a residue from the solvent accessible surface. It was calculated using MSMS71.

5.

Secondary structure: The DSSP72 was used to calculate the secondary structure state (helix, strand, and coil), which was encoded as a 3D one-hot vector.

6.

Relative accessible surface area: This 1D feature indicates a degree of residue solvent exposure and was also calculated using DSSP.

7.

Evolutionary information: The position-specific scoring matrix (PSSM) profile and the hidden Markov model (HMM) profile were used to represent evolutionary information (Supplementary Fig. 11). Specifically, the PSSM profile was generated using PSI-BLAST73 against the NCBI non-redundant protein sequence database with the substitution matrix, number of iterations, and E-value set to BLOSUM62, 3, and 0.001, respectively. This yields a 20D feature vector for each residue, in which each value represents the log-likelihood score of substitution by one of the 20 residue types at that position during evolution. The HMM profile was generated by HHblits74 against UniRef3075 using the default parameters. Compared to the PSSM profile, the HMM profile contains additional information describing insertions, deletions, and matches in the multiple sequence alignment. The HMM profile contributes a 30D feature vector for each residue, including 20D observed frequencies of the 20 residue types in homologous sequences, 7D transition frequencies, and 3D local diversities. This feature thus contains 50 entries in total.

8.

Residue coordinate: This 3D feature vector contains the original coordinate information for each residue which was represented by its Cα atom.
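
The per-residue dimensions listed above sum to the stated 108D vector (40 + 2 + 8 + 1 + 3 + 1 + 20 + 30 + 3). The bookkeeping sketch below uses hypothetical placeholder inputs produced by the upstream tools and is illustrative only.

```python
import numpy as np

def residue_node_feature(res_onehot, pos_info, properties, depth,
                         sec_struct, rsa, pssm, hmm, ca_coord):
    """Concatenate the residue-level features into the 108D node vector.
    All inputs are placeholders supplied by upstream tools (PSI-BLAST, HHblits,
    MSMS, DSSP); shapes follow the dimensions listed in the text."""
    parts = [
        np.asarray(res_onehot),   # 40D wild-type + mutant one-hot
        np.asarray(pos_info),     # 2D mutation-site / same-chain indicators
        np.asarray(properties),   # 8D physicochemical properties
        np.atleast_1d(depth),     # 1D residue depth (MSMS)
        np.asarray(sec_struct),   # 3D secondary structure one-hot (DSSP)
        np.atleast_1d(rsa),       # 1D relative accessible surface area (DSSP)
        np.asarray(pssm),         # 20D PSSM profile (PSI-BLAST)
        np.asarray(hmm),          # 30D HMM profile (HHblits)
        np.asarray(ca_coord),     # 3D C-alpha coordinate
    ]
    feature = np.concatenate(parts)
    assert feature.shape == (108,)
    return feature
```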

Features for atom nodes

1.

Atom type: This 4D one-hot vector represents the atom type; only C, N, O, and S atoms were considered in this study.

2.

Accessible surface area: This 1D atom accessible surface area was calculated using NACCESS76.

3.

Atom coordinate: This 3D feature vector contains the original coordinate information for each atom.

The features for the sequence-based embeddings consist of the one-hot residue type, position information, residue properties, and evolutionary information, totaling 80 entries for each of the wild-type and mutant residues.

Graph transformer network

In PIANO, both the student and teacher encoders were configured as message-passing graph transformer networks, given their advantages in capturing long-range dependencies and attending to the most relevant nodes and edges37. For a given feature vector \({X}_{i}\in {{\mathbb{R}}}^{D\times M}\) (D and M represent the dimension of node features and the number of nodes, respectively) of a graph node i, both the node features \(\{{X}_{j}{| \, j}\in N(i)\}\) and edge features \(\{{e}_{i,j}\in {{\mathbb{R}}}^{{D}_{e}\times {N}_{e}}{| \, j}\in N(i)\}\), where \(N(i)\) represents the set of neighboring nodes of node \(i\) determined by the adjacency matrix \(A\in {{\mathbb{R}}}^{M\times M}\) (De and Ne represent the dimension of edge features and the number of edges, respectively), were used to update the node feature representations through a transformer-based attention mechanism. Specifically, the feature embeddings for node i were calculated as:$${X}_{i}^{* }={W}_{1}{X}_{i}$$
(2)
where \({W}_{1}\in {{\mathbb{R}}}^{{D}_{h}\times D}\) is a learnable weight matrix and \({D}_{h}\) is the hidden size of the graph transformer. The attention weights between neighboring nodes and edges were calculated as:$${\alpha }_{i,j}={Softmax}\left(\frac{{\left({W}_{2}{X}_{i}\right)}^{T}\left({W}_{3}{X}_{j}+{W}_{4}{e}_{i,j}\right)}{\sqrt{d}}\right)$$
(3)
where \({W}_{2}\in {{\mathbb{R}}}^{{D}_{h}\times D}\), \({W}_{3}\in {{\mathbb{R}}}^{{D}_{h}\times D}\) and \({W}_{4}\in {{\mathbb{R}}}^{{D}_{h}\times {D}_{e}}\) are learnable weight matrices, \(d\) is the feature dimension, and \({\alpha }_{i,j}\) represents the attention weight coefficient between node i and node j. The embeddings of node i can then be updated as:$${F}_{i}={mean}\left[{\Vert }_{k=1}^{K}\left({X}_{i}^{* }+\sum _{j\in N\left(i\right)}{\alpha }_{i,j}^{k}\left({W}_{5}^{k}{X}_{j}+{W}_{6}^{k}{e}_{i,j}\right)\right)\right]$$
(4)
where K is the number of attention heads, \({\alpha }_{i,j}^{k}\) is the attention weight coefficient calculated by the \({k}^{{th}}\) attention head, and \({W}_{5}^{k}\in {{\mathbb{R}}}^{{D}_{h}\times D}\) and \({W}_{6}^{k}\in {{\mathbb{R}}}^{{D}_{h}\times {D}_{e}}\) are the weight matrices of the corresponding input linear transformations. Overall, for a residue graph with node feature matrix \({X}_{r}\), adjacency matrix \({A}_{r}\), and edge feature matrix \({E}_{r}\), the embeddings Fr can be obtained through the encoder:$${F}_{r}={Enc}\left({X}_{r},{A}_{r},{E}_{r}\right)$$
(5)
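
To make the attention update in (2)–(4) concrete, a single-head, loop-based numpy sketch is given below. The edge features are assumed to be stored in a dictionary keyed by node pairs, and the scaling factor uses the hidden size for d; this is an illustrative simplification rather than the released PIANO implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def graph_transformer_layer(X, E, neighbours, W1, W2, W3, W4, W5, W6):
    """Single-head version of (2)-(4).

    X: (M, D) node features; E[(i, j)]: (D_e,) edge feature of edge i->j;
    neighbours[i]: list of neighbour indices of node i;
    W1, W2, W3, W5: (D_h, D); W4, W6: (D_h, D_e) weight matrices."""
    D_h = W1.shape[0]
    F = np.zeros((X.shape[0], D_h))
    for i, nbrs in enumerate(neighbours):
        x_star = W1 @ X[i]                                   # Eq. (2)
        query = W2 @ X[i]
        keys = np.stack([W3 @ X[j] + W4 @ E[(i, j)] for j in nbrs])
        alpha = softmax(keys @ query / np.sqrt(D_h))         # Eq. (3), d taken as D_h
        msgs = np.stack([W5 @ X[j] + W6 @ E[(i, j)] for j in nbrs])
        F[i] = x_star + alpha @ msgs                         # Eq. (4) with K = 1 head
    return F
```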
GCN with ARMA filters

The decoders in the pre-training module and the encoder for atomic features adopt GCNs with ARMA filters, which are more robust to noise and capture global graph structure more effectively40. An ARMA filter consists of \(N\) parallel stacks, each comprising \(T\) convolutional layers. For a given graph node i, the embedding was given by:$${X}_{i}^{{\prime} }=\frac{1}{N}\mathop{\sum }\limits_{n=1}^{N}\left({X}_{n}^{\left(T\right)}+\sum _{j\in H\left(i\right)}Q{e}_{i,j}{X}_{j}\right)$$
(6)
where \(Q\in {{\mathbb{R}}}^{D\times {D}_{e}}\) is the learnable weight matrix for edge features, \(H(i)\) is the set of neighboring nodes of node i, and \({X}_{j}\) is the feature representation of neighboring node j. \({X}_{n}^{(T)}\) represents the encoded feature representation of node i in the nth stack after the Tth convolutional layer, computed iteratively as:$${X}_{n}^{\left(T+1\right)}=\sigma \left(\hat{L}{X}_{n}^{\left(T\right)}W+{X}^{\left(0\right)}H\right)$$
(7)
where \(\sigma\) is the activation function (typically ReLU), \(W{{\mathbb{\in }}{\mathbb{R}}}^{{D}_{h}\times {D}_{h}}\) and \(H\in {{\mathbb{R}}}^{D\times {D}_{h}}\) are learnable weight matrices updated during training, and \({X}^{(0)}\) is the initial node feature matrix. \(\hat{L} \, {{\mathbb{\in }} \, {\mathbb{R}}}^{M\times M}\) is the modified Laplacian matrix calculated from the degree matrix \(D\in {{\mathbb{R}}}^{M\times M}\) and the adjacency matrix \(A\):$$\hat{L}={D}^{-\frac{1}{2}}A{D}^{-\frac{1}{2}}$$
(8)
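
A minimal numpy sketch of the ARMA recursion in (7)–(8) and the averaging over the N parallel stacks in (6) follows. The edge-feature term of (6) is omitted and the stack initialization shown is an assumption, so this is illustrative rather than the released implementation.

```python
import numpy as np

def arma_gcn(X0, A, W_list, H_list, T=2):
    """Minimal ARMA filter sketch: N parallel stacks of T layers, Eqs. (7)-(8).
    X0: (M, D) initial node features; A: (M, M) adjacency matrix;
    W_list[n]: (D_h, D_h) and H_list[n]: (D, D_h) are the weights of stack n.
    The edge-feature term of Eq. (6) is omitted in this simplified sketch."""
    relu = lambda z: np.maximum(z, 0.0)

    # Eq. (8): symmetrically normalized adjacency ("modified Laplacian").
    deg = np.maximum(A.sum(axis=1), 1e-8)
    d_inv_sqrt = np.diag(deg ** -0.5)
    L_hat = d_inv_sqrt @ A @ d_inv_sqrt

    outputs = []
    for W, H in zip(W_list, H_list):
        skip = X0 @ H                        # X^(0) H skip connection in Eq. (7)
        X = skip                             # stack initialization (assumption)
        for _ in range(T):
            X = relu(L_hat @ X @ W + skip)   # Eq. (7)
        outputs.append(X)

    return np.mean(outputs, axis=0)          # average over the N stacks, Eq. (6)
```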
Self-supervised mask scheme

It has been demonstrated that self-supervised mask learning enables a model to learn more informative representations from unlabeled data, leading to better performance, generalization, and robustness on various downstream tasks32,33. In this study, both mask and de-mask strategies were employed to generate more representative feature embeddings. Specifically, we masked 30% of the nodes in a graph and took the graphs without and with masks as the input of the teacher and student encoders, respectively. Given the unmasked graph structure feature matrix Xt, adjacency matrix At, and edge feature matrix Et, the output feature representations Ft of the teacher network can be calculated as:$${F}_{t}={TeaEnc}\left({X}_{t},{A}_{t},{E}_{t}\right)$$
(9)
The mask operation preserves the connectivity between nodes and their neighbors. For feature initialization, we assigned a learnable vector, denoted \({X}_{{mask}}\in {{\mathbb{R}}}^{1\times D}\), to all masked nodes, replacing their original features. The masked graph structure feature matrix Xm, adjacency matrix Am, and edge feature matrix Em were fed into the student encoder to obtain the feature representations of the masked amino acids:$${F}_{m}={StuEnc}\left({X}_{m},{A}_{m},{E}_{m}\right)$$
(10)
We then introduced the student decoder to reconstruct the masked nodes. Here, a re-masking operation was applied to the feature representations of the masked amino acids Fm before feeding them into the decoder, to alleviate the over-smoothing issue associated with stacking graph neural network layers. This operation re-masks the feature vectors of the selected nodes with another token \({X}_{{mask}}^{{\prime} }\in {{\mathbb{R}}}^{1\times {D}_{h}}\). The resulting re-masked feature representations FDm were then input into the student decoder to obtain the reconstructed feature representations:$${F}_{{stu}}={StuDec}\left({F}_{{Dm}},{A}_{m},{E}_{m}\right)$$
(11)
Similarly, we employed the re-masking strategy for the teacher network. The re-masked feature representations FDt of Ft were fed into the teacher decoder to obtain the feature representations:$${F}_{{tea}}={TeaDec}\left({F}_{{Dt}},{A}_{t},{E}_{t}\right)$$
(12)
Subsequently, we used the mean squared error loss function to calculate the reconstruction loss between the masked node representations in \({F}_{{stu}}\) and their initial features:$${L}_{{ss}}={MSE}\left({\left({F}_{{stu}}\right)}_{i},{X}_{i}\right),i\in {N}_{m}$$
(13)
where Nm is the set of masked nodes.

Self-distillation scheme

The key idea of self-distillation is to improve performance by letting the model learn from its own knowledge30,31. Here, a dual-branch student-teacher network was designed for pre-training. To keep the global consistency of the graph features output by the student and teacher networks, we employed a distillation loss function, inspired by Barlow Twins77, to maximize feature invariance. The loss function was defined as:$${L}_{{cc}}={\sum _{i}\left(1-{C}_{{ii}}\right)}^{2}+\lambda \sum _{i}\sum _{j\ne i}\frac{{C}_{{ij}}^{2}}{M}$$
(14)
where λ is used to adjust the strength of decorrelation. \(C\in {{\mathbb{R}}}^{M\times M}\) is the cross-correlation matrix, which was calculated as:$${C}_{{ij}} \, \triangleq \frac{\sum {F}_{{stu}}^{i}{F}_{{tea}}^{j}}{\sqrt{\sum {\left({F}_{{stu}}^{i}\right)}^{2}}\sqrt{\sum {\left({F}_{{tea}}^{j}\right)}^{2}}}$$
(15)
where i and j are the indices of vectors in Fstu and Ftea, respectively. In summary, by combining self-supervised mask learning and self-distillation learning, we used a loss function that includes both local node feature reconstruction and a global similarity constraint to train the encoder. The overall training objective was:$${L}_{{pre}}={L}_{{ss}}+{L}_{{cc}}$$
(16)
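
The masking step and the two pre-training objectives can be sketched as follows (numpy). The mask ratio follows the text, while the decorrelation weight λ and the treatment of the encoders and decoders as black boxes are illustrative assumptions rather than the trained configuration.

```python
import numpy as np

def mask_nodes(X, mask_token, ratio=0.3, seed=0):
    """Replace the features of a random 30% of nodes with the learnable mask token."""
    rng = np.random.default_rng(seed)
    masked_idx = rng.choice(len(X), size=int(ratio * len(X)), replace=False)
    X_m = X.copy()
    X_m[masked_idx] = mask_token
    return X_m, masked_idx

def reconstruction_loss(F_stu, X, masked_idx):
    """Eq. (13): MSE between reconstructed and original features of masked nodes."""
    diff = F_stu[masked_idx] - X[masked_idx]
    return np.mean(diff ** 2)

def cross_correlation_loss(F_stu, F_tea, lam=0.005):
    """Eqs. (14)-(15): the student/teacher cross-correlation (cosine-similarity)
    matrix is pushed towards the identity. lam is a placeholder value."""
    stu = F_stu / np.linalg.norm(F_stu, axis=1, keepdims=True)
    tea = F_tea / np.linalg.norm(F_tea, axis=1, keepdims=True)
    C = stu @ tea.T                              # Eq. (15), shape (M, M)
    M = C.shape[0]
    on_diag = np.sum((1.0 - np.diag(C)) ** 2)
    off_diag = (np.sum(C ** 2) - np.sum(np.diag(C) ** 2)) / M
    return on_diag + lam * off_diag              # Eq. (14)

# Overall pre-training objective, Eq. (16):
# L_pre = reconstruction_loss(F_stu, X, masked_idx) + cross_correlation_loss(F_stu, F_tea)
```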
The multi-branch network for ΔΔG prediction

The multi-branch network consists of separate encoders for the embeddings of amino acids, atoms, and protein sequences, so that accurate predictions can be obtained from a more comprehensive set of feature representations. Specifically, the embeddings Fr for amino acids were generated by the pre-trained student encoder as defined in (5). The ARMA GCN was used to extract the embeddings for atoms:$${F\!}_{a}={ARMA}\left({X}_{a},{A}_{a},{E}_{a}\right)$$
(17)
where \({X}_{a}\), \({A}_{a}\), and \({E}_{a}\) are the node feature matrix, adjacency matrix, and edge feature matrix extracted from the atom graph, respectively. To further enrich the structural feature representations, we then integrated the residue-level and atom-level representations through a self-attention pooling mechanism. For a given amino acid, the feature vectors of all its corresponding atoms are denoted as \(\left\{{F}_{a1},{F}_{a2},\ldots ,{F}_{{an}}\right\}\). The attention weights were computed as:$${\alpha }_{A}={Softmax}\left(\left[\begin{array}{c}{F}_{a1}\\ {F}_{a2}\\ \vdots \\ {F}_{{an}}\end{array}\right]\times {W}_{A}\right)$$
(18)
where \({W}_{A}\in {{\mathbb{R}}}^{{D}_{a}\times {D}_{a}}\) is a learnable weight matrix, and \({D}_{a}\) represents the feature dimension of atomic nodes. The aggregation was then performed as:$${F}_{a}^{{\prime} }={Sum}\left(\left[\begin{array}{c}{F}_{a1}\\ {F}_{a2}\\ \vdots \\ {F}_{{an}}\end{array}\right]\times {\alpha }_{A}\right)$$
(19)
The structural representations were then concatenated as:$${F}_{{struct}}={Concatenate}\left({F}_{r},{F}_{a}^{{\prime} }\right)$$
(20)
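
The atom-to-residue aggregation in (18)–(19) and the concatenation in (20) amount to a learned softmax weighting over an amino acid's atom embeddings. In the sketch below the softmax is taken over the atom axis, which is an assumption about the axis not stated explicitly in the text.

```python
import numpy as np

def attention_pool_atoms(F_atoms, W_A):
    """Self-attention pooling of an amino acid's atom embeddings, Eqs. (18)-(19).
    F_atoms: (n_atoms, D_a) atom embeddings; W_A: (D_a, D_a) learnable weights.
    The softmax is taken over the atom axis per feature dimension (assumption)."""
    scores = F_atoms @ W_A                                   # Eq. (18), pre-softmax
    scores = scores - scores.max(axis=0, keepdims=True)      # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return (F_atoms * alpha).sum(axis=0)                     # Eq. (19)

def fuse_structure(F_res, F_atoms, W_A):
    """Concatenate residue-level and pooled atom-level embeddings, Eq. (20)."""
    return np.concatenate([F_res, attention_pool_atoms(F_atoms, W_A)])
```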
Furthermore, we employed a CNN encoder to capture the protein sequence information for wild-type and mutant residues, respectively:$${F}_{{wild}-{type}}={CNNEnc}\left({S}_{{wild}-{type}}\right)$$
(21)
$${F}_{{mutant}}={CNNEnc}\left({S}_{{mutant}}\right)$$
(22)
where \({S}_{{wild}-{type}}\in {{\mathbb{R}}}^{{M}_{s}\times {D}_{s}}\) and \({S}_{{mutant}}\in {{\mathbb{R}}}^{{M}_{s}\times {D}_{s}}\) are the initial encodings of the sequence features for the wild-type and mutant sequences, respectively. Ms and Ds represent the number of amino acids in the sequence and the feature dimension, respectively. Tanh was used as the activation function, and global adaptive average pooling was used as the pooling operation. Fwild-type and Fmutant are one-dimensional feature vectors for the wild-type and mutant residues, respectively. The sequence representation was then calculated as:$${F}_{{seq}}={F}_{{wild}-{type}}-{F}_{{mutant}}$$
(23)
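
A sketch of the shared CNN sequence encoder and the subtraction in (23) is shown below (PyTorch). The Tanh activation and global adaptive average pooling follow the text, while the channel and kernel sizes are illustrative placeholders rather than the trained configuration.

```python
import torch
import torch.nn as nn

class SeqCNNEncoder(nn.Module):
    """Shared CNN encoder for wild-type and mutant sequence features, Eqs. (21)-(23).
    Channel and kernel sizes here are illustrative placeholders."""
    def __init__(self, feat_dim=80, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.Tanh(),
            nn.AdaptiveAvgPool1d(1),   # global adaptive average pooling
        )

    def forward(self, S):
        # S: (batch, M_s, D_s); Conv1d expects (batch, channels, length).
        return self.net(S.transpose(1, 2)).squeeze(-1)

# Eq. (23): difference between wild-type and mutant sequence embeddings, e.g.
#   enc = SeqCNNEncoder()
#   F_seq = enc(S_wild_type) - enc(S_mutant)
```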
The structural and sequence representations were then concatenated to ensure that a comprehensive set of information could be considered, and the concatenated representations were fed into a fully connected layer for the prediction of ΔΔG:$$\varDelta \varDelta G={FC}\left({Concatenate}\left({F}_{{struct}}^{{\prime} },{F}_{{seq}}\right)\right)$$
(24)
where \({F}_{{struct}}^{{\prime} }\) represents the one-dimensional feature vector corresponding to the wild-type residue in Fstruct.

Evaluation metrics

The performance of PIANO was evaluated using two metrics commonly used in this field: PCC and RMSE. PCC measures the linear correlation between the predicted and actual values and is defined as:$${PCC}=\frac{{\sum }_{i=1}^{n}\left({x}_{i}-\bar{x}\right)\left({y}_{i}-\bar{y}\right)}{\sqrt{{\sum }_{i=1}^{n}{\left({x}_{i}-\bar{x}\right)}^{2}{\sum }_{i=1}^{n}{\left({y}_{i}-\bar{y}\right)}^{2}}}$$
(25)
where \({x}_{i}\) and \({y}_{i}\) represent the predicted and true values for the ith sample, respectively, \(\bar{x}\) and \(\bar{y}\) are the means of the predicted and true values, respectively, and n is the total number of samples. PCC ranges from −1 to 1, where −1, 0, and 1 indicate a complete negative linear correlation, no correlation, and a complete positive correlation, respectively. RMSE quantifies the average deviation between the predicted and actual values, with a lower value representing higher prediction performance. It is defined as:$${RMSE}=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({y}_{i}-{\widetilde{y}}_{i}\right)}^{2}}$$
(26)
where \({\widetilde{y}}_{i}\) and \({y}_{i}\) represent the predicted and true values, respectively, and n is the total number of samples.

OR calculation

The OR is a statistical measure used to assess the degree of association between two events, particularly in case-control studies. The OR indicates the relative relationship between the odds of event occurrence in two different groups. An OR greater than 1 indicates that the event is more likely to occur in the first group, while an OR less than 1 suggests that the event is more likely to occur in the second group. The OR is calculated as:$${OR}=\frac{{ad}}{{bc}}$$
(27)
where ‘a’ and ‘c’ represent the numbers of occurrences of the event in the first and second groups, respectively, and ‘b’ and ‘d’ represent the numbers of non-occurrences of the event in the first and second groups, respectively. We calculated 95% confidence intervals from the standard error:$$s.e.=\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$$
(28)
The upper and lower bounds of the 95% confidence intervals were calculated as follows, respectively:$${{CI}}_{95 \% }^{{upper}}={e}^{{\mathrm{ln}}\left({OR}\right)+1.96\times s.e.}$$
(29)
$${{CI}}_{95 \% }^{{lower}}={e}^{{\mathrm{ln}}\left({OR}\right)-1.96\times s.e.}$$
(30)
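
The evaluation metrics and the OR with its 95% confidence interval, (25)–(30), can be computed as follows. This is a straightforward reference sketch; variable names are ours rather than those of the released PIANO code.

```python
import numpy as np

def pcc(pred, true):
    """Pearson correlation coefficient, Eq. (25)."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    px, py = pred - pred.mean(), true - true.mean()
    return float((px * py).sum() / np.sqrt((px ** 2).sum() * (py ** 2).sum()))

def rmse(pred, true):
    """Root-mean-square error, Eq. (26)."""
    return float(np.sqrt(np.mean((np.asarray(true) - np.asarray(pred)) ** 2)))

def odds_ratio_ci(a, b, c, d):
    """Odds ratio with 95% confidence interval, Eqs. (27)-(30).
    a/b: occurrences/non-occurrences in group 1; c/d: the same for group 2."""
    or_ = (a * d) / (b * c)
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = np.exp(np.log(or_) - 1.96 * se)
    upper = np.exp(np.log(or_) + 1.96 * se)
    return or_, lower, upper
```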
Statistics and reproducibility

The distribution of ΔΔG scores was analyzed using the two-sided Mann–Whitney U test; a p-value less than 0.05 was considered statistically significant. The OR analysis was conducted at the 95% confidence level. For de novo missense mutations, the OR calculations comparing different methods were performed at the 50th, 70th, and 80th percentiles, respectively. All data are available in Supplementary Data 1–5. The ΔΔG scores can be predicted with the released PIANO package. Statistical analyses were conducted using Python, with more detailed information regarding sample sizes, statistical tests, and p-values provided in the “Results” section.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
