PCP-GC-LM: single-sequence-based protein contact prediction using dual graph convolutional neural network and convolutional neural network | BMC Bioinformatics

Datasets

Because our method takes protein sequences as input, it is reasonable to train and compare all methods on a protein database without homologous information. The SPOT-Contact-LM work selected the ProteinNet [30] dataset for training, using a sequence identity cutoff of 95% to minimize redundancy and obtain sequences with maximum diversity. To reduce possible over-fitting, its authors separated 100 proteins from the ProteinNet set and compared them with the HMMs of the other proteins [31]. After filtering with an E-value cut-off of 0.1 and a length cut-off of 500 residues, the final training and validation sets contain 34,691 and 88 proteins, respectively. To better assess prediction on longer sequences, we removed the proteins shorter than 100 residues from the validation set and evaluated on the resulting Validation-100 set. For testing, the SPOT-Contact-LM work also put forward the strict SPOT-2018 test set, which has 669 proteins. In addition, we used other independent test sets, namely the CASP14-FM and CASP15 target proteins.

Although homodimeric proteins were not used to train the proposed method, two further independent test sets (the DeepHomo test set and the CASP-CAPRI test set) were used to evaluate how well PCP-GC-LM predicts contacts in protein complexes. The DeepHomo test set includes only C2-symmetric homodimeric complex structures that have less than 30% sequence identity. The CASP-CAPRI dataset includes 28 homodimers from the targets of the recent CASP-CAPRI competition.

Performance evaluation

Our research aims to predict which residue pairs in a protein are in contact. Following the Critical Assessment of Structure Prediction (CASP) [32] definition, two residues are in contact when the distance between them is less than 8 Å. Contacts are divided into three types by sequence separation: short-range (7–11 residues apart), medium-range (12–23 residues apart), and long-range (at least 24 residues apart).
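The evaluation setup can be illustrated with a minimal sketch: a binary contact map derived from residue coordinates at the 8 Å threshold, the three sequence-separation classes, and the precision over the top L/k ranked pairs that the results report (L is the sequence length). The function names are hypothetical, the choice of one representative coordinate per residue is an assumption (the text does not say which atom is used), and taking `L // k` with a minimum of one pair is likewise an assumption rather than the authors' evaluation code.

```python
import math

CONTACT_THRESHOLD = 8.0  # Å, from the CASP definition quoted in the text


def contact_map(coords, threshold=CONTACT_THRESHOLD):
    """Binary contact map from one 3D coordinate per residue (assumed input)."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


def range_class(i, j):
    """Contact type by sequence separation, or None if closer than 7 residues."""
    sep = abs(i - j)
    if sep >= 24:
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 7:
        return "short"
    return None


def top_l_over_k_precision(probs, truth, k, kind):
    """Precision of the L // k highest-scoring pairs (i < j) of one contact type."""
    length = len(probs)
    ranked = sorted(
        ((probs[i][j], i, j)
         for i in range(length)
         for j in range(i + 1, length)
         if range_class(i, j) == kind),
        key=lambda item: item[0],
        reverse=True,
    )
    top = ranked[:max(1, length // k)]  # assumption: evaluate at least one pair
    if not top:
        return 0.0  # no pairs of this type exist in a very short sequence
    return sum(truth[i][j] for _, i, j in top) / len(top)
```

Only pairs with i < j are ranked, so each symmetric pair is counted once.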
For each contact type, we calculate the precision of the top L/k highest-ranked predictions, where L is the length of the protein sequence and k is usually 1, 2, 5, or 10.

Method comparison

We compare our protein contact prediction model with existing models that are based on single-sequence protein data. One of these is ESM-1b, a language model trained on a large-scale protein dataset that provides sequence embeddings for many downstream tasks, including protein contact prediction. Because few single-sequence contact predictors exist, we also compare with ESM-1b on some datasets, although the comparison may not be entirely fair. Another comparison model is SSCpred, a single-sequence-based contact predictor that uses a deep fully convolutional network (Deep FCN) without additional homology information. We used the offline version of the SSCpred code, which can be downloaded from https://github.com/chenmc1996/SSCPred. In addition, we compare our network with SPOT-Contact-LM and its sub-models. SPOT-Contact-LM is a combined network that obtains different prediction results through different inputs and training strategies and then averages them. We compare our network with the individual sub-models of SPOT-Contact-LM, especially those trained with the direct inter-residue contact prediction strategy. Based on the size of the input feature and the training strategy, the sub-models are named SPOT-Contact-1 to SPOT-Contact-6.
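The averaging step mentioned for SPOT-Contact-LM can be pictured with a minimal sketch. It assumes a plain unweighted element-wise mean over the sub-model probability maps; the text only says that the predictions are averaged, so the weighting and the function name `average_maps` are assumptions, not the published implementation.

```python
def average_maps(maps):
    """Element-wise unweighted mean of equally sized L x L probability maps."""
    if not maps:
        raise ValueError("at least one prediction map is required")
    n = len(maps[0])
    return [
        [sum(m[i][j] for m in maps) / len(maps) for j in range(n)]
        for i in range(n)
    ]
```

For example, two sub-model maps that score the same residue pair 0.2 and 0.6 give an ensemble score of 0.4 for that pair.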
To provide a comprehensive comparison of the performance and parameters of the different models, we compare our model with the others on the Validation-100 set and the other test datasets. Furthermore, we compare our model with DeepHomo and PGT on the DeepHomo and CASP-CAPRI test sets. Through these comparisons, we aim to demonstrate the effectiveness and accuracy of our protein contact prediction model.

Performance comparison

Feature and dual-graph importance

To understand the influence of different input features and of the two branches of the dual graph, we trained several model variants and compared their performance on the test datasets. Table 1 outlines these variants. As shown in Fig. 6, the models trained using one-hot encoding alone had a prediction accuracy of 13.8% and 12.2% for the top L/2 predictions in medium- and long-range contacts, respectively. When we added the ESM-1b representation to the one-hot encoding, we observed a significant improvement of 58.5% and 74.5%, respectively. We also examined the effects of the individual graphs in the graph encoder module. In the protein-graph-only (PG) variant, the edge matrix is generated solely by an MLP from the edge feature matrix, without passing through the edge update layers. The complete-graph-only (CG) variant, on the other hand, does not generate the edge matrix. As shown in Fig. 6, CG performs better than PG in most contact types. As a result, we chose the one-hot encoding and the ESM-1b representation as inputs and adopted the dual-graph updating module as our best model.

Table 1 Input feature vector composition from ablation experiments

Fig. 6 The mean prediction accuracy on the SPOT-2018 set for short-, medium- and long-range contacts

Result comparisons

First, we evaluated the performance of our model on the validation set and compared it with other existing methods.
Our model outperformed ESM-1b, SPOT-Contact-LM, and the SPOT-Contact-LM sub-models across the three contact types and top L/k (k = 1, 2, 5, 10). Figure 7 shows a detailed comparison on the validation set. We also analyzed the effect of the different sub-models of SPOT-Contact-LM and found that performance improved as input features were added. For long-range contacts at top L/1, the mean prediction accuracies of SPOT-Contact-1 to SPOT-Contact-6 were 0.423, 0.448, 0.446, 0.433, 0.438, and 0.45, respectively, while the mean prediction accuracy of our method was 0.478.

We also tested our model on the SPOT-2018 test set and compared it with other methods. Table 2 shows that our method performs much better than ESM-1b and the SPOT-Contact-LM sub-models in long-range contact precision for length cut-offs of L/1, L/2, L/5, and L/10. For long-range contacts at L/1, PCP-GC-LM was 11.51%, 4.68%, 0.21%, 10.34%, 6.8%, and 2.92% higher than SPOT-Contact-1 to SPOT-Contact-6, respectively. Our method also improved the performance measures for the other contact types compared with these sub-models. However, there is still room for improvement when compared with the final result of SPOT-Contact-LM. We believe that further research can improve the accuracy of our model in predicting protein contact maps.

Fig. 7 Comparison of our method and other methods on the Validation-100 set for short-, medium- and long-range contacts

Table 2 Comparison of our method, the sub-models of SPOT-Contact-LM, and ESM-1b on the SPOT-2018 set for medium- and long-range contacts

During the 14th edition of the Critical Assessment of Protein Structure Prediction (CASP14), several free modeling targets were released, and various methods were employed to predict inter-residue full-length contacts. Among these methods, the simple sub-model from SPOT-Contact-LM achieved a contact precision of 0.1194 at top L/2, while the full SPOT-Contact-LM achieved 0.154.
However, our method outperformed these methods by achieving a mean precision of 0.16. On the latest protein targets of CASP15, the average prediction accuracies of SPOT-Contact-1 to SPOT-Contact-6 for long-range contacts at L/10 were 55.49%, 62.1%, 62.47%, 61.05%, 63.35%, and 63.04%, respectively. The final SPOT-Contact-LM model achieved a precision of 63.81%. Our method outperformed all of these by achieving a precision of 64.74%. These results indicate that our method is a reliable and effective approach for predicting inter-residue full-length contacts in protein structure prediction. The details of the precision of the different methods are shown in Figs. 8 and 9.

Fig. 8 Precision-based comparison of SPOT-Contact-LM and our method on the CASP14-FM set for long-range contacts

Fig. 9 Precision-based comparison of SPOT-Contact-LM and our method on the CASP15 target set for long-range contacts

The SPOT-Contact-LM sub-models are designed with two strategies: direct inter-residue contact prediction and inter-residue distance bin prediction. To evaluate the effectiveness of these strategies, we compared them with our own method on the validation data. Our findings suggest that the direct inter-residue contact prediction strategy performs marginally better than the inter-residue distance bin prediction strategy. For long-range predictions, however, our method outperformed the direct contact prediction strategy. The difference in performance between the two strategies was smaller for medium-range predictions. Figure 10 shows the precision of the two training strategies and of our method for medium- and long-range predictions. Overall, our results suggest that our method is more effective than the direct contact prediction strategy, particularly for long-range predictions.

Fig. 10 Precision comparison of the two training strategies for medium-range and long-range contacts on the Validation-100 set

Furthermore, Table 3 compares the number of parameters and the running time of our model with those of two other models, SSCpred and SPOT-Contact-LM. Our model has fewer parameters and requires less running time than the other two models, which makes it more efficient and computationally less expensive. This is a significant advantage, especially for large-scale protein structure prediction tasks. Our experiments were run on the Ubuntu operating system with the PyTorch deep learning framework. The central processing unit (CPU) was an 11th Gen Intel(R) Core(TM) i9-11900K, and the graphics processing unit (GPU) was an NVIDIA GeForce RTX 3090 Ti.

Table 3 Comparison of the parameters and running time between our model and the other two models

Performance in complex protein

Proteins are essential biomolecules that play a crucial role in various biological processes. They carry out their biological functions by interacting with other biomacromolecules, such as DNA, RNA, and other proteins. In particular, protein-protein interactions are critical for the formation and function of protein complexes, which are involved in many cellular processes, including signal transduction, gene regulation, and metabolic pathways. However, predicting the three-dimensional structure of protein-protein complexes remains a major challenge in structural biology, because the structure of a complex is determined by the interactions between its individual subunits, which can be highly complex and dynamic.
Therefore, because residue-residue interactions between the individual subunits are so important, some work has focused on inter-chain contact prediction to help predict the structure of protein complexes.

The results presented in Table 4 demonstrate that our model is highly effective in predicting contacts in protein complexes. Specifically, on the DeepHomo test dataset, our model achieved a contact prediction accuracy of 78.78 in the top/10, which is significantly higher than PGT's accuracy of 67.33. Our model also outperformed other methods on the CASP-CAPRI dataset, as shown in Table 5.

Table 4 Precision on the DeepHomo test dataset (300)

Table 5 Precision on the CASP-CAPRI datasets

Performance analysis of edges matrix

To examine the interaction matrices generated by the intermediate components, we plotted heatmaps of the real contacts of two example proteins, 2XZ4 and 2GGE, and compared them with the edge matrices generated by PCP-GC-LM. The results for these example proteins are shown in Figs. 11 and 12, respectively; each figure consists of three images.

Fig. 11 Comparison of the outputs for the 2XZ4 protein by PCP-GC-LM as labeled. Of the three columns, the left is the edge matrix, the middle is the final output from PCP-GC-LM, and the right is the true contact map of the example protein 2XZ4

Fig. 12 Comparison of the outputs for the 2GGE protein by PCP-GC-LM as labeled. Of the three columns, the left is the edge matrix, the middle is the final output from PCP-GC-LM, and the right is the true contact map of the example protein 2GGE

The left image shows the edge (connection) matrix obtained by our model, and the heatmap quantifies the interaction between residue pairs at different positions. The darker the color, the higher the value, indicating a higher likelihood of interaction between the corresponding positions in the protein sequence. The middle image shows the final protein contact prediction matrix from our model, and the image on the right shows the real protein contact matrix. In these two heatmaps, the closer the color is to white, the higher the probability of contact, whereas a color closer to blue indicates a lower probability of contact.

Performance in protein distance prediction

Based on this model, we extended the architecture to predict inter-residue distances. Unlike contact prediction, which estimates the probability of contact between residues and can be approached as a regression or binary classification problem, distance prediction categorizes distances into multiple bins, making it a multi-class classification task. The overall structure of our distance prediction model is similar to that of the contact prediction model. The differences are that the SoftMax function replaces the LeakyReLU function, because distance prediction involves multiple classes, and that the depth of the model is increased. We selected four target protein domains (6G7G_A, 6BTM_D, 6AHQ_L, and 6BTC_B) for distance prediction. Figure 13 shows the comparison, where each pair shows the actual distance on the left and the predicted distance on the right. The distance range is limited to [0, 20]. The distance map reveals that the distance prediction matrix offers more spatial information than the contact matrix.
The whiter the color in Fig. 13, the farther apart the residues are, while a shift towards green and blue indicates that the residues are closer together.

Fig. 13 Comparison of the outputs for these proteins by PCP-GC-LM as labeled. In each pair, the left side is the real protein distance distribution and the right side is the distance distribution predicted by our distance model
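The multi-class formulation of distance prediction can be sketched as follows. The number of bins (10) and their equal width over the [0, 20] range are assumptions chosen for illustration, since the text does not give the bin layout. The function names are hypothetical, and assigning larger distances to the last bin is one possible convention, not necessarily the authors' choice.

```python
import math


def distance_to_bin(distance, n_bins=10, d_max=20.0):
    """Class index of a distance; n_bins and equal widths are assumed values."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    width = d_max / n_bins
    # Distances at or beyond d_max are clipped into the last bin.
    return min(int(distance // width), n_bins - 1)


def softmax(logits):
    """Numerically stable softmax over the per-pair bin scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_bin(logits):
    """Most probable distance bin for one residue pair."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

Under these assumptions each residue pair gets a probability for every bin rather than a single contact probability, which is why a distance map carries more spatial information than a binary contact map.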
