Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein

Architecture of ESM-DBP

As shown in Fig. 1c, the architecture of ESM-DBP consists of three main steps.

Fig. 1: ESM-DBP is a DBP field-specific PLM obtained by domain-adaptive pretraining of the general protein language model ESM2 on a DBP dataset. a Data distribution of the original ESM2 and of DBP domain-specific knowledge; b The proportion of DBPs in the original pretraining dataset of ESM2, i.e., UniRef50; c The overall architecture of ESM-DBP. The protein structures are drawn using pymol81.

Step 1. Data preparation: a pretraining dataset named UniDBP40, containing 170,264 non-redundant DBP sequences, is constructed from all DNA-binding sequences deposited in the UniProtKB database up to 2 August 2023, using the CD-HIT tool with a clustering threshold of 0.4.

Step 2. Domain-adaptive pretraining: the original ESM2 model with 650 million parameters is trained on UniDBP40 by self-supervised learning. Only the parameters of the last four transformer blocks and of the logistic layer for classification are updated.

Step 3. Fine-tuning on downstream tasks: the biological feature representation of the well-trained model, which carries important DBP knowledge, is applied to four downstream DBP-related prediction tasks, i.e., DBP, DBS, TF, and DBZF prediction, using lightweight bidirectional long short-term memory (BiLSTM) and multi-layer perceptron (MLP) networks.

The universal PLM ESM2 was built by masked pretraining on approximately 65 million non-redundant protein sequences, culminating in accurate prediction of protein tertiary structure at the atomic level. A well-trained ESM2 model likewise holds promise for protein function prediction, owing to the conservation and specificity of protein functions with respect to both sequence and structure. Nonetheless, given the intricate and multifaceted nature of protein function, focusing on a particular function such as DNA binding poses a challenge: the extensive training set inundates the model with superfluous data and distracts it from function-specific nuances. Illustratively, Fig. 1b underscores the scant representation of DBPs within the pretraining dataset of ESM2, i.e., UniRef50, constituting merely 0.0069%. Here, our objective is to sharpen the sensitivity of the PLM to DBP-associated knowledge by further training the ESM2 model on abundant DBP sequences. This targeted approach aims to refine DBP sequence characterization and thereby improve performance on the downstream prediction tasks. As depicted in Fig. 1a, the original PLM encompasses a diverse array of protein attributes, including structural features and various functions. We steer the focus of the model towards the regions overlapping the DBP domain, engendering a deeper comprehension of DBP-specific knowledge and, consequently, better accuracy on the pertinent tasks. Statistically, of the 170,264 DBPs in UniDBP40 used for pretraining ESM-DBP, 118,912 (69.84%) intersect with the original UniRef50 dataset, and the remaining 51,352 (30.16%) belong to UniDBP40 alone. This underscores that ESM-DBP not only reinforces the representation of DBP knowledge already ingrained in ESM2 but also assimilates crucial discriminatory insights from previously unseen DBP sequences.

The ESM2 model is released in six versions, with parameter counts of 8 million, 35 million, 150 million, 650 million, 3 billion, and 15 billion. Training the 3 billion and 15 billion versions took 30 and 60 days, respectively, on 512 NVIDIA V100 GPUs. Constrained by computational resources, most researchers cannot afford to train networks of this magnitude, so the ESM2 model with 650 million parameters is used for continued training here. In addition, our previous study on nucleic acid-binding residue prediction showed that a model with more parameters is not necessarily better than one with relatively few parameters26. We follow the standard masked pretraining approach of ESM2, that is, randomly masking 15% of the residues in the input sequence and then predicting them. ESM2 consists of 33 stacked transformer encoder blocks. Given that the training dataset of ESM2 is much larger than the dataset of this study, to prevent overfitting we freeze the parameters of the first 29 blocks and update the weights of only the last four transformer encoder blocks, as well as the logistic layer for amino acid classification. The first 29 blocks retain the important biological knowledge learned by the original ESM2 from massive numbers of protein primary sequences, while the last four blocks learn DBP field-specific information through domain-adaptive pretraining. During training, to handle the unequal lengths of protein sequences, sequences with more than 512 amino acids are cut into sub-sequences, and sequences with fewer than 512 amino acids are padded with the "<pad>" token to a length of 512. The batch size is set to 100. The loss function and optimizer are consistent with ESM2. The model is trained for approximately 51k steps, taking about 3 days on four Tesla V100 GPUs with 16 GB of memory each.
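As a concrete illustration, the following is a minimal sketch of this setup, assuming the public fair-esm package and PyTorch; attribute names such as model.layers and model.lm_head follow the fair-esm ESM2 implementation, and the learning rate is illustrative rather than the value used in the paper.

```python
# Minimal sketch of the domain-adaptive pretraining setup, assuming fair-esm;
# attribute names (model.layers, model.lm_head) follow its ESM2 implementation,
# and the learning rate is illustrative.
import torch
import torch.nn.functional as F
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()  # 33 transformer blocks

# Freeze everything, then unfreeze the last four blocks plus the language-model
# head used for masked amino acid classification.
for p in model.parameters():
    p.requires_grad = False
for block in model.layers[-4:]:
    for p in block.parameters():
        p.requires_grad = True
for p in model.lm_head.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)

def mask_tokens(tokens: torch.Tensor, mask_idx: int, frac: float = 0.15):
    """Randomly mask 15% of positions; unmasked positions are ignored in the
    loss. A full implementation would also exclude special tokens from masking."""
    labels = tokens.clone()
    chosen = torch.rand(tokens.shape) < frac
    masked = tokens.clone()
    masked[chosen] = mask_idx
    labels[~chosen] = -100
    return masked, labels

# One training step on a toy batch (in the paper, batches of 100 sequences
# padded or cropped to length 512).
batch_converter = alphabet.get_batch_converter()
_, _, tokens = batch_converter([("p1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])
masked, labels = mask_tokens(tokens, alphabet.mask_idx)
logits = model(masked)["logits"]                     # (B, L, vocab)
loss = F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```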
Role of domain-adaptive pretraining on performance

To observe the impact of domain-adaptive pretraining on prediction performance, we compare the results of the original ESM2 model and of ESM-DBP on the four downstream tasks. In particular, the feature representations of the two models, of size \(L\times 1280\), are first extracted as input to the neural networks of the four prediction tasks separately, where \(L\) denotes the number of amino acids in the protein sequence; then, the prediction results for DBS, DBP, and TF are obtained by independent validation, and the prediction results for DBZF are obtained by 10-fold cross validation, since the DBZF dataset is not divided into training and test sets.
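A minimal sketch of this feature extraction, assuming fair-esm, is given below; the final (33rd) transformer layer supplies the per-residue embeddings, and the mean pooling shown at the end is the protein-level reduction used for the DBP and TF tasks later in this section.

```python
# Sketch: extract the L x 1280 per-residue embedding from the final (33rd)
# transformer layer, assuming fair-esm; the sequence is a toy example.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
_, _, tokens = batch_converter([("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])

model.eval()
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
embedding = out["representations"][33][0, 1:-1]   # drop BOS/EOS -> (L, 1280)
protein_vector = embedding.mean(dim=0)            # 1280-d protein-level feature
```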
Table 1 reports seven detailed evaluation indexes for the two models. ESM-DBP holds an overwhelming advantage regardless of the prediction task. Taking DBP prediction as an example, the Sen, Spe, Acc, Pre, F1, MCC, and AUC values of ESM-DBP are 92.12, 93.43, 93.35, 92.78, 0.927, 0.856, and 0.964, which are 2.62, 3.49, 3.44, 3.06, 3.00, 6.86, and 1.15% higher than those of ESM2, respectively. Moreover, the MCC values of ESM-DBP on the DBS, TF, and DBZF predictions are 0.430, 0.924, and 0.349, which are 11.68, 4.17, and 46.02% higher than those of the original ESM2, respectively.

Table 1 Performance comparison of original ESM2 and ESM-DBP on four DBP-related prediction tasks

Figure 2a shows the ROC curves of ESM-DBP and ESM2 on the four prediction tasks. ESM-DBP attains higher AUC values than ESM2 on all four tasks, especially on DBS prediction, where the former improves on the latter by 5.01%. From Fig. 2b, ESM-DBP yields better predictions than ESM2 on most samples. Taking DBP prediction as an example, out of 381 DBPs (or non-DBPs), there are 308 (or 319) cases in which ESM-DBP assigns a higher (or lower) prediction probability than ESM2. This means that for about 80% of the DBPs, ESM-DBP is the more confident model. Meanwhile, to observe the overall distribution of prediction results for ESM-DBP and ESM2, we draw violin plots of the predicted probabilities of positive and negative samples for each of the four tasks (see Fig. 2c). The distribution of the positive samples shows that the overall prediction scores for TF and DBP are significantly higher than those for the DBS and DBZF predictions. The main reason is that the specific discriminative patterns of DBP and TF are more easily captured from primary sequences by neural networks than those of DBS and DBZF. We also note that the lower end of the DBS positive samples has a higher density, suggesting that the model cannot make confident judgments on these residues. Looking more closely, ESM-DBP is less dense in this region than ESM2, suggesting that the PLM after domain-adaptive pretraining mines some useful hidden patterns of DNA-binding residues from the large number of DBP sequences. Inspecting the internal box plots of the two models on the DBP predictions, the box of ESM2 on the positive (or negative) samples is clearly lower (or higher) than that of ESM-DBP, showing that ESM-DBP discriminates better than the original ESM2 on both DBPs and non-DBPs. Figure 2d visualizes the feature representations of ESM2 and ESM-DBP on the TF prediction dataset before supervised learning, using the T-SNE algorithm. By comparison, ESM-DBP separates most TFs much more clearly than the original ESM2 on both the test set and the training set. Only a small percentage of TFs are confused with non-TFs, and most cases can be distinguished by eye in the ESM-DBP projection. In contrast, the projection of ESM2 looks worse: many positive and negative samples are so entangled that they are difficult to tell apart, especially in the training set. This difference gives us reason to believe that, after domain-adaptive pretraining, the PLM spontaneously senses discriminatory information specific to transcription factor function in the large number of DNA-binding protein sequences, and can thus represent TFs and non-TFs differently without being guided by labels. In conclusion, after retraining on DBP domain-specific data, the PLM mines useful knowledge related to DBPs, which enhances its ability to characterize them and thus significantly improves downstream prediction performance.
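A Fig. 2d-style projection can be sketched as follows, assuming scikit-learn and matplotlib; the embeddings and labels here are random placeholders standing in for the mean-pooled ESM-DBP features of the TF dataset.

```python
# Sketch of a Fig. 2d-style T-SNE projection; X and y are random placeholders
# for the mean-pooled 1280-d embeddings and TF / non-TF labels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1280))   # placeholder embeddings
y = rng.integers(0, 2, 200)            # placeholder labels

coords = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=y, s=8)
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.show()
```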
Fig. 2: ESM-DBP provides better discrimination than the original ESM2. a ROC curves of ESM-DBP and ESM2 on four prediction tasks; b Scatterplots of ESM-DBP and ESM2 on four prediction tasks. Each point represents a sample, and the numbers indicate how many positive or negative samples are located on the diagonal or in the upper or lower triangle. The axes for DBS represent the MCC value of each protein, and the axes of the other three panels indicate the probability of a sample being predicted as positive; c Distribution of the probability of each sample in the benchmark test sets being predicted as positive; the left panel shows positive samples and the right panel negative samples. The numbers of positive (and negative) samples for the DBP, DBS, TF, and DBZF tasks are n = 381 (and 381), 734 (and 14,021), 106 (and 106), and 834 (and 789), respectively. Centre means median, bounds of box indicate 25th and 75th percentiles, and whiskers indicate minima and maxima; d T-SNE82 visualization of the feature representations of ESM-DBP and ESM2 on the TF dataset before supervised learning; e Role of the number of updated transformer blocks in the prediction performance on the training datasets via 10-fold cross validation. Each transformer block of ESM2 contains about 20 million parameters. Source data are provided as a Source Data file.

Determining the number of parameters to be updated

To achieve highly accurate protein 3D structure prediction, the original ESM2 was trained on ∼65 million protein sequences, almost 382 times the size of the training set of this study. To fit such a large amount of data, the largest ESM2 model has a parameter count of 15 billion. Although the ESM2 model used in this study is the 650 M parameter version, given the limited number of sequences in our training set it would still be inappropriate to update all of its parameters directly. On the one hand, the well-trained PLM contains important knowledge of the underlying biology of proteins learned from a huge number of primary sequences; this critically important biological information would be corrupted if all parameters were updated, degrading the sequence characterization. On the other hand, using such a huge neural network to fit a relatively small amount of training data would undoubtedly lead to overfitting and thus to learning redundant noise, which would also worsen prediction performance. In this study, to exploit both the underlying biological feature attributes already learned by the original ESM2 and the specific knowledge in the DBP sequences, we freeze most of the transformer blocks in ESM2, leaving only a small number of updatable parameters. However, determining the proper number of updatable parameters for the available training data is not easy, especially considering the huge parameter count of the original ESM2 model and the complex non-linear relationship between protein sequences and DBP functions.

The number of updated transformer blocks in the final ESM-DBP model is optimized through 10-fold cross validation on the training sets of the four downstream tasks. We incrementally increase the number of updated blocks with a step of 1 for the first six layers; beyond the sixth layer, we employ a growing step to gradually expand the number of updated blocks. From Fig. 2e, it is evident that performance deteriorates once most of the transformer blocks are updated, irrespective of the task. In particular, updating all layers of ESM2 yields the poorest performance, reinforcing that the fundamental biological knowledge encoded by the original ESM2 model, derived from approximately 65 million protein sequences, is destroyed. Optimal performance is observed when four blocks are updated, though the performance gain is modest. Therefore, in this study, the number of updated transformer blocks in the ESM-DBP model is set to four. Given that each transformer block in ESM2 contains about 20 million parameters, the total number of parameters updated in ESM-DBP is approximately 80 million.
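The search behind Fig. 2e can be sketched as scaffolding only; the pretraining-plus-evaluation routine below is a hypothetical stub, and the candidate list merely illustrates the "step of 1 up to six layers, then a growing step" schedule described above.

```python
# Scaffolding sketch of the block-count search behind Fig. 2e; pretraining and
# 10-fold CV are stubbed out (hypothetical), and the schedule is illustrative.
def pretrain_and_cross_validate(n_blocks: int) -> float:
    """Hypothetical stub: domain-adaptive pretraining with the top n_blocks
    transformer blocks unfrozen, then the mean 10-fold CV score downstream."""
    return 0.0

candidates = [1, 2, 3, 4, 5, 6, 9, 13, 20, 33]  # step of 1, then growing steps
scores = {k: pretrain_and_cross_validate(k) for k in candidates}
best_k = max(scores, key=scores.get)            # four blocks in the paper
```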
Performance comparison of different algorithms on the four downstream tasks

Choosing appropriate algorithms as the classifiers for the four downstream tasks is crucial for achieving optimal prediction performance. Given the different natures of the four tasks, we design two classifiers tailored to their characteristics and validate this choice through experimental comparison. Concretely, DBP and TF prediction are protein-level tasks: each protein is one sample with a single label. The original sequence embedding of the ESM-DBP language model is a matrix of size \(L\times 1280\), while the supervised training sets for these two tasks contain only 9000 and 829 sequences, respectively. To mitigate overfitting to such a large number of features, we average over the first dimension of the embedding matrix to obtain a vector of length 1280 as the input feature of each protein. Notably, this feature size is not well matched to large-scale neural architectures such as CNNs, RNNs, and transformers, which typically expect long sequences or large matrices as input; we therefore opt for lightweight and straightforward MLP models as the classifiers. For residue- and local sequence-level tasks such as DBS and DBZF, the model must consider the context-dependent features of residues. Both BiLSTM and self-attention-based transformers are well suited to mining such discriminative information from sequence feature matrices; however, the latter is more appropriate for large data volumes, whereas our DBS and DBZF training sets contain only 543 and 157 proteins, respectively, and thus BiLSTM is adopted as the classifier here.
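The two classifier families can be sketched as follows, assuming PyTorch; the hidden sizes are illustrative rather than the paper's exact hyperparameters.

```python
# Sketches of the two downstream classifier families, assuming PyTorch; hidden
# sizes are illustrative, not the paper's exact hyperparameters.
import torch.nn as nn

class ProteinMLP(nn.Module):
    """Protein-level head (DBP/TF): mean-pooled 1280-d embedding -> probability."""
    def __init__(self, dim: int = 1280, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):                # x: (B, 1280)
        return self.net(x).squeeze(-1)   # (B,)

class ResidueBiLSTM(nn.Module):
    """Residue-level head (DBS/DBZF): per-residue embeddings -> per-residue scores."""
    def __init__(self, dim: int = 1280, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, x):                # x: (B, L, 1280)
        h, _ = self.rnn(x)               # (B, L, 2 * hidden)
        return self.out(h).squeeze(-1)   # (B, L)
```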
To be more convincing, we perform 10-fold cross validation with different algorithms on the training set of each of the four downstream tasks. In particular, for the DBP and TF predictions, six diverse algorithms, namely Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gaussian Naive Bayes (NB), Gradient Boosting Decision Tree (GBDT), and Decision Tree-based Adaptive Boosting (AdaBoost), are used for comparison. For the DBS and DBZF predictions, a self-attention network, a Bidirectional Recurrent Neural Network (BiRNN), and a Bidirectional Gated Recurrent Unit (BiGRU) serve as the comparison algorithms. The former are widely used classical machine learning algorithms, trained with the default parameters in sklearn; the latter are deep learning networks currently popular in the NLP field. To ensure fairness, all hyperparameters are kept consistent with the default implementation of ESM-DBP, except for the different core algorithms. The experimental results are shown in Table 2. The model using MLP significantly outperforms the other algorithms on the DBP and TF prediction tasks. Specifically, the MCC value of MLP on the DBP (or TF) prediction task is 0.846 (or 0.903), which is 5.35, 4.06, 13.40, 59.32, 5.09, and 22.25% (or 6.61, 4.88, 9.19, 11.48, 5.12, and 7.63%) higher than those of the other six algorithms, respectively. For the DBS and DBZF predictions, the MCC values of BiLSTM are 0.462 and 0.349, improvements of 1.73 and 8.39% over the second-best algorithms, respectively. Notably, the self-attention-based network performs poorly on both tasks; the main underlying reason is its large number of parameters, which tends to cause overfitting on small datasets. Based on these experimental results, we confirm that the MLP and BiLSTM networks are appropriate classifiers for the downstream prediction tasks in this study.

Table 2 Comparison of different algorithms on the training sets of four prediction tasks over 10-fold cross validation

Performance comparison with HMM

The hidden Markov model (HMM) profile34 is widely used in studies of protein biological function and structure owing to the conservation information it encapsulates; for instance, much of the prominence of AlphaFold2 can be attributed to the high quality of its MSA information. Here, we apply the HMM feature to the four prediction tasks and compare it with the ESM-DBP feature representation. Specifically, for each protein sequence of length \(L\), we first search the Uniclust30 database37 using the HHblits program34 with an e-value of 0.1 to obtain the original HMM profile; then, to make it suitable for training neural networks, each element of the HMM profile is normalized in turn by two functions, i.e., \(g(x)={2}^{-x/1000}\) and \(f(x)=1/\left(1+{e}^{-x}\right)\), resulting in an HMM matrix of size \(L\times 30\); finally, for the DBS, DBZF, DBP, and TF predictions, the first two tasks use the normal HMM matrix and the latter two use the pseudo HMM matrix29 as inputs to models otherwise identical to ESM-DBP.
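The profile construction and two-step normalization can be sketched as follows; the HHblits invocation assumes the HH-suite is installed with a local copy of Uniclust30, and parsing the raw profile into an (L, 30) array is assumed to have been done elsewhere.

```python
# Sketch of the HMM feature pipeline. The HHblits call (in the comment below)
# assumes HH-suite and a local Uniclust30 database; parsing the .hhm file into
# an (L, 30) array is assumed done elsewhere.
#   hhblits -i query.fasta -d uniclust30 -ohhm query.hhm -e 0.1
import numpy as np

def normalize_hmm(profile: np.ndarray) -> np.ndarray:
    """Apply g(x) = 2**(-x/1000) and then f(x) = 1/(1 + e**(-x)) elementwise."""
    g = np.power(2.0, -profile / 1000.0)
    return 1.0 / (1.0 + np.exp(-g))

hmm = normalize_hmm(np.random.randint(0, 10000, size=(120, 30)))  # toy profile
```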
Table 3 presents a detailed performance comparison, demonstrating that the model using ESM-DBP feature embeddings significantly outperforms HMM across all tasks. The AUC values of ESM-DBP on the four tasks are 0.964, 0.905, 0.981, and 0.711, which are 15.86, 10.09, 7.68, and 12.50% higher than those of the HMM feature, respectively. The improved performance stems from the effective detection of evolutionary information within protein families by ESM-DBP. As an artificially computed representation of evolutionary information, HMM appears to carry no more useful knowledge than ESM-DBP, at least in the four tasks of this study.

Table 3 Performance comparison of HMM and ESM-DBP on four DBP-related prediction tasks

ESM-DBP outperforms SOTA methods on four prediction tasks

To further assess the performance of the proposed ESM-DBP on the four downstream tasks, 23 SOTA prediction methods are employed as controls, including 9 for DBP (MsDBP38, iDNAProt-ES39, LBi-DBP29, TargetDBP+40, TPSO-DBP28, iDRBP_MMC41, iDRBP_ECHF42, IDRBP-PPCT43, and iDRPro-SC44), 10 for DBS (NCBRPred45, CLAPE-DB22, ESM-NBR26, iDRNA-ITF46, DRNAPred47, iProDNA-CapsNet48, TargetDNA49, DNAPred50, GraphBind51, and PredDBR30), 3 for TF (TFPred52, DeepTFactor31, and Wang et al.'s work53), and 1 for DBZF (DeepZF32). For a fair comparison, we use consistent training datasets across the evaluated methods, assessing ESM-DBP and previous methods under similar conditions; the same training sets are used to retrain the corresponding versions of the downstream task models of ESM-DBP. Specifically, three paired training and test datasets for DBP prediction, three paired datasets for DBS prediction, one paired dataset for TF prediction, and one dataset for DBZF prediction (on which cross validation is performed) are collected (Supplementary Table 1). The overall training strategy follows the default implementation of ESM-DBP, except that the loss of label-unknown disordered residues in the DBS training set YK17-Tr is zeroed out. Performance comparisons are presented in Fig. 3.

Fig. 3: ESM-DBP outperforms SOTA methods on four prediction tasks. a Comparison of MCC and AUC values among ESM-DBP and SOTA methods on the four downstream prediction tasks. Except for DeepTFactor and iDRBP_ECHF, whose training sets are inaccessible, the training sets of the remaining control methods are consistent with ESM-DBP (separated by the dotted line); b Head-to-head comparisons between ESM-DBP and four SOTA methods on independent test sets. Each point represents a sample, and the numbers indicate how many positive or negative samples are located on the diagonal or in the upper or lower triangle. The axes for DBS represent the MCC value of each protein, and the axes of the other three panels indicate the probability of a sample being predicted as positive. The DBZF comparison contains only 781 positive samples and 612 negative samples, as these are all that appear in the DeepZF results file; c The DBS prediction results of ESM-DBP and SOTA methods on chain C of a trimer (PDB ID: 4ZCF) with an MSA depth of 253 against the Uniclust30 database37 using the HHblits program34. The protein structures are drawn using pymol81. Source data are provided as a Source Data file.

Figure 3a compares the MCC and AUC values of ESM-DBP and the SOTA methods across the four downstream prediction tasks. Each subfigure is delineated with a dashed line to indicate the methods trained on the same training set, except for DeepTFactor and iDRBP_ECHF, whose training sets are not accessible. The training set of DeepTFactor encompasses most TFs in UniProtKB, and iDRBP_ECHF shares the same positive samples as iDRBP_MMC, ensuring that comparisons with these methods do not overestimate the performance of ESM-DBP. Notably, the upper portion of Fig. 3a shows that ESM-DBP (highlighted in lavender) consistently achieves higher MCC values than the other predictors across all tasks and training sets. This robust evaluation metric underscores that ESM-DBP surpasses the other methods even on unbalanced test sets. Among the 10 DBS prediction methods, two are PLM-based like ESM-DBP: protTrans-based CLAPE-DB and ESM2-based ESM-NBR. For a thorough comparison, we retrain our prediction models using the sequence embeddings from protTrans and ESM2, respectively. When the underlying PLM is held constant, the proposed model yields MCC values comparable to CLAPE-DB and slightly lower than ESM-NBR. In contrast, using the enhanced pretrained ESM-DBP sequence embeddings as input features significantly improves the MCC values over the two control methods, reaffirming the critical impact of domain-adaptive pretraining. Regarding the overall evaluation metric AUC, ESM-DBP is superior to the other methods in all comparisons, except that it is marginally lower than GraphBind on the DBS prediction task. We hypothesize that the potential advantage of GraphBind stems from its incorporation of protein 3D structures, which directly shape DNA-protein interaction patterns.

Figure 3b illustrates head-to-head comparisons between ESM-DBP and four recent methods, i.e., CLAPE-DB, DeepZF, DeepTFactor, and LBi-DBP. In most cases, ESM-DBP achieves better prediction results than any of them.
For DBP prediction, out of 381 DBPs (or non-DBPs), there are 183 (or 254) cases in which ESM-DBP assigns a higher (or lower) prediction probability than LBi-DBP. For DBS prediction, of the 129 DBPs, there are 92 cases in which ESM-DBP attains a higher MCC value than CLAPE-DB. These experimental results further support the validity and advancement of the proposed method. In addition, chain C of a trimer (PDB ID: 4ZCF) with an MSA depth of 253 against Uniclust30 serves as a case study for evaluating the performance of five DBS prediction methods on sequences with extensive MSA coverage. As illustrated in Fig. 3c, the three methods reliant on evolutionary information fail to deliver accurate predictions. This may be attributed to the structural complexity of trimer-DNA complexes, coupled with the length of the amino acid sequences, which may hinder the predictive capability of these methods. Conversely, the PLM-based prediction methods, CLAPE-DB (MCC: 0.159) and ESM-DBP (MCC: 0.440), outperform the evolutionary information-based methods, underscoring the robust DBS prediction capacity of PLMs for large complexes. Notably, the superior performance of ESM-DBP over CLAPE-DB likely results from its retraining on DBP sequence data, which enhances its sensitivity to DNA-binding regions. In summary, ESM-DBP exhibits advanced prediction performance across all four tasks, proving superior or comparable to existing SOTA methods.

ESM-DBP explores the DNA-binding domains

The ability of a protein to bind DNA mainly comes from its DNA-binding domain (DBD), such as the HMG box, Homeobox, Fork-head, and Nuclear receptor domains. Even for transcription factors, the ability to bind specific DNA sequences alone usually reflects the ability to regulate transcription1. ESM-1b reported the capability of a PLM to explore remote homology and the MSA information within the protein family of a sequence. The pretraining dataset of 170,264 DBPs in this study contains an abundance of DBD sequences (see Supplementary Fig. 1). In general, protein sequences containing DBDs of the same class have higher homology and closer evolutionary distance. Theoretically, a PLM can distinguish DBPs from non-DBPs by capturing the commonalities of the DBDs in these similar DBP sequences. To validate this idea, we first feed the 142,657 single-domain protein sequences in UniDBP40 whose domain annotations are available in UniProt into the ESM-DBP model to derive sequence embedding features; the T-SNE algorithm is then used to visualize these embeddings. In Fig. 4a, families with abundant sequences, such as HTH and Homeobox, cluster effectively and can be clearly distinguished. This supports the idea that domain-adaptive pretraining allows the PLM to pay more attention to protein families with a high frequency of occurrence than to common proteins. In addition, for the TFs in the test set, we use the integrated gradient algorithm54 to quantify the contribution of the residue at each sequence position to the prediction result (see Supplementary Data 1). In Fig. 4b, five single-DBD human TFs (UniProtKB ID: P35713, Q9UBX0, P58021, Q9NX45, and Q10587), containing the five typical DBDs HMG box, Homeobox, Fork-head, bHLH, and bZIP, respectively, as well as one multi-DBD TF (UniProt ID: P51449) containing two domains, i.e., Nuclear receptor and NR LBD, are used as demonstration cases.
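Per-residue attribution of this kind can be sketched as follows, assuming the Captum library (cited for Fig. 4); the classifier head, the input embedding, and the zero baseline are toy placeholders rather than the paper's exact configuration.

```python
# Sketch of integrated-gradient residue attribution, assuming Captum; the
# classifier head and input embedding are toy placeholders.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

L, D = 50, 1280
classifier = nn.Sequential(nn.Flatten(), nn.Linear(L * D, 1), nn.Sigmoid())

def tf_probability(emb: torch.Tensor) -> torch.Tensor:  # emb: (B, L, D)
    return classifier(emb).squeeze(-1)                  # (B,) probabilities

embedding = torch.randn(1, L, D, requires_grad=True)    # stand-in for ESM-DBP output
ig = IntegratedGradients(tf_probability)
attr = ig.attribute(embedding, baselines=torch.zeros_like(embedding))
saliency = attr.sum(dim=-1)                             # (1, L) per-residue scores
```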
Looking at Fig. 4b, the most striking observation is that residues in the DBD region have significantly higher saliency scores than those in the non-DBD region, regardless of the protein. For instance, residues between indexes 85 and 153 of protein P35713 are markedly darker than the remaining residues and form a clear linear band. This suggests that residues in the DBD region contribute disproportionately to the prediction, and that it is the perception of this sequence fragment that lets the PLM identify the TF. We also note that not all residues in the DBD region have significant saliency scores; in fact, only a small number of residues have a large effect, and on Q10587 only one residue is prominently colored. Moreover, a residue in a non-DBD region also contributes visibly on Q9UBX0. After all, the DBD and non-DBD parts of a protein sequence do not exist in isolation but interact; residues in the non-DBD region may also affect the DNA-binding pattern, although in most cases this effect is much weaker than that of residues in the DBD region. The results for the multi-domain protein P51449 show that residues within both domains positively influence the final prediction, indicating that ESM-DBP not only probes the knowledge of single-domain proteins but also captures the specific information of the different DBDs of multi-domain proteins.

Fig. 4: ESM-DBP explores the DNA-binding domains. a Visualization of the ESM-DBP sequence embeddings of different DBD families in UniDBP40; b Saliency maps of six human TFs. The solid black line indicates the DBD region recorded in UniProt. The horizontal and vertical axes are the index of the residue in the protein sequence and of the transformer block in ESM-DBP, respectively. Each value at coordinates \((x,y)\) is the saliency score of the xth residue in the yth transformer block for the prediction result, calculated by the integrated gradient algorithm as implemented in the Captum library83; c Comparison of the prediction results of the full feature (diamonds), the DBD feature (circles), and the non-DBD feature (triangles) on 91 test human TFs. The DBD (or non-DBD) feature is obtained by replacing all residues outside (or inside) the DBD region with the <mask> token before inputting the protein sequence into ESM-DBP. Source data are provided as a Source Data file.

To further demonstrate the role of the DBD in the predictive performance of ESM-DBP, we compare the performance of three different features on 97 test human TFs, obtained by inputting the complete protein sequence, the sequence of the DBD region, and the sequence of the non-DBD region into ESM-DBP, respectively. In Fig. 4c, we can clearly observe that in most cases the prediction probability of the non-DBD feature drops drastically compared with the full feature, since the critical DNA-binding pattern identification information is missing once the sequence of the DBD region is masked out. In contrast, the prediction results of the DBD feature differ little from those of the full feature, owing to the integrity of the identification information in the DBD region. Undoubtedly, the inference of ESM-DBP depends greatly on the sequence fragments of the DBD region.
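The masking used for Fig. 4c can be sketched as follows; the sequence and domain boundaries are toy values, and it is assumed (to be verified against the installed tokenizer) that the literal <mask> token in the input string is recognized as the mask token.

```python
# Sketch of building the DBD / non-DBD inputs for Fig. 4c; the sequence and
# domain boundaries are toy values, and recognition of the literal "<mask>"
# string by the tokenizer is an assumption.
def keep_region(seq: str, start: int, end: int, keep_inside: bool = True) -> str:
    """Replace residues outside (or inside) [start, end] with <mask>."""
    out = []
    for i, aa in enumerate(seq):
        inside = start <= i <= end
        out.append(aa if inside == keep_inside else "<mask>")
    return "".join(out)

seq = "MSKLEQVLHQRPWSELVKG"                                   # toy sequence
dbd_feature = keep_region(seq, 4, 11, keep_inside=True)       # non-DBD masked
non_dbd_feature = keep_region(seq, 4, 11, keep_inside=False)  # DBD masked
```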
To sum up, ESM-DBP perceives the various DBDs hiding DNA-binding patterns in the large number of DBP sequences in the pretraining dataset, which enables accurate recognition of DBPs and TFs.

ESM-DBP generalizes well

In the section "ESM-DBP explores the DNA-binding domains", we noted that ESM-DBP learns specific DBD knowledge from clusters of similar protein sequences to improve the prediction performance of related tasks. In nature, some orphan proteins with a short evolutionary history lack homologous proteins; this is the main factor limiting the performance of most protein function prediction methods, including AlphaFold2. Here, we examine the relationship between the sequence similarity of a test protein to the pretraining dataset and the prediction performance. Specifically, for the 381 positive samples in the DBP test set, UniDBP40 and UniRef50 are used as target databases for homologous-sequence searches with the Blast program at an e-value of 1e-06 (see Supplementary Data 2).

From Fig. 5a, out of 381 DBPs, most proteins have only a small number of high-similarity sequences in UniDBP40, and 155 have none at all; owing to the abundance and diversity of sequences in UniRef50, a large number of similar sequences can be found there for most of the test proteins. Eighteen DBPs (marked with pentagrams) with a prediction difference between ESM-DBP and ESM2 greater than 0.4 appear in the gray highlighted region of Fig. 5b; most of them have few similar sequences in UniDBP40. In Fig. 5c, considering the proteins with abundant similar sequences in UniDBP40 and in UniRef50 (located in the upper and right halves, respectively), all proteins in the former group are predicted well, whereas the latter group contains some poorly predicted proteins. Given the excessive number of sequences in UniRef50, ESM2 cannot adequately focus on all DBPs; by contrast, ESM-DBP mines as much valid information as possible from UniDBP40 through domain-adaptive pretraining. Looking closely at the proteins at the bottom, with low sequence similarity to UniDBP40, most still receive good predicted probabilities. Figure 5d shows the prediction results for the 155 DBPs without any similar sequence in UniDBP40 and the 226 DBPs with homologous proteins there. From the two sets of box plots at the top of Fig. 5d, both ESM2 and ESM-DBP yield much higher predicted probabilities on high-homology proteins than on low-homology ones, since the models encounter similar sequence fragments many times during pretraining and thus attend to them more. Comparing ESM2 and ESM-DBP on the low-homology proteins, the overall distribution of ESM-DBP is significantly higher. These observations demonstrate that ESM-DBP mines key knowledge even from DBPs with only a small number of similar sequences, further enhancing the DNA-binding pattern characterization of ESM2 for better prediction.
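The homology screen described above can be sketched as follows, assuming NCBI BLAST+ with the two pretraining sets formatted via makeblastdb; file and database names are placeholders.

```python
# Sketch of the homology screen, assuming NCBI BLAST+ and that UniDBP40 and
# UniRef50 have been formatted with makeblastdb; paths are placeholders.
import subprocess

def count_hits(query_fasta: str, db: str) -> int:
    """Run blastp at e-value 1e-06 and count unique subject sequences hit."""
    out = subprocess.run(
        ["blastp", "-query", query_fasta, "-db", db,
         "-evalue", "1e-06", "-outfmt", "6 sseqid"],
        capture_output=True, text=True, check=True).stdout
    return len(set(out.split()))

n_unidbp40 = count_hits("test_dbp.fasta", "UniDBP40")
n_uniref50 = count_hits("test_dbp.fasta", "UniRef50")
```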
Fig. 5: ESM-DBP performs well on DBPs with few homologous proteins. a Statistics of the number of homologous proteins of the 381 DBPs in the independent test set against UniDBP40 and UniRef50, respectively, obtained with the Blast program35 at an e-value of 1e-06; b Head-to-head comparison of the prediction probabilities of ESM-DBP and ESM2 on the 381 DBPs in the independent test set. Pentagrams in the grey shading indicate the 18 DBPs for which the difference between the predicted probabilities of ESM-DBP and ESM2 exceeds 0.4; c Relationship between the prediction probability of a DBP and its homology to the two pretraining datasets; d Prediction results of ESM2 and other methods on the DBPs in the independent test set, where high-homology DBPs (n = 226) are those with similar sequences in UniDBP40 (number > 0) and low-homology DBPs (n = 155) have none. Centre line means median, bounds of box indicate 25th and 75th percentiles, and whiskers indicate minima and maxima; e ESM-DBP learns discriminatory knowledge from three different types of low-homology proteins, i.e., a bZIP protein, a disordered protein, and a protein without an annotated domain. NUniDBP40 and NUniRef50 denote the numbers of similar sequences against UniDBP40 and UniRef50, respectively. "Probability" indicates the probability of being predicted as a DBP by ESM-DBP. Source data are provided as a Source Data file.

We also compare ESM-DBP with the SOTA DBP prediction methods, which mostly rely on high-quality MSA information, on test DBPs of different homology levels (see Fig. 5d). Most of the other predictors show a considerable decrease in prediction scores on low-homology DBPs compared with high-homology DBPs; the limited available evolutionary information is the main constraint on the generalizability of these traditional methods. In contrast, after continued training and full exploration of a large number of DBP sequences, ESM-DBP is able to capture valid discriminatory information even for DBPs with only a small number of homologous sequences. The leading performance of ESM-DBP on these DBPs further demonstrates a generalizability that the other methods lack.

We further apply the integrated gradient algorithm to the DBPs in UniSwiss-Tst whose numbers of similar sequences against UniDBP40 and UniRef50 are 0 and fewer than 10, respectively (see Supplementary Fig. 2). In Fig. 5e, three such cases (UniProt ID: Q3KSS8, P16790, and Q9GKQ4) demonstrate the ability of ESM-DBP to learn the discriminatory information of various low-homology proteins. For ease of observation, the second dimension of the saliency matrix is summed to obtain a vector with the same length as the protein sequence, where the height at each position indicates the contribution of that residue to the prediction of whether the sequence is a DBP. The first two cases contain a bZIP domain and a disordered region, respectively; the third has no domain recorded in the UniProt database. The saliency maps of the first two cases show the high sensitivity of ESM-DBP to the bZIP domain and the disordered region. bZIP is a typical DBD common among eukaryotic TFs. Previous studies have reported the important effects of intrinsically disordered regions on DNA-protein interactions55,56,57,58,59, and the C-terminal region of P16790 (DNA polymerase processivity factor) is required for viral DNA synthesis60. The specific recognition of both highlights the capability of ESM-DBP to draw valid knowledge from protein domains in low-homology sequences. The third case exemplifies the ability of ESM-DBP to identify a DBP even in the absence of a typical DBD in an orphan protein.
The positive saliency scores of the vast majority of residues across the entire sequence highlight the ability of ESM-DBP to capture recognition knowledge from the global sequence rather than from a local, specific protein domain. These results firmly support the idea that ESM-DBP can learn DBP-discriminative information for proteins with few homologous sequences. The generalization ability demonstrated by ESM-DBP offers a potential new perspective on the structure and function prediction of orphan proteins.

Validation of prediction by ChIP-seq

To further validate the efficacy of ESM-DBP, we apply it to a dataset of 183,801 unreviewed human sequences with low annotation scores retrieved from the UniProtKB/TrEMBL database. Among these sequences, 14,412 (7.8%) are predicted as DBPs by ESM-DBP. Additionally, within this dataset, 974 putative DBPs are inferred primarily through homologous sequence analysis. Remarkably, ESM-DBP correctly predicts 946 (97.1%) of these putative DBPs, underscoring its robustness in identifying proteins with homologous DBPs. Nonetheless, predicting sequences with limited homology has historically posed a challenge for computational methods.

To experimentally evaluate the performance of ESM-DBP in predicting DBPs with low homology, the Blast program is first used to search for homologous sequences of the predicted DBPs against UniRef50, and proteins with low homology are screened out. After assessing experimental feasibility, we select two sequences (UniProt IDs: E5RK24 and K7EK85; Gene IDs: SUPT6H and GTF2E2) that lack sufficient experimental evidence and function annotation, and perform ChIP-seq in the HeLa cell line using rabbit antibodies against SUPT6H (CST, #15616, 1:50 dilution) and GTF2E2 (GeneTex, Cat. No. GTX105029, 1:100 dilution). The numbers of similar sequences of E5RK24 and K7EK85 against UniRef50 are 19 and 2, respectively, and no ChIP-seq data for either protein in the HeLa cell line are recorded in the Cistrome database61.

The MACS tool62 with a P-value threshold of 0.001 is employed to screen significant binding peaks of the two proteins (see Supplementary Data 3, 4).
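Peak calling of this kind can be sketched as follows, assuming MACS2 and aligned BAM files; the file names and the input control sample are placeholders.

```python
# Sketch of the peak-calling step, assuming MACS2 and aligned BAM files; file
# names and the input control are placeholders.
import subprocess

for name in ["SUPT6H", "GTF2E2"]:
    subprocess.run(
        ["macs2", "callpeak",
         "-t", f"{name}_chip.bam",   # ChIP sample
         "-c", "hela_input.bam",     # input control (assumed)
         "-f", "BAM",
         "-g", "hs",                 # human effective genome size
         "-p", "0.001",              # P-value threshold used in the study
         "-n", name],
        check=True)
```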
The widespread distribution of peaks across chromosomes confirms extensive interactions with DNA (see Fig. 6a). In total, 2391 and 5348 binding peaks are identified for SUPT6H and GTF2E2, respectively, with peak lengths ranging from 200 bp to 1200 bp (see Supplementary Fig. 3). We also count the distribution of binding peaks over the functional elements of genes (see Fig. 6b). The presence of exon- and promoter-associated binding peaks indicates a role for the two proteins in the regulation and expression of genes. Several peaks are depicted in more detail for each protein (Supplementary Figs. 4, 5). For example, GTF2E2 plays an oncogenic role in several cancers63,64,65. We find that peak_1268 of GTF2E2 is located in the gene body of TNK2 (also known as activated Cdc42-associated kinase 1, ACK1) (see Supplementary Fig. 4a), which confers metastatic properties on cancer cells and promotes tumor growth by regulating tumor suppressors such as WWOX and pro-survival factors such as AKT66,67. Also, peak_94 of GTF2E2 is located in an exon of PLA2G5 (see Supplementary Fig. 4b); this gene regulates the secretion of phospholipase A2 group V, which is considered to induce inflammation in neighboring cells68. These binding cases may provide further evidence of the crucial oncogenic influence of GTF2E2 through regulation of the oncogene AKT and of the inflammation process. For SUPT6H, we find that its peak_67 is located in the promoter region of KCNQ4 (see Supplementary Fig. 5a), which encodes potassium voltage-gated channel subfamily KQT member 4; recent research highlights its function in inhibiting the activities of JAK2 and p-AKT and its role in suppressing the growth of breast cancer69. Peak_334 (see Supplementary Fig. 5b) reveals that SUPT6H binds to PLEK, a gene encoding the Pleckstrin protein, which is involved in processes such as the G protein-coupled receptor signaling pathway and positive regulation of supramolecular fiber organization.

Fig. 6: ChIP-seq analysis of SUPT6H and GTF2E2. a The distribution of binding peaks on the human genome; b The distributions of gene functional elements among peak-associated genes; c GO enrichment analysis of candidate target genes; d KEGG enrichment analysis of candidate target genes. P-values in (c) and (d) are calculated using the one-sided hypergeometric distribution. No adjustments are made for multiple comparisons. P-value < 0.05 is regarded as statistically significant. Source data are provided as a Source Data file.

To further investigate the regulatory influence of the two proteins, the candidate target genes of SUPT6H and GTF2E2 are obtained, and Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses are performed (Supplementary Data 5, 6, Supplementary Figs. 6, 7). The candidate target genes are defined as those whose transcription start site (TSS) is closest to the midpoint of a binding peak. The results show that the candidate target genes of SUPT6H are involved in the regulation of RNA interference, cellular localization, protein complex assembly, and cell morphogenesis involved in differentiation (see the left half of Fig. 6c). These functions are consistent with previous findings that SUPT6H controls cellular differentiation through multiple epigenomic mechanisms70. In addition, the candidate target genes are enriched in cAMP signaling, basal transcription factors, and several metabolism pathways (see the left half of Fig. 6d). Correspondingly, functional enrichment analysis reveals the potential role of GTF2E2 in multiple cancer progression-associated processes, such as cell motility, cell migration, localization of cell, cell proliferation, and cell fate determination (see the right half of Fig. 6c). Its candidate target genes are also significantly enriched in crucial cancer-related signaling pathways such as Ras and VEGF signaling (see the right half of Fig. 6d). On the whole, analysis of the downstream functional impact of the identified DBPs may support broad exploration of the unknown regulatory processes of many proteins.
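The enrichment test behind the GO/KEGG analysis can be sketched as follows, assuming SciPy; all counts are illustrative.

```python
# Sketch of the one-sided hypergeometric enrichment test behind the GO/KEGG
# analysis, assuming SciPy; all counts are illustrative.
from scipy.stats import hypergeom

M = 20000   # background: all annotated genes
n = 150     # genes annotated with the term of interest
N = 500     # candidate target genes
k = 12      # candidate target genes carrying the term

p_value = hypergeom.sf(k - 1, M, n, N)   # P(X >= k)
significant = p_value < 0.05             # threshold used in Fig. 6c, d
```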
Error analysis

Despite the notable advances of ESM-DBP in DBP identification, certain limitations persist; in particular, a small subset of sequences in the test set is predicted inaccurately. To pave the way for future research and encourage further improvement, we undertake a detailed error analysis. Specifically, we first collect the sequences that ESM-DBP fails to predict on the DBP recognition task, comprising 30 False Negatives (FNs) and 25 False Positives (FPs) (Supplementary Table 2). Then, to understand why the model judges that a target sequence binds (or does not bind) DNA, the integrated gradients algorithm is used to identify the residues that contribute most to the prediction. By combining the sequence alignment results from Blast with the domain information recorded in UniProtKB, we find that for FNs, the model focuses on prominent domain features with non-DNA-binding potential learned from the negative samples of the training set. That is, if the target DBP lacks an intense DNA-binding feature (i.e., it does not possess a conserved DNA-binding domain (DBD)) and contains irrelevant non-DBDs that are abundant in the foundational pretraining set, the model tends to classify it erroneously as a non-DBP. For example, P25294 and P25555 are two DBPs from the test set UniSwiss-Tst. For P25294 (Protein SIS1), the saliency map reveals that the prediction depends mainly on the J domain and two residues at the C-terminus of the protein (see Fig. 7a). According to the Blast alignment, the original pretraining set of ESM2, i.e., UniRef50, contains 500 sequences similar to P25294, of which 434 (86.8%) contain the J domain, providing the prediction model with ample representative knowledge of this domain. However, the J domain has not been reported to be a canonical DBD. Meanwhile, we find 5 negative samples in our training set UniSwiss-Tr containing local regions similar to the J domain and the C-terminus of P25294. Taken together, this evidence suggests that the model inherits substantial knowledge of the typical regions characterizing negative samples, resulting in the FNs. Similarly, for P25555 (serine/arginine (SR)-type shuttling mRNA binding protein GBP2), the erroneous prediction stems mainly from an undue focus on the RRM domain found in negative samples of UniSwiss-Tr (see Fig. 7b).

Fig. 7: Error analysis. a, b The saliency map, domain information, and MSA coverage of the two FNs P25294 and P25555. c, d The saliency map and domain information of the two FPs Q95JK0 and A6ZJ55. Taking P25294 as an example, NUniSwiss-Tr-Nega (or NUniSwiss-Tr-Posi) denotes the number of sequences similar to P25294 among the negatives (or positives) of the DBP training set UniSwiss-Tr. NJ domain/NUniRef50 indicates that, of the 500 sequences similar to P25294 in UniRef50, 434 contain the J domain. "Probability" means the probability that the target protein is predicted to be a DBP. Source data are provided as a Source Data file.

For FPs, the model is misled by representative residues with a possible positive effect on DNA-protein interactions. The proteins Q95JK0 and A6ZJ55 are two non-DBPs in the test set, as there is no prior evidence of their binding potential with DNA. We notice that the residues contributing most to the judgment are mainly located in the 3CxxC-type Zinc finger and "Basic and acidic residues" regions, respectively (see Fig. 7c, d). Previous studies71,72,73,74,75 have shown that CxxC-type Zinc fingers and basic and acidic residues have a positive effect on DNA-protein interactions. Again, the model inherits substantial knowledge of these typical regions from the original PLM: for Q95JK0, there are 257 similar sequences in the original training set UniRef50, of which 180 (70.03%) contain the 3CxxC-type Zinc finger region; for A6ZJ55, there are 10 similar sequences in UniRef50, of which 8 (80%) contain the "Basic and acidic residues" region.
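The Blast-plus-domain bookkeeping used in this diagnosis can be sketched as follows; the domain lookup table and hit list are toy placeholders standing in for UniProtKB annotations and Blast output.

```python
# Sketch of the FN/FP diagnostic: among the Blast hits of a mispredicted
# protein, count the fraction carrying a given UniProt domain. The lookup
# table and hit list are toy placeholders.
def domain_fraction(hit_ids, domain, uniprot_domains):
    """uniprot_domains maps accession -> set of annotated domain names."""
    n_hits = len(hit_ids)
    n_with = sum(domain in uniprot_domains.get(h, set()) for h in hit_ids)
    return n_with / n_hits if n_hits else 0.0

hits = ["A1", "A2", "A3"]                      # e.g. UniRef50 hits of a test protein
domains = {"A1": {"J"}, "A2": {"J"}, "A3": {"RRM"}}
frac = domain_fraction(hits, "J", domains)     # 2/3 of the hits contain the J domain
```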
However, we do not find any sequences similar to these two FPs among the positives of UniSwiss-Tr, even when the threshold of the sequence similarity analysis is relaxed (the e-value of the Blast program is raised to 1e-01). We speculate that the PLM may have incorporated, during pretraining, deep DNA-recognition information related to the two motifs that is not easily detected by simple sequence alignment. In any case, the incorrect prediction of FPs evidently stems from a heightened focus on salient motifs that have a potential positive influence on DNA binding.

Based on the above studies, a future breakthrough in DBP identification may lie in preventing the prediction model from over-concentrating on interfering motifs or domains that do not primarily determine DNA binding. That is, the model needs to go further and accurately mine the patterns that actually determine DNA-protein interactions.
