Post-translational modification prediction via prompt-based fine-tuning of a GPT-2 model

PTMGPT2 implements a prompt-based approach for PTM prediction

We introduce an end-to-end deep learning framework, depicted in Fig. 1, utilizing a GPT as the foundational model. Central to our approach is the prompt-based fine-tuning of the PROTGPT2 model in an unsupervised manner. This is achieved by utilizing informative prompts during training, enabling the model to generate accurate sequence labels. The design of these prompts is a critical aspect of our architecture, as they provide essential instructional input to the pretrained model, guiding its learning process. To enhance the explanatory power of these prompts, we have introduced four custom tokens to the pre-trained tokenizer, expanding its vocabulary size from 50,257 to 50,264. This modification is particularly significant due to the tokenizer's reliance on the Byte Pair Encoding (BPE) algorithm14. A notable consequence of this approach is that our model goes beyond annotating individual amino acid residues; instead, it annotates variable-length protein sequence motifs. This strategy is pivotal as it preserves evolutionarily conserved biological functionality, allowing for a more nuanced and biologically relevant interpretation of protein sequences.

Fig. 1: Schematic representation of the PTMGPT2 framework. A Preparation of inputs for PTMGPT2, detailing the extraction of protein sequences from UniProt and the generation of five distinct prompt designs. B Method-specific data preparation process for the benchmark, depicting both modified and unmodified subsequence extraction, followed by the creation of a training dataset using CD-HIT at 30% sequence similarity. C Architecture of the PTMGPT2 model and the training and inference processes, highlighting the integration of custom tokens into the tokenizer, the resizing of token embeddings, and the prompt design used during training and inference to generate predictions.

In the PTMGPT2 framework, we employ a prompt structure that incorporates four principal tokens. The first, designated as the 'SEQUENCE:' token, represents the specific protein subsequence of interest. The second, known as the 'LABEL:' token, indicates whether the subsequence is modified ('POSITIVE') or unmodified ('NEGATIVE'). This token-driven prompt design forms the foundation for the fine-tuning process of the PTMGPT2 model, enabling it to accurately generate labels during inference. A key aspect of this model lies in its architectural foundation, which is based on GPT-2 (ref. 15). This architecture is characterized by its exclusive use of decoder layers, with PTMGPT2 utilizing a total of 36 such layers, consistent with the pretrained model. This maintains architectural consistency while fine-tuning for our downstream task of PTM site prediction. Each of these layers is composed of masked self-attention mechanisms16, which ensure that during the training phase, the protein sequence and custom tokens can be influenced only by their preceding tokens in the prompt. This is essential for maintaining the autoregressive property of the model. Such a method is fundamental to our model's ability to accurately generate labels, as it preserves the sequential integrity of biological sequence data and its dependencies with custom tokens, ensuring that the predictions are biologically relevant.
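As a concrete illustration, the sketch below shows how the tokenizer extension and prompt assembly could be implemented with the Hugging Face transformers API. It assumes the publicly available ProtGPT2 checkpoint ('nferruz/ProtGPT2') and that the vocabulary growth from 50,257 to 50,264 is accounted for by the four task tokens plus start, end, and padding tokens; all names here are illustrative, not the authors' released code.

# Minimal sketch (not the authors' released code): extend the ProtGPT2
# tokenizer with the custom prompt tokens and resize the embedding matrix.
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = GPT2LMHeadModel.from_pretrained("nferruz/ProtGPT2")

# The four task tokens from the paper; start, end, and padding tokens are
# assumed to account for the remaining vocabulary growth (50,257 -> 50,264).
tokenizer.add_special_tokens({
    "bos_token": "<startoftext>",
    "eos_token": "<endoftext>",
    "pad_token": "<pad>",
    "additional_special_tokens": ["SEQUENCE:", "LABEL:", "POSITIVE", "NEGATIVE"],
})
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized

def make_prompt(subseq: str, label: str | None = None) -> str:
    """Training prompt when a label is given; masked inference prompt otherwise."""
    core = f"<startoftext>SEQUENCE: {subseq} LABEL:"
    return f"{core} {label}<endoftext>" if label else core

# Toy 21-length subsequence centered on a candidate lysine (K).
print(make_prompt("AAAAAAAAAAKAAAAAAAAAA", "POSITIVE"))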
A key distinction in our approach lies in the methodology we employed for prompt-based fine-tuning during the training and inference phases of PTMGPT2. During the training phase, PTMGPT2 is engaged in an unsupervised learning process. This approach involves feeding the model with input prompts and training it to output the same prompt, thereby facilitating the learning of token relationships and context within the prompts themselves. This process enables the model to generate the next token based on the patterns learned during training between protein subsequences and their corresponding labels. The approach shifts during the inference phase, where the prompts are modified by removing the 'POSITIVE' and 'NEGATIVE' tokens, effectively turning these prompts into a fill-in-the-blank exercise for the model. This strategic masking triggers PTMGPT2 to generate the labels independently, based on the patterns and associations it learned during the training phase. An essential aspect of our prompt structure is the consistent inclusion of the '<startoftext>' and '<endoftext>' tokens. These tokens are integral to our prompts, signifying the beginning and end of the prompt and helping the model to contextualize the input more effectively. This interplay of training techniques and strategic prompt structuring enables PTMGPT2 to achieve high prediction accuracy and efficiency. Such an approach sets PTMGPT2 apart as an advanced tool for protein sequence analysis, particularly in predicting PTMs.
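To make the fill-in-the-blank inference concrete, the following sketch shows one plausible way to obtain a label from the fine-tuned model by comparing the next-token scores of the two label tokens; this greedy two-way comparison is an illustrative assumption, not necessarily the decoding strategy used by PTMGPT2.

import torch

@torch.no_grad()
def predict_label(model, tokenizer, subseq: str) -> str:
    """Complete the blank after 'LABEL:' by comparing the two label tokens."""
    prompt = f"<startoftext>SEQUENCE: {subseq} LABEL:"   # label tokens removed
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]               # next-token scores
    pos_id = tokenizer.convert_tokens_to_ids("POSITIVE")
    neg_id = tokenizer.convert_tokens_to_ids("NEGATIVE")
    return "POSITIVE" if logits[pos_id] > logits[neg_id] else "NEGATIVE"

Restricting the choice to the 'POSITIVE' and 'NEGATIVE' token ids avoids free-running generation and any tokenization ambiguity in the completed prompt.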
Effect of prompt design and fine-tuning on PTMGPT2 performance

We designed five prompts with custom tokens ('SEQUENCE:', 'LABEL:', 'POSITIVE', and 'NEGATIVE') to identify the most efficient one for capturing complexity, allowing PTMGPT2 to learn and process specific sequence segments for more meaningful representations. Initially, we crafted a prompt that integrates all custom tokens with a 21-length protein subsequence. Subsequent explorations were conducted with a 51-length subsequence and with a 21-length subsequence split into groups of k-mers, each with and without the custom tokens. Considering that the pre-trained model was originally trained solely on protein sequences, we fine-tuned it with prompts both with and without the tokens to ascertain their actual contribution to improving PTM predictions.

Upon fine-tuning PTMGPT2 with training datasets for arginine (R) methylation and tyrosine (Y) phosphorylation, it became evident that the prompt containing the 21-length subsequence and the four custom tokens yielded the best results in generating accurate labels, as shown in Table 1. For methylation (R), the MCC, F1 score, precision, and recall were 80.51, 81.32, 95.14, and 71.01, respectively; for phosphorylation (Y), they were 48.83, 46.98, 30.95, and 97.51. We therefore used the 21-length sequence with custom tokens for all subsequent experiments. The inclusion of the 'SEQUENCE:' and 'LABEL:' tokens provided clear contextual cues to the model, allowing it to understand the structure of the input and the expected output format. This helped the model differentiate between the sequence data and the classification labels, leading to better learning and prediction accuracy. The 21-length subsequence proved an ideal size, neither too short to miss important context nor too long to introduce noise. By framing the task clearly with the 'SEQUENCE:' and 'LABEL:' tokens, the model faced less ambiguity in generating predictions, which is particularly beneficial for complex tasks such as PTM site prediction.

Table 1 Benchmark results of PTMGPT2 after fine-tuning for optimal prompt selection
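For reference, the reported metrics can be reproduced from predicted and true labels using their standard definitions; the snippet below is a minimal sketch using scikit-learn, with toy labels standing in for real predictions and values scaled to percentages as in Table 1.

# Toy example only; Table 1 reports these metrics as percentages.
from sklearn.metrics import (matthews_corrcoef, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = POSITIVE (modified), 0 = NEGATIVE
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # model outputs mapped to binary labels

print(f"MCC:       {100 * matthews_corrcoef(y_true, y_pred):.2f}")
print(f"F1:        {100 * f1_score(y_true, y_pred):.2f}")
print(f"Precision: {100 * precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {100 * recall_score(y_true, y_pred):.2f}")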
Comparative benchmark analysis reveals PTMGPT2's dominance

To validate PTMGPT2's performance, benchmarking against a database that encompasses a broad spectrum of experimentally verified PTMs and annotates potential PTMs for all UniProt17 entries was imperative. Accordingly, we chose the DBPTM database18 for its extensive collection of benchmark datasets tailored to distinct types of PTMs. The inclusion of highly imbalanced datasets from DBPTM proved particularly advantageous, as it enabled a precise evaluation of PTMGPT2's ability to identify unmodified amino acid residues. This capability is crucial, considering that the majority of residues in a protein sequence typically remain unmodified. For a thorough assessment, we sourced 19 distinct benchmarking datasets from DBPTM, each containing a minimum of 500 data points corresponding to a specific PTM type.

Our comparative analysis underscores PTMGPT2's capability in predicting a variety of PTMs, marking substantial improvements when benchmarked against established methodologies using MCC as the metric, as shown in Table 2. For instance, in the case of lysine (K) succinylation, Succ-PTMGPT2 achieved a notable 7.94% improvement over LM-SuccSite. In the case of lysine (K) sumoylation, Sumoy-PTMGPT2 surpassed GPS Sumo by 5.91%. The trend continued with N-linked glycosylation on asparagine (N), where N-linked-PTMGPT2 outperformed Musite-Web by 5.62%. RMethyl-PTMGPT2, targeting arginine (R) methylation, surpassed Musite-Web by 12.74%. Even in scenarios with marginal gains, such as lysine (K) acetylation, where KAcetyl-PTMGPT2 edged out Musite-Web by 0.46%, PTMGPT2 maintained its lead. PTMGPT2 exhibited robust performance for lysine (K) ubiquitination, surpassing Musite-Web by 5.01%, and achieved a 9.08% higher score in predicting O-linked glycosylation on serine (S) and threonine (T) residues. For cysteine (C) S-nitrosylation, the model outperformed PresSNO by 4.09%. In lysine (K) malonylation, PTMGPT2 exceeded DL-Malosite by 3.25%, and for lysine (K) methylation, it scored 2.47% higher than MethylSite. Although PhosphoST-PTMGPT2's performance in serine/threonine (S, T) phosphorylation prediction was 16.37% lower than Musite-Web's, it excelled in tyrosine (Y) phosphorylation with an MCC of 48.83%, notably higher than Musite-Web's 40.83% and CapsNet's 43.85%. In the case of cysteine (C) glutathionylation and lysine (K) glutarylation, Glutathio-PTMGPT2 and Glutary-PTMGPT2 exhibited improvements of 7.51% and 6.48% over DeepGSH and ProtTrans-Glutar, respectively. In the case of valine (V) amidation and cysteine (C) S-palmitoylation, Ami-PTMGPT2 and Palm-PTMGPT2 surpassed prAS and CapsNet by 4.78% and 1.56%, respectively. Similarly, in the cases of proline (P) hydroxylation, lysine (K) hydroxylation, and lysine (K) formylation, PTMGPT2 achieved superior performance over CapsNet by 11.02%, 7.58%, and 4.39%, respectively. Collectively, these results demonstrate the significant progress made by PTMGPT2 in advancing the precision of PTM site prediction, thereby solidifying its place as a leading tool in proteomics research.

Table 2 Benchmark dataset results

PTMGPT2 captures sequence-label dependencies through an attention-driven interpretable framework

To enable PTMGPT2 to identify critical sequence determinants essential for protein modifications, we designed a framework, depicted in Fig. 2A, that processes protein sequences to extract attention scores from the model's last decoder layer. The attention mechanism is pivotal, as it selectively weighs the importance of different segments of the input sequence during prediction. In particular, the attention scores extracted from the final layer provided a granular view of the model's focus across the input sequence. By aggregating the attention across 20 attention heads (AH) for each position in the sequence, PTMGPT2 revealed which amino acids or motifs the model deemed crucial in relation to the 'POSITIVE' token. The Position-Specific Probability Matrix (PSPM)19, characterized by rows representing sequence positions and columns indicating amino acids, was a key output of this analysis. It sheds light on the proportional representation of each amino acid in the sequences, as weighted by the attention scores. PTMGPT2 thus offers a refined view of the probabilistic distribution of amino acid occurrences, revealing key patterns and preferences in amino acid positioning.

Fig. 2: Attention head analysis of lysine (K) acetylation by PTMGPT2. A Computation of attention scores from the model's last decoder layer, detailing the process of generating a Position-Specific Probability Matrix (PSPM) for a targeted protein sequence. 'SC' denotes attention scores, 'AP' denotes attention profiles, 'AA' represents an amino acid, and 'n' is the number of amino acids in a subsequence. B Sequence motifs validated by experimentally verified studies EV1 (ref. 61), EV2 (ref. 62), EV3 (ref. 63), and EV4 (ref. 47). 'AH' denotes attention head.
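The sketch below illustrates one way such an analysis could be implemented: last-layer attention is averaged over the 20 heads, the attention profile of the final prompt position is read off, and attention-weighted amino-acid occurrences are accumulated into a PSPM. The aggregation details and the one-residue-per-token assumption are our simplifications of the procedure in Fig. 2A, not the authors' exact code.

import numpy as np
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

@torch.no_grad()
def attention_profile(model, tokenizer, subseq: str) -> np.ndarray:
    """Average last-layer attention over the 20 heads and return the
    attention profile of the final prompt position (preceding the label)."""
    prompt = f"<startoftext>SEQUENCE: {subseq} LABEL:"
    out = model(**tokenizer(prompt, return_tensors="pt"), output_attentions=True)
    last = out.attentions[-1][0]          # (heads, seq_len, seq_len)
    return last.mean(dim=0)[-1].numpy()   # attention from the final position

def build_pspm(profiles, subseqs, length=21):
    """Accumulate attention-weighted amino-acid counts, normalized per position."""
    pspm = np.zeros((length, len(AMINO_ACIDS)))
    for prof, seq in zip(profiles, subseqs):
        # Simplification: assumes residue i aligns with token position i,
        # ignoring prefix tokens and BPE merges of several residues.
        for pos, aa in enumerate(seq[:length]):
            pspm[pos, AA_INDEX[aa]] += prof[pos]
    row_sums = pspm.sum(axis=1, keepdims=True)
    return np.divide(pspm, row_sums, out=np.zeros_like(pspm), where=row_sums > 0)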
Motifs K**A*A and C**K were identified in AH 10, while motifs K***K****A, *KH*, and K***K were detected in AH 7. In AH 19, motifs K*C and C*K were observed, and the *GK* motif was found in AH 15. Furthermore, motifs *EK* and KL**ER were identified in AH 5, and motifs H***K, D**K, and *FK* were detected in AH 4. The A**K motif was observed in AH 13, the A*K motif in AH 2, and the *KM* motif in AH 11. To validate the predictions made by PTMGPT2 for lysine (K) acetylation, as shown in Fig. 2B, we compared these with motifs identified in prior research that has undergone experimental validation. Expanding our analysis to protein kinase domains, we visualized motifs for the CMGC and AGC kinase families, as shown in Fig. 3A, B. Additionally, the motifs for the CAMK kinase family and general protein kinases are shown in Fig. 4A, B, respectively. The CMGC kinase family, named after its main members CDKs (cyclin-dependent kinases), MAPKs (mitogen-activated protein kinases), GSKs (glycogen synthase kinases), and CDK-like kinases, is involved in cell cycle regulation, signal transduction, and cellular differentiation20. PTMGPT2 identified the common motif *P*SP* (proline at positions −2 and +1 from the phosphorylated serine residue) in this family. The AGC kinase family, comprising key serine/threonine protein kinases such as PKA (protein kinase A), PKG (protein kinase G), and PKC (protein kinase C), plays a critical role in regulating metabolism, growth, proliferation, and survival21. The predicted common motif in this family was R**SL (arginine at position −3 and leucine at position +1 from the phosphorylated serine or threonine). The CAMK kinase family, which includes key members like CaMK2 and CAMKL, is crucial in signaling pathways related to neurological disorders, cardiac diseases, and other conditions associated with calcium signaling dysregulation22. The common motif identified by PTMGPT2 in CAMK was R**S (arginine at position −3 from the phosphorylated serine or threonine). Further analysis of general protein kinases revealed distinct patterns: DMPK kinase exhibited the motif RR*T (arginine at positions −2 and −3), MAPKAPK kinase followed the R*LS motif (arginine at position −3 and leucine at position −1), AKT kinase was characterized by the R*RS motif (arginine at positions −1 and −3), CK1 kinase showed K*K**S/T (lysine at positions −3 and −5), and CK2 kinase was defined by the SD*E motif (aspartate at position +1 and glutamate at position +3). These comparisons underscored PTMGPT2's ability to accurately identify motifs associated with diverse kinase groups and PTM types. PSPM matrices corresponding to 20 attention heads across all 19 PTM types are detailed in Supplementary Data 1. These insights are crucial for deciphering the intricate mechanisms underlying protein modifications. Consequently, this analysis, driven by the PTMGPT2 model, forms a core component of our exploration into the contextual relationships between protein sequences and their predictive labels.

Fig. 3: Attention head analysis of the CMGC kinase family and the AGC kinase family by PTMGPT2. A Motifs from the CMGC kinase family validated against P1 (ref. 64) and P2 (ref. 65). B AGC kinase family motifs validated against P3 (ref. 66) and P4 (ref. 67). The 'P*SP' motif is common in the CMGC kinase family, whereas the 'R**S' motif is common in the AGC kinase family. 'AH' denotes attention head.

Fig. 4: Attention head analysis of the CAMK kinase family and general protein kinases by PTMGPT2. A CAMK kinase family motifs validated against P5 (ref. 68). B General protein kinase motifs validated against P6 (ref. 69), P7 (ref. 70), and P8 (ref. 71). The 'R**S' motif is common in the CAMK kinase family. 'AH' denotes attention head.
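In the motif shorthand used above, '*' denotes any single residue, so R**S places an arginine three positions N-terminal to the serine. A minimal sketch, assuming this convention, for scanning sequences with such wildcard motifs:

import re

def motif_to_regex(motif: str) -> re.Pattern:
    """'*' matches any single amino acid; other characters match literally."""
    return re.compile(motif.replace("*", "."))

def find_motif(seq: str, motif: str):
    """Yield (start, matched subsequence) for every occurrence, overlaps included."""
    pattern = motif_to_regex(motif)
    for i in range(len(seq) - len(motif) + 1):
        m = pattern.match(seq, i)
        if m:
            yield i, m.group()

# Toy sequence: R**S-style sites (arginine three residues before a serine).
for start, hit in find_motif("MARLLSKGRKASVE", "R**S"):
    print(start, hit)   # -> 2 RLLS, then 8 RKAS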
Recent UniProt entries validate PTMGPT2's robust generalization abilities

To demonstrate PTMGPT2's robust predictive capabilities on unseen datasets, we extracted proteins recently released on UniProt, strictly selecting those added after June 1, 2023, to validate the model's performance. We ensured these proteins were not present in the training or benchmark datasets from DBPTM (version May 2023), a crucial step in the validation process. A total of 31 proteins meeting our criteria were identified, associated with PTMs such as phosphorylation (S, T, Y), methylation (K), and acetylation (K). The accurate prediction of PTMs in recently identified proteins not only validates the effectiveness of our model but also underscores its potential to advance research in protein biology and PTM site identification. These predictions are pivotal for pinpointing the precise locations and characteristics of modifications within protein sequences, which is crucial for verifying PTMGPT2's performance. The predictions for all 31 proteins, along with the ground truth, are detailed in Supplementary Tables S1−S5.

PTMGPT2 identifies mutation hotspots in phosphosites of TP53, BRAF, and RAF1 genes

Protein PTMs play a vital role in regulating protein function. A key aspect of PTMs is their interplay with mutations, particularly near modification sites, where mutations can significantly impact protein function and potentially lead to disease. Previous studies23,24,25 indicate a strong correlation between pathogenic mutations and proximity to phosphoserine sites, with over 70% of PTM-related mutations occurring in phosphorylation regions. Therefore, our study primarily targets phosphoserine sites to provide a more in-depth understanding of PTM-related mutations. This study aims to evaluate PTMGPT2's ability to identify mutations within 1−8 residues flanking a phosphoserine site, without explicit mutation site annotations during training. For this, we utilized the dbSNP database26, which includes information on human single nucleotide variations linked to both common and clinical mutations. TP53 (ref. 27) is a critical tumor suppressor gene, with mutations in TP53 being among the most prevalent in human cancers. When mutated, TP53 may lose its tumor-suppressing function, leading to uncontrolled cell proliferation. BRAF (ref. 28) is involved in intracellular signaling critical for cell growth and division. BRAF mutations, especially the V600E mutation, are associated with various cancers such as melanoma, thyroid cancer, and colorectal cancer. RAF1 (ref. 29) plays a role in the RAS/MAPK signaling pathway. While RAF1 mutations are less common in cancers compared to BRAF, abnormalities in RAF1 can contribute to oncogenesis and genetic disorders like Noonan syndrome, characterized by developmental abnormalities.

PTMGPT2's analysis of the TP53 gene revealed a complex pattern of phosphosite mutations, depicted in Fig. 5A, including G374, K370, and H368, across multiple cancer types25. This is validated by dbSNP data, indicating that 21 of the top 28 mutations with the highest number of adjacent modifications occur in the tumor suppressor protein TP53. The RAF1 gene, a serine/threonine kinase, exhibits numerous mutations, many of which are associated with disrupted MAPK activity due to altered recognition and regulation of PTMs. In our analysis of RAF1 S259 phosphorylation, PTMGPT2 precisely identified mutations directly on S259 and in adjacent hotspots at residues S257, T258, and P261, depicted in Fig. 5B. These findings are consistent with genetic studies29,30 linking RAF1 mutations near S259 to Noonan and LEOPARD syndromes. Furthermore, in BRAF, another serine/threonine kinase, PTMGPT2's analysis of the S602 phosphorylation site revealed mutations in flanking residues (positions 1−7) such as D594N, L597Q, V600E, V600G, and K601E (ref. 23), shown in Fig. 5C. These mutations, particularly those activating BRAF functions, are found in over 60% of melanomas28. Heatmap plots and line plots for the remaining genes in dbSNP, and a bar chart depicting the genes selected for analysis, are provided in Supplementary Figs. S1−S19. These results demonstrate PTMGPT2's proficiency not only in predicting PTM sites but also in identifying potential mutation hotspots around these sites.
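The flanking-window analysis described above can be sketched as a simple positional scan; the positions and helper below are illustrative, assuming 1-based residue numbering as in UniProt and a window of up to 8 residues around each predicted phosphoserine.

FLANK = 8  # window of 1-8 flanking residues, as in the analysis above

def mutations_near_phosphosites(phosphosites, mutation_positions, flank=FLANK):
    """Map each phosphosite to mutated positions within the +/-flank window
    (the site itself included, since hits directly on S259 were also found)."""
    hits = {}
    for site in phosphosites:
        near = sorted(p for p in mutation_positions if abs(p - site) <= flank)
        if near:
            hits[site] = near
    return hits

# Toy positions loosely modeled on the RAF1 S259 analysis above.
print(mutations_near_phosphosites([259], [257, 258, 259, 261, 400]))
# -> {259: [257, 258, 259, 261]}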
Fig. 5: PTMGPT2 analysis of mutation distribution around PTM sites. Heatmaps and corresponding line plots illustrate the probability and impact of mutations within the recognition sites of phosphoserines across the TP53, RAF1, and BRAF genes. The heatmaps' X-axes display the wild-type sequence, while the Y-axes represent the 20 standard amino acids; yellow indicates the presence of mutations and darker shades indicate their absence. A For the TP53 S371 phosphoserine, PTMGPT2 predicts mutations predominantly in the 1−2 flanking residues. The line plot shows the average effect of these mutations by position, and the sequence plot reveals predicted mutation hotspots directly on S371 and 1−2 residues from S371 across multiple human diseases. B Analysis of the RAF1 S259 phosphoserine, showing a concentrated mutation effect at S259 and its immediate vicinity. C For the BRAF S602 phosphoserine, PTMGPT2 identifies a broader distribution of mutations within the 1−7 flanking residues, with the line plot indicating significant mutation impacts at positions close to S602. In the sequence plots, 'M' represents a mutation and 'PS' indicates a phosphorylated serine residue. Source data are provided as a Source Data file.
