PepNet: an interpretable neural network for anti-inflammatory and antimicrobial peptides prediction using a pre-trained protein language model

Overview of the PepNet framework

PepNet takes a peptide sequence as input and outputs the probability that the peptide has anti-inflammatory or antimicrobial activity. Its main framework comprises four parts (see Fig. 1): (1) extracting diverse peptide features, (2) bi-channel feature encoding via the residual dilated convolution block, (3) residue representation learning by the residual Transformer block, and (4) peptide-wise binary prediction generation.

For a given peptide sequence, PepNet extracts the one-hot encoding of the amino acid types, physicochemical properties, and high-dimensional embedding features derived from the pre-trained protein language model [34], resulting in a feature matrix X of shape L × D, where L is the fixed peptide sequence length and D is the dimension of the extracted features. Peptides shorter than L are zero-padded; peptides longer than L are truncated. The physicochemical properties comprise eight amino acid indices and six specific amino acid properties. The original features, i.e., the one-hot encoding and the physicochemical properties, are encoded by the residual dilated convolution block, which captures multi-order neighbor information for each amino acid in the sequence. Inspired by the TCN block [35], we construct three dilated convolution layers that progressively expand the receptive field and capture information from increasingly spaced sequence neighbors. Subsequently, the encoded original features, together with the features derived from the pre-trained protein language model, are fed into a residual Transformer block to capture global sequence information. The Transformer encoder and decoder modules extract information from all positions in the peptide sequence and model the dependencies between different positions. Finally, the learned sequence features pass through an average pooling layer to produce the peptide representation, which is fed into a multilayer perceptron (MLP) to classify peptide activity.

Comparison with other leading predictors

In this section, we compare the performance of PepNet with other state-of-the-art AMP and AIP predictors. The AMP prediction models included in the comparison are AMPlify [25], AMP Scanner Vr.1 (RF Precision) [14], AMP Scanner Vr.1 (Earth precision) [14], AMP Scanner Vr.2 [14], AMP Scanner Vr.2 (retrained with our data) [14], TriNet (retrained with our data) [28], and AMP-BERT [29]. The AIP prediction models compared are AIPStack [15], PPTPP (class feature) [16], PPTPP (probability feature) [16], PPTPP (fusion feature) [16], AIP_MDL [30], and TriNet (retrained with our data) [28]. Furthermore, the Fast version of PepNet, which does not use pre-trained features (detailed in the section "Utilization of PepNet via an online web server"), is also compared. Five commonly used evaluation metrics are applied to evaluate the models: Matthews correlation coefficient (MCC), accuracy (ACC), precision, recall, and F1-score. A detailed description and formula for each metric can be found in Supplementary Note 1.
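For reference, the five metrics follow their standard confusion-matrix definitions (the exact formulas used in this work are given in Supplementary Note 1). The following is a minimal sketch, assuming binary labels encoded as 0/1; it is a generic illustration rather than the authors' evaluation code:

```python
import numpy as np

def binary_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Standard confusion-matrix metrics for binary labels (1 = AMP/AIP, 0 = negative)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    mcc_denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"ACC": acc, "precision": precision, "recall": recall, "F1": f1, "MCC": mcc}
```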
Performance comparison on AMP prediction

The results of the performance comparison on the AMP test set are presented in Table 1 and Fig. 2. PepNet exhibits the best performance on the AMP test set, outperforming all other compared methods: its accuracy, recall, precision, F1-score, and MCC are 0.950, 0.954, 0.947, 0.951, and 0.901, respectively. The improvements achieved by PepNet over the other methods are 3.3–22.3%, 4.4–21.2%, 1.6–29.7%, 3.5–21.3%, and 7.3–60.3% in terms of accuracy, recall, precision, F1-score, and MCC, respectively. In particular, the recall, F1-score, and MCC of PepNet exceed those of the second-best model by 4.4%, 3.5%, and 7.3%, respectively. Moreover, PepNet is the only model with all five evaluation metrics above 0.9, indicating its strong ability to accurately identify AMPs.

Table 1 Performance of the models on the AMP test set

Fig. 2: Performance comparison on identifying AMPs. This figure displays the performance of PepNet and the other compared methods on the AMP test set; the performance of PepNet is shown in red on the far right.

Performance comparison on AIP prediction

The performance of the compared methods on the AIP test set is presented in Table 2 and Fig. 3, which show that PepNet again performs best among all evaluated methods. In detail, the accuracy, recall, precision, F1-score, and MCC of PepNet are 0.819, 0.940, 0.705, 0.806, and 0.666, respectively. Relative to the other compared methods, the improvements of PepNet in accuracy, recall, F1-score, and MCC are 8.2–30.6%, 28.6–889.5%, 14.0–374.1%, and 33.2–276.3%, respectively; its recall, F1-score, and MCC exceed those of the second-best model by 28.6%, 14.0%, and 33.2%. Although the precision of PepNet is slightly lower than that of PPTPP (fusion feature) and AIP_MDL, its other metrics are considerably higher, so PepNet achieves the best overall performance. Furthermore, given that recall and precision trade off against each other, PepNet attains notably higher recall than the other algorithms at the cost of only a small decrease in precision, indicating that it is more sensitive in identifying true positives.

Table 2 Performance comparison on the AIP test set

Fig. 3: Performance comparison on identifying AIPs. This figure displays the performance of PepNet and the other compared methods on the AIP test set; the performance of PepNet is shown in red on the far right.

In summary, PepNet demonstrates significantly better performance than state-of-the-art methods on both AMP and AIP identification. The high F1-score and MCC values indicate that PepNet achieves a high standard of accuracy and consistency in peptide function prediction.

Robustness and generalization ability of PepNet

In addition to the AMP and AIP datasets above, we added five AMP datasets with different activities (antibacterial, antifungal, antiviral, anticancer, and anti-mammalian cells) collected from iAMPCN [36] and three AIP datasets (aip_data1, aip_data2, and aip_data3) collected from AIPStack [15], BertAIP [37], and IF-AIP [38] to compare the performance of PepNet with other predictors (see Supplementary Tables 1–8) and to demonstrate its robustness and generalization ability. Because most of these datasets are unbalanced, ACC is dominated by the majority class and says little about a method's real performance (for example, predicting every peptide as negative on a set with a 9:1 negative-to-positive ratio already yields an ACC of 0.9); we therefore exclude it from this comparison.
As shown in Supplementary Tables 1–8, PepNet consistently achieves the best overall performance on all of the added datasets. Because the proportion of positive and negative samples varies across datasets, we also observe that PepNet performs better on datasets with more balanced classes. In particular, on the unbalanced antifungal and anti-mammalian-cells datasets, AMP-BERT [29] fails to learn the attributes of the positive samples and predicts all peptides as non-AMPs.

Considering that many antimicrobial or anti-inflammatory peptides can be toxic, we added a toxicity-prediction function by training the model on a toxic peptide dataset collected from ATSE [39] and compared it with ClanTox [40], ToxinPred-RF [41], ToxinPred-SVM [41], ATSE [39], and two ATSE variants (Only-GNN and Only-CNN_BiLSTM). As shown in Supplementary Table 9 and Supplementary Fig. 1, PepNet again shows the best overall performance.

Ablation studies on PepNet

PepNet employs sequence-based feature extraction and advanced deep learning techniques to improve the accuracy and robustness of AMP and AIP prediction. In this section, we analyze each constituent of the model to elucidate its role and validate its contribution, focusing on the components responsible for feature extraction and sequence information processing. Through a series of ablation experiments, we systematically investigate the impact of altering individual model components or hyperparameters, modifying only one component or hyperparameter at a time. Following the architecture of the PepNet framework, the ablation experiments examine the impact of feature selection, the contribution of the residual dilated convolution block, the contribution of the residual Transformer block, and the strategy for pooling amino acid features into a peptide feature.

Impact of feature selection on PepNet

To investigate the contribution of each feature type, we trained and tested PepNet after removing the one-hot features, the physicochemical property features, or the pre-trained features, respectively. As illustrated in Fig. 4 (see detailed results in Supplementary Tables 10 and 11), excluding each of the three feature types clearly degrades the performance of PepNet, reducing the F1-score and MCC by 1.9–4.5% and 3.7–8.7% on the AMP test set, and by 7.7–10.2% and 17.7–21.5% on the AIP test set. The pre-trained features, obtained from a protein language model trained on large-scale protein datasets, make the largest contribution on both test sets: removing them decreases the F1-score by 4.5% and 10.2% and the MCC by 8.7% and 21.5% on the AMP and AIP test sets, respectively, compared to using all features. Notably, the physicochemical properties also contribute substantially on the AIP test set, where removing them decreases the F1-score by 9.9% and the MCC by 21.2%.
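To make the residue-level feature channels concrete, the following is a minimal sketch of assembling an L × D feature matrix from one-hot amino acid types plus per-residue physicochemical descriptors, with zero-padding or truncation to a fixed length. The fixed length and the two descriptors shown are illustrative placeholders, not PepNet's actual configuration (PepNet uses eight amino acid indices and six additional properties), and the pre-trained language-model channel is omitted here:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Illustrative per-residue descriptors: Kyte-Doolittle hydropathy and a crude side-chain charge.
HYDROPATHY = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
              "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
              "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3}
CHARGE = {aa: (1.0 if aa in "KR" else -1.0 if aa in "DE" else 0.0) for aa in AMINO_ACIDS}

def featurize(seq: str, max_len: int = 50) -> np.ndarray:
    """Return a (max_len, 22) matrix: 20-dim one-hot + 2 physicochemical columns,
    zero-padded or truncated to max_len residues."""
    feats = np.zeros((max_len, len(AMINO_ACIDS) + 2), dtype=np.float32)
    for pos, aa in enumerate(seq[:max_len]):       # truncate sequences longer than max_len
        if aa in AA_INDEX:
            feats[pos, AA_INDEX[aa]] = 1.0         # one-hot amino acid type
        feats[pos, 20] = HYDROPATHY.get(aa, 0.0)   # hydropathy descriptor
        feats[pos, 21] = CHARGE.get(aa, 0.0)       # crude charge descriptor
    return feats                                    # positions beyond len(seq) stay zero (padding)

X = featurize("GIGKFLHSAKKFGKAFVGEIMNS")  # magainin-2-like example peptide
```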
Fig. 4: Results of the ablation experiments. This figure displays the performance of PepNet under different ablation experiments in terms of accuracy (A), recall (B), precision (C), F1-score (D), and MCC (E) on the AMP and AIP test sets. In each panel, A1–A4 denote the feature ablation experiments using the amino acid type one-hot encoding, the amino acid physicochemical properties, the pre-trained features derived from the large protein language model, and their combination; B1–B5 denote the residual dilated convolution block ablations obtained by removing the residual dilated convolution block, removing the residual connection within the block, and substituting the dilated convolution layers with Bi-LSTM, LSTM, and GRU layers, respectively, while B6 denotes the residual dilated convolution block used by PepNet; C1–C6 denote the residual Transformer block ablations obtained by removing the residual Transformer block, removing the residual connection within the block, and using 1–4 Transformer layers in the block; D1 and D2 denote the maximum and average pooling strategies.

Contribution of the residual dilated convolution block to PepNet

The residual dilated convolution block captures spaced-neighbor information in peptide sequences and is a key component for understanding the distribution of amino acids along a sequence. To explore its impact on PepNet, we conducted experiments that alter its architecture as follows: (1) removing the residual dilated convolution block entirely, (2) removing the residual connection within the block, and (3) substituting the dilated convolution layers with Bi-LSTM, LSTM, or GRU layers. As illustrated in Fig. 4 (see detailed results in Supplementary Tables 10 and 11), excluding the residual dilated convolution block has a large impact on PepNet, reducing the F1-score and MCC by 4.9% and 9.2% on the AMP test set and by 10.5% and 24.2% on the AIP test set. Removing only the residual connection decreases the F1-score and MCC by 1.7% and 3.0% on the AMP test set and by 6.3% and 14.4% on the AIP test set. When the dilated convolution is replaced with Bi-LSTM, LSTM, or GRU layers, performance drops by 3.9–4.0% and 7.1–8.1% in F1-score and MCC on the AMP test set and by 7.0–8.8% and 16.4–19.8% on the AIP test set, indicating that the dilated convolution layers effectively capture spaced-neighbor information across the whole sequence, which is important for identifying AMPs and AIPs.
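As an illustration of this component, the following is a minimal PyTorch sketch of a residual block with stacked dilated 1-D convolutions whose dilation doubles at each layer, so the receptive field reaches progressively more distant sequence neighbors. The channel size, kernel size, and layer count are illustrative assumptions, not PepNet's actual hyperparameters:

```python
import torch
import torch.nn as nn

class ResidualDilatedConvBlock(nn.Module):
    """Sketch of a TCN-style residual dilated convolution block over residue features.

    Input/output shape: (batch, length, channels). Dilations 1, 2, 4 let the three
    layers see increasingly spaced neighbors; a residual connection adds the block
    input back to its output.
    """
    def __init__(self, channels: int = 64, kernel_size: int = 3, n_layers: int = 3):
        super().__init__()
        layers = []
        for i in range(n_layers):
            dilation = 2 ** i
            layers += [
                nn.Conv1d(channels, channels, kernel_size,
                          padding=dilation * (kernel_size - 1) // 2, dilation=dilation),
                nn.ReLU(),
            ]
        self.conv = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x.transpose(1, 2)            # (batch, channels, length) for Conv1d
        h = self.conv(h).transpose(1, 2)
        return x + h                     # residual connection around the whole block

block = ResidualDilatedConvBlock(channels=64)
out = block(torch.randn(8, 50, 64))      # e.g., 8 peptides, 50 residues, 64-dim features
```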
Contribution of the residual Transformer block to PepNet

The residual Transformer block attends to key positional amino acid information while capturing comprehensive positional details across the whole peptide sequence. To investigate its influence on PepNet, we conducted experiments that exclude the residual Transformer block, remove the residual connection within the block, or vary the number of Transformer layers in the block. As shown in Fig. 4 (see detailed results in Supplementary Tables 10 and 11), removing the residual Transformer block notably degrades PepNet's performance, particularly on the AIP test set, where the F1-score and MCC decline by up to 21.7% and 54.8%, respectively. Notably, when varying the number of Transformer layers, we found that the performance of PepNet declines as the number of layers increases. This may be attributed to the increased model complexity of deeper architectures, which can lead to overfitting or to insufficient training data to support deeper networks. Although this effect is less pronounced on the larger AMP training set than on the AIP training set, it remains present. These findings indicate that when designing Transformer-based models, the number of layers must be carefully calibrated to the specific task and data characteristics to avoid unnecessary complexity and performance loss.

Pooling operations of amino acid features

The pooling operation downscales the residue-level features and extracts the crucial ones, a pivotal step in generating the final sequence representation. Maximum pooling concentrates on the most salient signals, whereas average pooling offers a more comprehensive summary of the sequence features. To evaluate the impact of different pooling strategies, we replaced average pooling with maximum pooling. As shown in Fig. 4 (see detailed results in Supplementary Tables 10 and 11), maximum pooling reduces the F1-score by 2.6% and the MCC by 5.5% on the AMP test set, and reduces the F1-score by 11.2% and the MCC by 24.5% on the AIP test set. This indicates that focusing solely on the maximum feature values over all amino acids fails to adequately characterize the entire peptide sequence and causes substantial information loss.

Interpretability of the PepNet model

To better understand what PepNet learns when detecting antimicrobial and anti-inflammatory activities, we examine the model from several angles: What do the pre-trained features contribute to the classification? What does the residual dilated convolution block learn? What does the residual Transformer block learn? We apply t-SNE, a machine learning algorithm commonly used for dimensionality reduction and visualization of high-dimensional data, to visualize the feature representations learned by PepNet. By projecting the learned peptide representations into a reduced-dimensional space, it is easy to assess the similarity among positive or negative samples and the degree to which the two classes are distinguishable. Moreover, we probe the interpretability of PepNet by exploring whether it perceives the cationic and amphiphilic properties of AMPs. The visualization results are presented in Fig. 5 and Supplementary Fig. 2.
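Projections of this kind can be produced with any standard t-SNE implementation; the following is a minimal sketch using scikit-learn on a hypothetical matrix of learned peptide representations (the array contents and perplexity value are illustrative, not the authors' settings):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical inputs: one learned representation vector per peptide, plus binary labels.
representations = np.random.rand(200, 128)      # placeholder for PepNet's learned features
labels = np.random.randint(0, 2, size=200)      # 1 = positive (AMP/AIP), 0 = negative

# Project the high-dimensional representations to 2D for visualization.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(representations)

plt.scatter(coords[labels == 1, 0], coords[labels == 1, 1], c="red", s=8, label="positive")
plt.scatter(coords[labels == 0, 0], coords[labels == 0, 1], c="blue", s=8, label="negative")
plt.legend()
plt.title("t-SNE projection of learned peptide representations")
plt.show()
```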
Fig. 5: Visualization of each learning state of PepNet on the AMP test set. This figure displays the 2D t-SNE projections of the original features (A), the pre-trained features (B), the features processed by the residual dilated convolution block (C), and the input (D) and output (E) features of the residual Transformer block.

The original and the pre-trained features influence the learning process in different manners

The original features, consisting of the one-hot encoding of the amino acid types and the physicochemical properties, are strongly related to the activity of a peptide, while the pre-trained features derived from a large protein language model are richer, more informative, and more generalized. According to the 2D t-SNE projections of the original and pre-trained features on the AMP test set (Fig. 5A, B), the separation of AMPs (red) and non-AMPs (blue) is clearer under the pre-trained features than under the original features, indicating the important role of the pre-trained features in identifying AMPs.

The residual dilated convolution block learns spaced-neighbor information

The residual dilated convolution block starts from the original features as input and outputs spaced-neighbor information for the peptide sequence. Comparing the t-SNE scatter plots of the input and output features of the residual dilated convolution block (Fig. 5A, C), the boundary between positive and negative samples is blurred for the unprocessed original features, whereas the clustering of the samples improves after the residual dilated convolution block, although the boundary between the two categories is still not completely clear. The reduction in category overlap before and after the block indicates that it captures key characteristics of the original features by aggregating spaced-neighbor information, which is effective for AMP detection.

The residual Transformer block learns the global information in the peptide

The residual Transformer block takes the output of the residual dilated convolution block and the embedding of the pre-trained features as input, and outputs global information about the peptide sequence. Comparing the t-SNE scatter plots of the input and output features of the residual Transformer block (Fig. 5D, E), a clear cluster separation of positive and negative samples within the AMP test set is evident. This clear distinction suggests that the representations learned by the residual Transformer block differ between positive and negative samples, and that the block significantly improves the recognition of AMPs by capturing comprehensive positional information across the peptide sequence.

The feature visualization analysis of each state of PepNet on the AIP test set reveals a similar pattern (Supplementary Fig. 2). The t-SNE visualization results on the two test sets collectively substantiate the significance and influence of the various components of PepNet. This multi-stage feature visualization analysis not only deepens our understanding of the working mechanism of the PepNet model but also points the way toward further model optimization and application.

PepNet perceives cationic and amphiphilic properties in AMPs

Cationic amphiphilic sequences often adopt an α-helical structure in hydrophobic environments such as cell membranes. These sequences are crucial components of many biologically active peptides because of their ability to interact with and disrupt biological membranes, making them valuable in antimicrobial therapies. In this study, we use the net charge at pH 7.0 and the grand average of hydropathy (GRAVY) to measure the cationic and amphiphilic properties of peptides. The charge and GRAVY distributions in the AMP and AIP training sets are shown in Fig. 6A. Since PepNet is a data-driven deep learning model, it learns the distributions of the training sets in order to predict those of the test sets; therefore, whether these characteristics are perceived by the model depends on their distributions in the training sets.
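Both descriptors can be computed directly from a peptide sequence. The following is a minimal sketch using Biopython's ProtParam module; the example peptide is illustrative, and this is not necessarily the authors' exact computation, which may differ in its pKa set or treatment of termini:

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def cationic_amphiphilic_descriptors(seq: str):
    """Return (net charge at pH 7.0, GRAVY) for a peptide sequence.

    GRAVY is the mean Kyte-Doolittle hydropathy over all residues; a positive
    net charge at pH 7.0 indicates a cationic peptide.
    """
    analysis = ProteinAnalysis(seq)
    return analysis.charge_at_pH(7.0), analysis.gravy()

# Illustrative example: a magainin-2-like cationic peptide.
charge, gravy = cationic_amphiphilic_descriptors("GIGKFLHSAKKFGKAFVGEIMNS")
print(f"charge at pH 7.0 = {charge:.2f}, GRAVY = {gravy:.2f}")
```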
Based on the AMP and AIP datasets used in this study, we illustrate the true distributions of the cationic and amphiphilic properties in the AMP and AIP test sets (Fig. 6B) and their distributions as predicted by PepNet (Fig. 6C). Antimicrobial peptides contain more cationic amphiphilic sequences than non-antimicrobial peptides, especially cationic sequences, whereas this does not hold for the AIP dataset. In addition, PepNet accurately reproduces the distributions of positive and negative samples for both charge and GRAVY. Moreover, we visualize the final representation learned by PepNet using t-SNE, with points colored by charge and GRAVY scores (Fig. 6D), which clearly shows that PepNet perceives the cationic properties of peptides in the AMP dataset. Because the GRAVY distributions of AMPs and non-AMPs differ only slightly, it is hard to judge from the t-SNE visualization whether PepNet perceives the amphiphilic properties, which is consistent with the small difference observed in the true distributions. Moreover, Fig. 6D shows that the differences in both the charge and GRAVY distributions between AMPs and non-AMPs are much larger than those between AIPs and non-AIPs, indicating that both the cationic and amphiphilic properties contribute more to antimicrobial than to anti-inflammatory activity.

Fig. 6: Boxplots of the charge and GRAVY distributions in the AMP and AIP data. A The true distributions of positive and negative samples in the AMP and AIP training sets. B The true distributions of positive and negative samples in the AMP and AIP test sets. C The distributions of positive and negative samples in the AMP and AIP test sets as predicted by PepNet. D The final representation learned by PepNet, with points colored by the charge and GRAVY scores of positive and negative samples.

Utilization of PepNet via an online web server

For the convenience of users, we developed a user-friendly web server for online prediction of peptide sequences as antimicrobial peptides (AMPs) or anti-inflammatory peptides (AIPs). Based on different user requirements, we provide two running modes: a Fast and a Standard version of the interface (Fig. 7A); their performance is reported in Tables 1 and 2. The Fast version uses a trained model that does not rely on pre-trained features, enabling quick predictions: users simply upload a FASTA file containing multiple peptide sequences, select the desired model type (AMP or AIP), and obtain predictions promptly. Conversely, owing to our limited computing resources, the Standard version requires users to submit a FASTA file together with the corresponding pre-trained feature file, generated in HDF5 format (via h5py) by ProtT5-XL-U50. Upon submission, the application generates a prediction result page (Fig. 7B) on which users can view the outcome for each submitted peptide, including the peptide sequence, the predicted score, and the classification result; the result file can also be downloaded for further analysis. This web server provides a convenient and efficient platform for researchers to predict the antimicrobial or anti-inflammatory activity of peptide sequences. In addition, we provide an online web server for toxicity prediction.
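For the Standard version, per-residue embeddings can be generated with the publicly available ProtT5-XL-U50 encoder and stored in an HDF5 file. The sketch below follows the standard ProtTrans usage of the Hugging Face transformers library; the dataset naming inside the HDF5 file is an assumption for illustration, and the layout actually expected by the server should be taken from its documentation:

```python
import h5py
import torch
from transformers import T5EncoderModel, T5Tokenizer

MODEL_NAME = "Rostlab/prot_t5_xl_uniref50"  # ProtT5-XL-U50
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME, do_lower_case=False)
model = T5EncoderModel.from_pretrained(MODEL_NAME).eval()

sequences = {"pep1": "GIGKFLHSAKKFGKAFVGEIMNS"}  # illustrative peptide

with h5py.File("pretrained_features.h5", "w") as h5_file:
    for seq_id, seq in sequences.items():
        # ProtT5 expects space-separated residues; map rare amino acids to X.
        spaced = " ".join(seq.replace("U", "X").replace("Z", "X").replace("O", "X").replace("B", "X"))
        encoded = tokenizer(spaced, return_tensors="pt")
        with torch.no_grad():
            embedding = model(**encoded).last_hidden_state[0]    # (length + 1, 1024)
        embedding = embedding[: len(seq)]                         # drop the trailing special token
        h5_file.create_dataset(seq_id, data=embedding.cpu().numpy())  # assumed: one dataset per peptide ID
```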
Fig. 7: Web server of PepNet. A The interface of the online web server of PepNet. B The result interface displays a table containing the peptide sequence, the predicted probability, and the binary classification result; rows highlighted in red indicate peptides predicted as positive. Users can also download the result file from the top of the table.
