A deep learning model for anti-inflammatory peptides identification based on deep variational autoencoder and contrastive learning

Data preparation

Data collection

A good dataset provides rich and diverse samples for a deep learning model, allowing it to learn more generalizable features and thereby improve its performance. In this study, we use the same dataset as IF-AIP36, which collects data from iAIPs25 and AntiInflam17, yielding two benchmark datasets. We merge the two benchmark datasets into a single dataset comprising 1948 positive samples and 2896 negative samples. The CD-HIT37 tool is then applied to remove redundancy (identity threshold 0.9), resulting in a benchmark dataset of 1450 positive samples and 2339 negative samples. In addition, independent datasets from AntiInflam17 and iAIPs25 are collected, consisting of 173 positive and 252 negative samples, and 420 positive and 629 negative samples, respectively. The final model is tested on these two independent datasets. Detailed information about the datasets is given in Table 1.

Table 1 Details of the benchmark dataset and independent datasets.

Data augmentation

In deep learning, a sufficient amount of training data is often crucial for good performance, and insufficient training data is a common problem. One effective remedy is data augmentation, which increases the amount of data available for training. In this study, we apply three data augmentation methods to the training data: random interception, sequence reversal, and noise addition. These methods increase the volume of training data while reducing overfitting and improving the robustness of the model. Random interception randomly selects a starting position and length within a peptide sequence and extracts the corresponding subsequence as a new sample; this increases data diversity and lets the model learn features from different lengths and positions. Sequence reversal simply reverses the order of the peptide sequence, introducing new sequence patterns that help the model better understand specific motifs and structures within the peptide chain. Noise addition randomly replaces some amino acids within the sequence, which improves robustness to noise and variation, reduces the risk of overfitting, and increases data diversity and coverage. It is important to note that data augmentation does not change a sample's class label: positive samples remain positive after augmentation. The combination of these three methods can substantially improve the model's generalization ability in peptide sequence analysis, enabling it to handle peptides of different lengths, structures, and noise levels. A minimal sketch of the three augmentations is given below.
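To make the augmentation step concrete, the following is a minimal Python sketch of the three operations applied to a peptide string. The function names, the minimum fragment length, and the 10% replacement rate are illustrative assumptions, not the authors' exact settings.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def random_interception(seq: str, min_len: int = 5) -> str:
    """Extract a subsequence with a random start position and length."""
    length = random.randint(min(min_len, len(seq)), len(seq))
    start = random.randint(0, len(seq) - length)
    return seq[start:start + length]

def sequence_reversal(seq: str) -> str:
    """Reverse the order of the peptide sequence."""
    return seq[::-1]

def noise_addition(seq: str, rate: float = 0.1) -> str:
    """Randomly replace a fraction of residues with other amino acids."""
    residues = list(seq)
    for i in range(len(residues)):
        if random.random() < rate:
            residues[i] = random.choice(AMINO_ACIDS.replace(residues[i], ""))
    return "".join(residues)

# Augmented samples keep the class label of the original sequence.
peptide, label = "AYCDWKKLLG", 1
augmented = [
    (random_interception(peptide), label),
    (sequence_reversal(peptide), label),
    (noise_addition(peptide), label),
]
print(augmented)
```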
Model overview

DAC-AIPs is constructed mainly from a deep variational autoencoder (VAE) and contrastive learning. In the sequence encoding part, we extend traditional one-hot encoding with multi-hot features to better capture sequential information. The deep VAE consists of two parts, an encoder and a decoder: the encoder comprises convolutional layers, max-pooling layers, and linear layers, while the decoder consists of linear layers, upsampling layers, and transposed convolutional layers. The VAE maps input data to a latent space, samples data points from that space, and reconstructs them through the decoder. The latent features output by the encoder are fed into the output layer for classification. In parallel, latent features extracted from samples of different classes are used to compute a contrastive loss. The goal of contrastive learning is to reduce the distance between samples with the same label and increase the distance between samples with different labels by minimizing this loss, thereby improving classification performance. The framework of the complete model is illustrated in Fig. 1.

Figure 1 The architecture of DAC-AIPs.

Sequence encoding

One-hot encoding is widely used for feature representation in machine learning38,39,40. It converts peptide sequences into numerical vectors that models can accept through binary encoding. For a peptide sequence of length L, each amino acid is represented by a 20-dimensional binary vector; the 20 dimensions correspond to the 20 amino acid types, with only the element for that amino acid set to 1 and all others set to 0. However, one-hot encoding suffers from sparsity and, more importantly, each amino acid vector in the encoded sequence is independent of the others, so the encoding cannot capture the sequence's order and positional information. We therefore additionally employ multi-hot encoding. Multi-hot first divides the sequence into overlapping "words" using a sliding window of size k and then encodes each word: in the encoding vector of a word, all positions corresponding to the amino acids it contains are set to 1. For example, for the sequence R = 'AYCDW' with k = 3, the multi-hot encoding is shown in Formulas (1)-(3).

$$ R_{k=3} = [\text{'AYC'},\ \text{'YCD'},\ \text{'CDW'}] $$
(1)
$$ V_{one\_hot} = [[1,0,\ldots,0],\;[0,\ldots,0,1],\;[0,1,\ldots,0],\;[0,0,1,\ldots,0],\;[0,\ldots,1,0]] $$
(2)
$$ V_{multi\_hot(k=3)} = [[1,1,0,\ldots,0,1],\;[0,1,1,\ldots,0,1],\;[0,1,1,\ldots,1,0]] $$
(3)
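The encodings in Formulas (1)-(3) can be reproduced with a short sketch such as the one below; the fixed alphabet ordering and the helper names are our own illustrative choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # fixed ordering of the 20 amino acids
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """One 20-dimensional binary vector per residue (k = 1)."""
    return [[1 if i == INDEX[aa] else 0 for i in range(20)] for aa in seq]

def multi_hot(seq, k):
    """Slide a window of size k over the sequence and mark every residue in each word."""
    vectors = []
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        vec = [0] * 20
        for aa in word:
            vec[INDEX[aa]] = 1
        vectors.append(vec)
    return vectors

# Example from Formulas (1)-(3): R = 'AYCDW', k = 3 gives the words 'AYC', 'YCD', 'CDW'.
R = "AYCDW"
print(one_hot(R))        # corresponds to Formula (2)
print(multi_hot(R, 3))   # corresponds to Formula (3)
```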
By including the neighboring amino acids in each word, the encoding vector captures positional information and enhances the representation of the original sequence. In this study, we set k in the multi-hot encoding to 1 (equivalent to one-hot encoding), 2, and 3, and concatenate the resulting encoding vectors to form the final sequence encoding feature.

Deep variational autoencoder

Encoder-decoder

An autoencoder41,42 is a type of neural network based on unsupervised learning that aims to reconstruct input samples after compressing them through learned parameters. The transformation from the input layer to the intermediate layer is called encoding, and the transformation from the intermediate layer to the output layer is called decoding. Typically, an autoencoder first obtains a compressed vector through encoding and then reconstructs the input through decoding. In this study, the encoder consists of two convolutional layers, two max-pooling layers, and two linear layers in sequence. The last linear layer outputs the parameters of the latent distribution (mean μ and variance σ), from which a latent vector is sampled and passed to the decoder. The decoder consists of two linear layers, two upsampling layers, and two transposed convolutional layers. Finally, the decoder outputs the reconstruction vector, and the reconstruction loss is computed against the input vector (a minimal code sketch of this encoder-decoder stack is given after Formula (9)). An intuitive illustration of the model structure is shown in Fig. 1C.

Autoencoders can learn compact representations of data, which can serve as a basis for feature learning and assist subsequent supervised learning tasks. Through self-supervised training, an autoencoder derives a latent feature encoding from the original features, achieving automated feature engineering together with dimensionality reduction and generalization.

Variation

Standard autoencoders do not constrain the structure of the latent space, so the learned representations may be discontinuous, which can lead to generated data that is incoherent or inconsistent with the data distribution. The variational autoencoder (VAE)43 uses variational inference to constrain the distribution of the latent space, making the learned latent representations more continuous and aiding the generation of coherent data.

In a VAE, the latent variable produced by the encoder is assumed to follow a Gaussian distribution; the encoder outputs the mean (μ) and variance (σ) of a multi-dimensional Gaussian. A feature z is then sampled from this distribution and decoded, with the aim of reconstructing an output similar to the original input. The structure of the latent space is regularized by constraining the distribution returned by the encoder to approximate a standard Gaussian.

Specifically, variational inference considers the following Bayesian problem: given observed variables \(x \in {\mathbb{R}}^{k}\) and latent variables \(z \in {\mathbb{R}}^{d}\), their joint probability distribution is given by Formula (4)44.

$$ P(z,x) = P(z)P(x|z) $$
(4)
The posterior distribution \(P(z|x)\) can be represented by Formula (5).

$$ P(z|x) = \frac{P(x|z)P(z)}{P(x)} = \frac{P(x|z)P(z)}{\int_{z} P(x|z)P(z)\,dz} $$
(5)
Assume a variational distribution \(Q(z)\) comes from the distribution family Q, and minimize the Kullback–Leibler (KL) divergence to make it closer to the posterior distribution \(P(z|x)\).$$ Q^{*} = argmin_{Q(z) \in Q} KL(Q(z)||P(z|x)) $$
(6)
Substituting Formula (5) into Formula (6) and applying Bayes' rule and the properties of the KL divergence, we obtain

$$ KL(Q(z)||P(z|x)) = E_{Q(z)} [\log Q(z) - \log P(x|z) - \log P(z)] + \log P(x) $$
(7)
As \(\log P(x)\) is constant, we obtain Formula (8) by keeping only the terms to be optimized.

$$ Q^{*} = argmin \; E_{Q(z)} [-\log P(x|z)] + KL(Q(z)||P(z)) $$
(8)
At this point, for the encoder part, assuming \(P(z)\) follows a standard Gaussian distribution \({\mathcal{N}}(0,1)\), we aim to fit the distribution \(Q(z) = {\mathcal{N}}(\mu, \sigma)\) as closely as possible to \(P(z) = {\mathcal{N}}(0,1)\), which means the KL term in Formula (8) should be minimized. The parameters μ and σ are produced by the last layer of the encoder, and the KL divergence is computed from μ and σ.

$$ KL(Q(z)||P(z)) = -\frac{1}{2} \times [\log \sigma^{2} + 1 - \sigma^{2} - \mu^{2}] $$
(9)
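As referenced earlier, the following is a minimal PyTorch sketch of an encoder-decoder VAE of the kind described above: two convolutional and max-pooling layers followed by linear layers that output μ and log σ², reparameterized sampling of z, a mirrored decoder with upsampling and transposed convolutions, and the KL term of Formula (9). All layer sizes, channel counts, and the use of log σ² rather than σ are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """Illustrative 1-D convolutional VAE; sizes are assumptions, not the paper's settings."""

    def __init__(self, in_channels=20, seq_len=64, latent_dim=32):
        super().__init__()
        # Encoder: two conv + two max-pool layers, then linear layers producing mu and log-variance.
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Flatten(),
        )
        flat = 32 * (seq_len // 4)
        self.fc_mu = nn.Linear(flat, latent_dim)
        self.fc_logvar = nn.Linear(flat, latent_dim)
        # Decoder: linear layer, then upsampling and transposed convolutions mirroring the encoder.
        self.fc_dec = nn.Linear(latent_dim, flat)
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (32, seq_len // 4)),
            nn.Upsample(scale_factor=2),
            nn.ConvTranspose1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.ConvTranspose1d(64, in_channels, kernel_size=3, padding=1),
            nn.Sigmoid(),  # reconstruction values in [0, 1]
        )

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * epsilon, with epsilon ~ N(0, 1)
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        x_rec = self.decoder(self.fc_dec(z))
        return x_rec, mu, logvar, z

def kl_term(mu, logvar):
    # Formula (9): KL(N(mu, sigma^2) || N(0, 1)) = -0.5 * (log sigma^2 + 1 - sigma^2 - mu^2)
    return (-0.5 * (logvar + 1 - logvar.exp() - mu.pow(2))).sum(dim=1).mean()

# Usage with dummy input: a batch of 8 encoded peptides, 20 channels, length 64 (illustrative shapes).
model = ConvVAE()
x = torch.rand(8, 20, 64)
x_rec, mu, logvar, z = model(x)
print(x_rec.shape, kl_term(mu, logvar))
```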
For the decoder part, assuming \(P(x|z)\) follows a Bernoulli(p) distribution, then

$$ argmin(-\log P(x|z)) = argmin(-x\log p - (1 - x)\log(1 - p)) $$
(10)
which is exactly the cross-entropy loss. Therefore, the KL divergence in Formula (9) and the cross-entropy (reconstruction) loss in Formula (10) jointly form the loss function of the variational autoencoder, where they are assigned different weights depending on the training scenario.

$$ Loss_{VAE} = Loss_{Rec} + KL = \alpha \times Cross\_entropy(x, x_{Rec}) + \beta \times KL(Q(z)||P(z)) $$
(11)
where x_Rec denotes the reconstruction vector.

Contrastive learning

Contrastive learning allows models to extract meaningful representations from unlabeled data. By leveraging similarities and dissimilarities, it enables models to map similar instances close together in the latent space while separating dissimilar instances45. This approach has proven effective in domains such as computer vision, natural language processing (NLP), and reinforcement learning.

The fundamental idea of contrastive learning is to learn compact representations of data by maximizing the similarity between similar samples and minimizing the similarity between dissimilar samples. In this study, anchor samples (the samples under test), positive samples (samples of the same class as the anchor), and negative samples (samples of a different class from the anchor) are input into the deep VAE simultaneously for training. The contrastive loss is computed from the latent features extracted by the encoder for the three samples. We use the triplet loss as the contrastive loss: by comparing feature distances, it reduces the feature distance between samples of the same class while enlarging the distance between samples of different classes. In addition, a margin is set as the minimum required distance between samples of different classes, preventing the features from collapsing into a small region of the latent space. The specific calculation is given in Formula (12).

$$ Loss_{CL} = \max \{ 0,\, d(f_{a}, f_{p}) - d(f_{a}, f_{n}) + margin \} $$
(12)
where f_a, f_p, and f_n represent the latent features of the anchor, positive, and negative samples, respectively, and d denotes the Euclidean distance.

Output layer

The latent features extracted by the encoder are passed to the final output layer, which comprises two fully connected layers, and predicted probabilities are obtained with the softmax function. The classification loss (Loss_Cla), together with the losses described above, forms the complete loss function. During training, the parameters are continuously optimized to minimize this loss and improve the model's performance.

$$ Loss = \alpha \times Loss_{Rec} + \beta \times KL + \gamma \times Loss_{CL} + Loss_{Cla} $$
(13)
where α, β, and γ represent the weights assigned to the corresponding losses. A schematic sketch of how these losses are combined is given at the end of this section.

Performance evaluation metrics

To evaluate the performance of the model, we use five commonly used evaluation metrics: Accuracy (ACC), Sensitivity (Sn), Specificity (Sp), Area Under the Curve (AUC), and Matthews Correlation Coefficient (MCC)46,47,48,49,50. They respectively measure the proportion of correctly classified samples, the ability to correctly identify positives, the ability to correctly identify negatives, the performance of the model across classification thresholds, and the overall performance of the model. Except for MCC, their values range from 0 to 1, with higher values indicating better performance. MCC ranges from -1 to +1, where +1 indicates perfect prediction, 0 indicates random prediction, and -1 indicates completely contradictory prediction. The formulas for all metrics are given in Formula (14).

$$ \left\{ \begin{aligned} ACC &= \frac{TP + TN}{TP + TN + FP + FN} \\ Sn &= \frac{TP}{TP + FN} \\ Sp &= \frac{TN}{TN + FP} \\ AUC &= \frac{\sum\limits_{i}^{n_{pos}} rank_{i} - \frac{n_{pos}(n_{pos} + 1)}{2}}{n_{pos}\, n_{neg}} \\ MCC &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FN)(TN + FP)}} \end{aligned} \right. $$
(14)
where TP, TN, FP, and FN correspondingly denote true positives, true negatives, false positives, and false negatives.
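A direct transcription of Formula (14) into Python is sketched below, assuming binary labels and predicted positive-class probabilities; the rank-based AUC term uses the ranks of the positive samples among all scores (ties are not handled).

```python
import numpy as np

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute ACC, Sn, Sp, AUC, and MCC from labels and positive-class probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fn) * (tn + fp))
    )

    # Rank-based AUC as in Formula (14): rank 1 is assigned to the lowest score.
    n_pos, n_neg = np.sum(y_true == 1), np.sum(y_true == 0)
    ranks = np.argsort(np.argsort(y_prob)) + 1
    auc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    return {"ACC": acc, "Sn": sn, "Sp": sp, "AUC": auc, "MCC": mcc}

print(evaluate([1, 0, 1, 1, 0, 0], [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]))
```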

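Finally, as referenced above, the sketch below shows one way the loss terms of Formulas (9)-(13) could be combined during training. The weight values, the latent dimension, and the use of PyTorch's built-in TripletMarginLoss with Euclidean distance are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative weights for the reconstruction, KL, and contrastive terms (Formula 13).
alpha, beta, gamma = 1.0, 0.5, 1.0

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)  # Formula (12), Euclidean distance
# Output layer: two fully connected layers; softmax is applied implicitly via cross-entropy.
classifier = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))

def total_loss(x, x_rec, mu, logvar, z_anchor, z_pos, z_neg, labels):
    """Weighted sum of reconstruction, KL, contrastive, and classification losses."""
    loss_rec = F.binary_cross_entropy(x_rec, x)                                  # Formula (10)
    loss_kl = (-0.5 * (logvar + 1 - logvar.exp() - mu.pow(2))).sum(1).mean()     # Formula (9)
    loss_cl = triplet_loss(z_anchor, z_pos, z_neg)                               # Formula (12)
    loss_cla = F.cross_entropy(classifier(z_anchor), labels)                     # classification
    return alpha * loss_rec + beta * loss_kl + gamma * loss_cl + loss_cla        # Formula (13)

# Dummy tensors with assumed shapes (batch of 8, latent dimension 32).
x, x_rec = torch.rand(8, 20, 64), torch.rand(8, 20, 64)
mu, logvar = torch.zeros(8, 32), torch.zeros(8, 32)
z_a, z_p, z_n = torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 32)
labels = torch.randint(0, 2, (8,))
print(total_loss(x, x_rec, mu, logvar, z_a, z_p, z_n, labels))
```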