HemoFuse: multi-feature fusion based on multi-head cross-attention for identification of hemolytic peptides

Datasets

To facilitate comparison with existing models, we directly use the datasets provided in the benchmark paper, in which all positive samples are therapeutic peptides with hemolytic activity33. As shown in Fig. 1, there are four common datasets and one integrated dataset, and Table 1 lists the specific size of each dataset. Datasets 1 and 2 are from the Hemolytik and DBAASP v.2 databases; both were collected and collated by Chaudhary et al.25 and contain 1014 and 1623 sequences, respectively. Dataset 3 was collected by Patrick et al.28 from the Hemolytik and DBAASP databases and contains 3738 samples. Dataset 4, the dataset used in EnDL-HemoLyt32, is derived from the DBAASP v3 database and contains 4339 samples. The integrated dataset is composed of these four datasets, with CD-HIT34 applied at a threshold of 0.7 to remove redundancy, resulting in 1993 sequences. These datasets are imbalanced, which better mimics the real-world situation in which positive and negative samples are not equal, so we do not artificially balance them. In addition, each dataset is randomly divided into a training set and an independent test set at a ratio of 8:2.

Table 1. Details of the datasets.

Figures 2 and 3 show the sequence length distribution and amino acid composition of the positive and negative samples from the training sets of datasets 1-4, respectively. As shown in Fig. 2, the lengths of positive and negative samples in dataset 1 are concentrated between 5 and 31 amino acids. In dataset 2, the lengths of positive and negative samples are roughly concentrated between 6 and 38 amino acids, with a few much longer sequences. The sequence length distribution is more uniform in datasets 3 and 4: dataset 3 has sequences ranging from 7 to 35 amino acids, while dataset 4 ranges from 6 to 50 amino acids. We use the kpLogo tool35 to analyze the amino acid composition preferences of positive and negative samples, as shown in Fig. 3. In dataset 1, amino acids K and L are more abundant in positive samples, with no significant enrichment in negative samples. In dataset 2, amino acid K is more enriched in positive samples, whereas the negative samples are more varied. In dataset 3, amino acids K and A are over-represented in positive samples, and amino acids K, R, and L are over-represented in negative samples. In dataset 4, both positive and negative samples have higher frequencies of amino acids K, R, and L. These results indicate that using only peptide sequence composition for feature extraction is insufficient. Therefore, we incorporate word embedding techniques into the feature extraction module to uncover the underlying patterns in the sequences.

Fig. 2. The distribution plot of the sequence lengths.

Fig. 3. The plot of the amino acid composition.

Architecture of HemoFuse

As shown in Fig. 1, our model can be decomposed into three sub-modules: a feature extraction and alignment module, a feature cross-fusion module, and a classification module. We use a combination of advanced word embedding features and traditional hand-crafted features to represent peptide sequences. Token embedding, position embedding, and a transformer encoder together form a lightweight language model. BLOSUM62, DDE, DPC, and CKSAAP cover the evolutionary and compositional information of peptide sequences. Hand-crafted feature methods are specifically formulated from the large number of available peptide sequences and still have great potential for representing peptides.
Bi-GRU can mine deeper contextual features and transform the hand-crafted features to the appropriate dimension to match the embedding features. We then use multi-head cross-attention to complete the deep fusion of the different features; it excels at capturing the semantic relationship between two related but different sequences, and to our knowledge this is its first application to the identification of hemolytic peptides. The classification module mainly consists of a CNN and an MLP.

Feature extraction and alignment

Word embedding feature

Word embedding technology follows the distributional semantics hypothesis, which uses the context around each word to express its semantic information36. In general, words with similar contexts have similar semantic meanings. Compared with hand-crafted features, its biggest advantage is that its parameters can be continuously optimized throughout the training process, making it better suited to the data at hand. Popular word embedding models include Word2Vec, GloVe, BERT, etc.

In this paper, we adopt token embedding and position embedding to initially represent the shallow information of protein sequences, and then use a transformer encoder to capture the bidirectional relationships between amino acids in the sequence more thoroughly37. The architecture resembles a simplified BERT. Token embedding is a vector representation of the amino acid itself, and position embedding encodes the position of the amino acid into a feature vector, much as an RNN or LSTM provides the positional information of a sequence. The dimension of both embeddings is 128. They are combined and fed into the transformer encoder layer. As shown in Fig. 1, the transformer encoder is composed of multi-head attention, layer normalization, a feed-forward layer, and layer normalization in turn, with two residual connections. This structure does not have many hidden layers, which improves computational efficiency. The self-attention mechanism re-encodes the target feature using the correlations between the target feature and the other features, so that the new features contain more interaction information regardless of their distance in the sequence. The number of self-attention heads in this module is 4.
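To make the structure concrete, the following is a minimal PyTorch sketch of such an embedding module, not the authors' implementation: the embedding dimension (128) and the four attention heads follow the description above, while the vocabulary size (20 amino acids plus a padding token), maximum length, and feed-forward width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PeptideEmbedder(nn.Module):
    """Token + position embedding followed by a single transformer encoder layer."""
    def __init__(self, vocab_size=21, max_len=50, d_model=128, n_heads=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # multi-head attention + add & norm + feed-forward + add & norm, two residual connections
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=256, batch_first=True)

    def forward(self, tokens):                 # tokens: (batch, seq_len) integer amino acid codes
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(positions)   # combine the two embeddings
        return self.encoder(x)                 # (batch, seq_len, 128)
```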
Hand-crafted feature

BLOSUM62 is an amino acid substitution scoring matrix for protein sequence comparison, derived from conserved blocks of aligned protein sequences38. It reflects the evolutionary information of a protein sequence. The score is essentially the logarithm of the ratio of the likelihoods that two amino acids are homologous versus non-homologous, and the formula is as follows:

$$s(a,b)=\frac{1}{\lambda }\log \frac{p_{ab}}{f_{a}f_{b}}$$

(1)

where \(p_{ab}\) is the frequency with which residues a and b occur together in known homologous sequences, assuming that a and b are homologous; \(f_{a}\) and \(f_{b}\) are the frequencies of residues a and b occurring in either sequence, assuming that a and b are not homologous; and \(\lambda\) is a scaling parameter. If residues a and b are homologous, then \(p_{ab}>f_{a}f_{b}\) and the score is positive; if they are not homologous, then \(p_{ab}<f_{a}f_{b}\) and the score is negative. In this way, the similarity between every pair of amino acids can be calculated. The BLOSUM62 matrix represents each amino acid as a 20-dimensional feature vector and a protein sequence as an L×20 feature matrix, where L is the sequence length.
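A minimal sketch of this encoding is given below; it loads the substitution matrix from Biopython purely for convenience, which is an assumption rather than part of the original pipeline.

```python
import numpy as np
from Bio.Align import substitution_matrices  # Biopython, used here only to avoid hardcoding the matrix

BLOSUM62 = substitution_matrices.load("BLOSUM62")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def blosum62_encode(sequence: str) -> np.ndarray:
    """Represent a peptide as an L x 20 matrix of BLOSUM62 substitution scores."""
    return np.array([[BLOSUM62[res, aa] for aa in AMINO_ACIDS]
                     for res in sequence], dtype=float)

# e.g. blosum62_encode("GIGKFLHSAK").shape == (10, 20)
```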
DDE derives from the difference in dipeptide composition between epitopes and non-epitopes and uses this to indicate the extent to which dipeptide frequencies deviate from their expected means39. It is able to analyze the composition and distribution of amino acids in peptide sequences. The feature vector is constructed from three parameters: the dipeptide composition measure (\(D_{c}\)), the theoretical mean (\(T_{m}\)), and the theoretical variance (\(T_{v}\)). \(T_{m}\) does not depend on a specific peptide sequence, so \(T_{m(i)}\) of each of the 400 dipeptides is calculated first:

$$T_{m(i)}=\frac{C_{i1}}{C_{L-1}} \times \frac{C_{i2}}{C_{L-1}}$$

(2)

where \(C_{i1}\) and \(C_{i2}\) are the numbers of codons encoding the first and second amino acids of dipeptide i, respectively, and \(C_{L-1}\) is the total number of possible codons excluding stop codons. Given a peptide sequence of length L, \(D_{c(i)}\) and \(T_{v(i)}\) of dipeptide i are calculated as follows:

$$D_{c(i)}=\frac{n_{i}}{L-1}$$

(3)

$$T_{v(i)}=\frac{T_{m(i)}\left(1-T_{m(i)}\right)}{L-1}$$

(4)

where \(n_{i}\) is the number of occurrences of dipeptide i and \(L-1\) is the number of dipeptides in the sequence. The DDE of dipeptide i can then be expressed as:

$$DDE_{(i)}=\frac{D_{c(i)}-T_{m(i)}}{\sqrt{T_{v(i)}}}$$

(5)

Finally, the peptide sequence can be represented as a 400-dimensional vector:

$$DDE=\left\{DDE_{(1)}, \ldots ,DDE_{(400)}\right\}$$

(6)
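The following sketch implements Eqs. (2)-(6); the per-amino-acid codon counts come from the standard genetic code (61 sense codons in total), and the implementation details are illustrative rather than the authors' own code.

```python
import itertools
import numpy as np

# Number of codons per amino acid in the standard genetic code (61 sense codons in total)
CODONS = {'A': 4, 'C': 2, 'D': 2, 'E': 2, 'F': 2, 'G': 4, 'H': 2, 'I': 3, 'K': 2, 'L': 6,
          'M': 1, 'N': 2, 'P': 4, 'Q': 2, 'R': 6, 'S': 6, 'T': 4, 'V': 4, 'W': 1, 'Y': 2}
DIPEPTIDES = [a + b for a, b in itertools.product(CODONS, repeat=2)]  # 400 dipeptides
TOTAL_CODONS = sum(CODONS.values())                                   # 61

def dde(sequence: str) -> np.ndarray:
    """400-dimensional DDE descriptor following Eqs. (2)-(6)."""
    n = len(sequence) - 1                         # number of dipeptides in the sequence
    counts = {dp: 0 for dp in DIPEPTIDES}
    for i in range(n):
        counts[sequence[i:i + 2]] += 1
    feats = []
    for dp in DIPEPTIDES:
        tm = (CODONS[dp[0]] / TOTAL_CODONS) * (CODONS[dp[1]] / TOTAL_CODONS)  # Eq. (2)
        dc = counts[dp] / n                                                    # Eq. (3)
        tv = tm * (1 - tm) / n                                                 # Eq. (4)
        feats.append((dc - tm) / np.sqrt(tv))                                  # Eq. (5)
    return np.array(feats)                                                     # Eq. (6)
```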
DPC is a common feature representation method based on amino acid composition information, which provides detailed information about the arrangement of amino acids in peptide sequences40. It represents a protein sequence by counting the frequency of occurrence of every dipeptide in the sequence. The formula is as follows:

$$f_{i}=\frac{n_{i}}{N}$$

(7)

where \(n_{i}\) and \(f_{i}\) are the number and frequency of occurrences of dipeptide i in the sequence, respectively, and N is the total number of dipeptides in the sequence. There are 400 possible dipeptides in total, so each protein sequence can be transformed into a 400-dimensional fixed-length feature vector.

CKSAAP converts a protein sequence into a feature vector using the composition of k-spaced residue pairs in the sequence41, which is of great significance for understanding the function and structure of proteins. Given a protein sequence of length L and a value of k, every two residues separated by a distance of k are extracted and treated as a residue pair, so a total of \(L-k-1\) residue pairs can be extracted. The frequency of occurrence of each of the 400 possible residue pairs is then counted, resulting in a 400-dimensional feature vector:

$$\left(\frac{L_{AA}}{L-k-1},\frac{L_{AC}}{L-k-1}, \cdots ,\frac{L_{YY}}{L-k-1}\right)_{400}$$

(8)

where \(L_{AA}\), \(L_{AC}\), and \(L_{YY}\) are the numbers of occurrences of the corresponding residue pairs, and k can be set to 0, 1, 2, 3, 4, or 5.
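A compact sketch of both descriptors is given below; note that DPC coincides with CKSAAP at k = 0, and concatenating the six k values would give a 2400-dimensional vector (a common choice, not necessarily the one used here).

```python
import itertools
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=2)]  # 400 residue pairs

def dpc(sequence: str) -> np.ndarray:
    """400-dimensional dipeptide composition (Eq. 7)."""
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1
    return np.array(list(counts.values())) / (len(sequence) - 1)

def cksaap(sequence: str, k: int) -> np.ndarray:
    """400-dimensional composition of k-spaced amino acid pairs (Eq. 8)."""
    counts = dict.fromkeys(PAIRS, 0)
    n_pairs = len(sequence) - k - 1
    for i in range(n_pairs):
        counts[sequence[i] + sequence[i + k + 1]] += 1
    return np.array(list(counts.values())) / n_pairs
```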
Bi-GRU can extract sequential information from proteins and remove redundancy from the hand-crafted features42. Both LSTM and GRU are common methods for processing long sequences; the advantage of GRU is that it uses only two gates, which reduces the number of parameters by nearly a third and effectively avoids overfitting. Bi-GRU does not change the original internal structure of GRU but simply applies the model twice in opposite directions. This ensures that both forward and backward sequence features are captured, resulting in richer features. Notably, it can also perform the task of feature alignment. We set the number of neurons in the hidden layer to 64, resulting in a 128-dimensional feature vector, the same dimension as the feature vector output by the word embedding module.
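A minimal PyTorch sketch of this alignment step is shown below; it assumes the hand-crafted features have been arranged as a sequence of feature vectors (for example, the L×20 BLOSUM62 matrix), and the input dimension is left as a parameter.

```python
import torch
import torch.nn as nn

class HandCraftedAligner(nn.Module):
    """Bidirectional GRU that projects hand-crafted features to 128 dimensions (2 x 64)."""
    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.bigru = nn.GRU(input_size=in_dim, hidden_size=hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, x):          # x: (batch, steps, in_dim) hand-crafted feature sequence
        out, _ = self.bigru(x)     # out: (batch, steps, 128) = concatenated forward/backward states
        return out
```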
Feature cross fusion

Word embedding features and hand-crafted features capture a great deal of information about the peptide sequence under different rules, so the next step is to consider how to make full use of these features to judge hemolytic activity43,44,45. Each type of feature carries its own information of interest, and their mutual relationship would be ignored if they were simply concatenated. In view of this, we adopt a multi-head cross-attention mechanism to deeply fuse the word embedding features and the hand-crafted features; this mechanism is often used to handle multi-modal features46,47. The fused feature contains the interactive information of the two inputs. Compared with the self-attention mechanism, the input of the cross-attention mechanism has two parts: the “where” feature and the “what” feature. The “where” feature acts as the Query (Q), and the “what” feature is used to generate the Key and Value (K, V)48. In this study, we choose the word embedding features \(X_{emb}\) as the “where” feature and the hand-crafted features \(X_{hand}\) as the “what” feature. The specific operations are as follows:

$$Q=W_{q}X_{emb},\quad K=W_{k}X_{hand},\quad V=W_{v}X_{hand}$$

(9)

where \(W_{q}\), \(W_{k}\), and \(W_{v}\) are learnable parameter matrices. Q and K are used to calculate the correlation between each pair of elements of the two inputs to obtain the attention weights, which are then used to update the feature vectors:

$$C_{n}\left(X_{emb},X_{hand}\right)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{D/h}}\right)V$$

(10)
where D and h are the embedding dimension and the number of heads, respectively. Finally, the fused features are obtained by combining the outputs of the multiple attention heads:

$$Cross\text{-}Attention=\mathrm{Concat}\left[C_{1}, \cdots ,C_{N}\right]W_{c}$$
(11)
where N is the number of attention heads and \(W_{c}\) is a learnable weight matrix. In this process, the cross-attention mechanism continually updates the fused features, which are built from the hand-crafted features and enriched with the information of the word embedding features.
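A minimal PyTorch sketch of this fusion step is shown below; it relies on nn.MultiheadAttention, whose internal projections play the roles of \(W_{q}\), \(W_{k}\), \(W_{v}\), and \(W_{c}\) in Eqs. (9)-(11). The number of cross-attention heads (4 here) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Multi-head cross-attention: word-embedding features query the hand-crafted features."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Internally applies W_q, W_k, W_v (Eq. 9), scaled dot-product attention per head (Eq. 10),
        # and an output projection after concatenating the heads (Eq. 11).
        self.attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads,
                                          batch_first=True)

    def forward(self, x_emb, x_hand):      # both: (batch, seq_len, d_model)
        fused, _ = self.attn(query=x_emb, key=x_hand, value=x_hand)
        return fused                        # (batch, seq_len, d_model)
```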
Classification

The final classifier consists of a CNN layer and four linear layers, and each layer is followed by a batch normalization layer and a dropout layer to prevent the model from overfitting. This structure gradually reduces the dimension of the feature vector to avoid information loss.
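Since the exact layer widths are not specified here, the following sketch only illustrates the stated structure (one convolutional layer and four linear layers, each followed by batch normalization and dropout); all sizes are assumptions.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """One convolutional layer plus four linear layers, each with batch norm and dropout."""
    def __init__(self, d_model: int = 128, seq_len: int = 50, p: float = 0.3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(p))
        dims = [64 * seq_len, 512, 128, 32, 1]           # illustrative layer widths
        layers = []
        for i in range(3):
            layers += [nn.Linear(dims[i], dims[i + 1]),
                       nn.BatchNorm1d(dims[i + 1]), nn.ReLU(), nn.Dropout(p)]
        layers += [nn.Linear(dims[3], dims[4])]          # final hemolytic / non-hemolytic score
        self.mlp = nn.Sequential(*layers)

    def forward(self, fused):                 # fused: (batch, seq_len, d_model)
        x = self.conv(fused.transpose(1, 2))  # Conv1d expects (batch, channels, seq_len)
        return self.mlp(x.flatten(1))         # one logit per peptide
```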
Model evaluation

In this study, we train the model on each of the five datasets and evaluate its performance on the corresponding independent test sets using the following seven evaluation metrics: accuracy (ACC), specificity (SP), sensitivity (SN), Matthews correlation coefficient (MCC), F1 score, area under the ROC curve (AUC), and average precision score (AP)33. The formulas are as follows:

$$\begin{gathered} ACC=\frac{TP+TN}{TP+FP+FN+TN} \hfill \\ SP=\frac{TN}{FP+TN} \hfill \\ SN=\frac{TP}{TP+FN} \hfill \\ MCC=\frac{TP \times TN-FP \times FN}{\sqrt{\left(TP+FN\right)\left(TN+FP\right)\left(TP+FP\right)\left(TN+FN\right)}} \hfill \\ F1=\frac{2 \times TP}{2 \times TP+FP+FN} \hfill \\ \end{gathered}$$

(12)

where TP and TN denote correctly predicted positive and negative samples, while FP and FN denote negative and positive samples that are incorrectly predicted, respectively. ACC is the proportion of samples that are correctly predicted and reflects the overall accuracy of the model. SP and SN measure the ability of the model to correctly predict negative and positive samples, respectively. MCC combines true positives, true negatives, false positives, and false negatives, and is a more balanced index. F1 score is the harmonic mean of precision and recall; when F1 is high, both precision and recall are high. The ROC curve reflects the relationship between the true positive rate and the false positive rate of the model; the closer the curve is to the top-left corner, the better the model. ROC curves can be compared quantitatively by calculating AUC values. The precision-recall curve focuses on positive samples and is more suitable for imbalanced datasets; AP is the area under the PR curve.
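These metrics can be computed with scikit-learn as in the sketch below, where specificity is derived from the confusion matrix and a 0.5 decision threshold is assumed.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, matthews_corrcoef, f1_score,
                             roc_auc_score, average_precision_score, confusion_matrix)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute ACC, SP, SN, MCC, F1, AUC, and AP from labels and predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "SP": tn / (tn + fp),                       # specificity
        "SN": recall_score(y_true, y_pred),         # sensitivity / recall
        "MCC": matthews_corrcoef(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
        "AP": average_precision_score(y_true, y_prob),
    }
```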
