CBIL-VHPLI: a model for predicting viral-host protein-lncRNA interactions based on machine learning and transfer learning

This study aimed to develop a machine learning model for predicting protein–RNA interactions. To achieve this goal, numerous data processing and modeling steps were carried out. The overall design flow of the model is shown in Fig. 1.

Figure 1. The workflow of the CBIL-VHPLI model. First, a pretrained CBIL-VHPLI model was developed using the RPI18072 dataset to predict multispecies protein–lncRNA interactions. Then, the vhRPI286 dataset was used to fine-tune the model, enabling predictions of viral–host protein–lncRNA interactions.

Dataset sources

Five datasets were used in this paper: the RPI18072 dataset, used to pretrain the CBIL-VHPLI model (abbreviated as Pre_CBIL-VHPLI); the RPI2241, RPI1807 and RPI488 datasets, used for fivefold cross-validation (5CV) of Pre_CBIL-VHPLI; and the vhRPI286 dataset, used to fine-tune Pre_CBIL-VHPLI to obtain CBIL-VHPLI. An overview of these five datasets is given in Table 1.

Table 1. Numeric description of the datasets.

The RPI18072 dataset was used for model pretraining. Long human noncoding RNA–protein interaction data were extracted from the NPInter v2.0 database [21] and the lncRNome database [22]. The corresponding lncRNA sequences were retrieved from the NONCODE v3.0 database, and the corresponding protein sequences were retrieved from the UniProt database. Entries shared by the two databases were removed. Because our model is based only on lncRNA and protein sequences, and high sequence similarity can bias machine learning results, we removed redundant sequences from all datasets (except the publicly available RPI2241, RPI1807 and RPI488 datasets) using the CD-HIT package [23] with a threshold of 0.9. The data downloaded from the NPInter v2.0 database contained 2780 positive samples (interacting pairs), of which 2725 pairs remained after removing redundant and incomplete entries. The data downloaded from the lncRNome database were processed and filtered in the same way, yielding 8110 positive samples. The lncRNA–protein interaction data from the two databases were then combined and again subjected to redundancy removal, resulting in 9036 positive samples. Finally, negative samples were generated by random pairing. To keep the data balanced, the number of negative samples generated by random pairing was set equal to the number of positive samples. In this way, we constructed the nonredundant RPI18072 dataset as the pretraining dataset, and 20% of these data were held out as a test set for evaluation before and after pretraining. Three datasets, RPI2241 [24], RPI1807 [25] and RPI488 [26], were obtained from published papers and used for 5CV after pretraining. These three lncRNA–protein interaction datasets are more reliable because they were built by calculating atomic distances between RNAs and proteins in RNA–protein complexes from the Protein Data Bank (PDB) [27]. We obtained viral protein–lncRNA interaction data from the RAID v2.0 [28], RNAInter v4.0 [29] and VirBase v3.0 [30] databases and augmented the fine-tuning dataset with negative pairs generated by the same random pairing method to balance the numbers of positive and negative samples (a minimal sketch of this negative-pair generation is given below). The fine-tuning dataset (vhRPI286) comprised 87 protein sequences and 123 lncRNA sequences. Overall, our data consisted of 30,142 samples (15,270 positive and 14,872 negative), covered a wide range of species and of lncRNA and protein types, and provided a diverse and comprehensive basis for the subsequent analyses.
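As described above, negative (noninteracting) pairs were generated by randomly pairing proteins and lncRNAs that do not appear together in the positive set. The snippet below is a minimal Python sketch of this idea; the file name, column names and random seed are illustrative assumptions rather than the exact pipeline used in the study.

```python
import random
import pandas as pd

# Illustrative input: a table of known interacting (positive) pairs.
# The file name and column names here are assumptions for the sketch.
positives = pd.read_csv("positive_pairs.csv")          # columns: protein_id, lncrna_id
positive_set = set(zip(positives["protein_id"], positives["lncrna_id"]))

proteins = positives["protein_id"].unique().tolist()
lncrnas = positives["lncrna_id"].unique().tolist()

random.seed(42)
negatives = set()
# Randomly pair proteins and lncRNAs until the negative set is as large as
# the positive set, skipping any pair already known to interact.
while len(negatives) < len(positive_set):
    pair = (random.choice(proteins), random.choice(lncrnas))
    if pair not in positive_set:
        negatives.add(pair)

negative_df = pd.DataFrame(sorted(negatives), columns=["protein_id", "lncrna_id"])
negative_df["label"] = 0
positives["label"] = 1
dataset = pd.concat([positives, negative_df], ignore_index=True)
```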
The zero-padding and cropping tricks

Since the CNN model requires fixed-length inputs and lncRNA and protein sequences vary greatly in length, we converted all sequences into fixed-length sequences. Two tricks, zero-padding and cropping, are commonly used for this purpose. The zero-padding trick [31] generates fixed-length sequences by appending zero values at one or both ends of each sequence until all sequences are as long as the longest sequence in the training and test datasets. The cropping trick involves cutting a long sequence down to a fixed length [32]. Because some lncRNA and protein sequences were very long or very short, we set fixed maximum lengths for the lncRNA and protein sequences separately. In this study, we set lncRNA_max_length = 2500 nt and protein_max_length = 1000 aa because approximately 90% of the lncRNAs were ≤ 2500 nt long and approximately 90% of the proteins were ≤ 1000 aa long. After setting these maximum length thresholds, we converted the sequences into fixed-length sequences. The encoded sequences were processed with the pad_sequences function, which transforms variable-length sequences into fixed-length feature matrices that serve as inputs for the subsequent feature extraction and model learning steps. When an lncRNA or protein sequence was longer than the fixed length, it was truncated; when it was shorter, it was padded with zeros to the fixed length. Finally, the padded sequences were passed to the model for training or testing (see the sketch below).
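The following is a minimal sketch of this padding/truncation step using the Keras pad_sequences utility; the toy sequences and the choice of padding/truncating side are assumptions, since the text does not specify them.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

LNC_MAX_LEN = 2500   # maximum lncRNA length (nt) used in this study
PRO_MAX_LEN = 1000   # maximum protein length (aa) used in this study

# Toy integer-encoded sequences of different lengths (illustrative only).
lnc_encoded = [[1, 3, 2, 4, 1], [2, 2, 3]]
pro_encoded = [[5, 1, 7], [6, 6, 2, 1, 3, 4]]

# Sequences longer than maxlen are truncated; shorter ones are padded with 0.
lnc_fixed = pad_sequences(lnc_encoded, maxlen=LNC_MAX_LEN, padding="post", truncating="post")
pro_fixed = pad_sequences(pro_encoded, maxlen=PRO_MAX_LEN, padding="post", truncating="post")

print(lnc_fixed.shape)  # (2, 2500)
print(pro_fixed.shape)  # (2, 1000)
```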
One-hot encoding method for encoding lncRNA and protein sequences

One-hot encoding is currently a very common approach for representing RNA and protein sequences. Recent studies have shown that considering the dependencies between neighboring nucleotides or amino acids can improve predictor performance. Here, we therefore adopted high-order one-hot encoding to transform the lncRNA and protein sequences into matrices. For a given lncRNA sequence \(S={N}_{1}{N}_{2}\cdots {N}_{{L}_{lnc}}\) with \({L}_{lnc}\) nucleotides, the one-hot encoding matrix R of this sequence can be formulated as [33]:

$$R\left(i,j\right)=\left\{\begin{array}{ll}1 & \text{if }{N}_{j}\cdots {N}_{j+k-1}\text{ is the }i\text{th element of the }{4}^{k}\ k\text{-mer nucleotides}\\ 0 & \text{otherwise}\end{array}\right.,\quad j\in \left[1,2,\cdots ,{L}_{lnc}\right]$$

(1)
where \(i\in \left[1,2,\cdots ,{4}^{k}\right]\) is the index of the k-mer nucleotides and \(k\) denotes the order of the k-mer. In this paper, we set \(k=4\), so each lncRNA sequence was converted into a \({4}^{k}\times max{L}_{lnc}\) numerical matrix.

A protein sequence is composed of 20 kinds of amino acids. For the protein k-mer encoding, the 20 amino acids were divided into seven groups according to their dipole moments and side-chain volumes [34]. Each group is represented by a symbol \({F}_{f}\) (\(f=1,2,\cdots ,7\)): \({F}_{1}=\{A,G,V\}\), \({F}_{2}=\{I,L,F,P\}\), \({F}_{3}=\{Y,M,T,S\}\), \({F}_{4}=\{H,N,Q,W\}\), \({F}_{5}=\{R,K\}\), \({F}_{6}=\{D,E\}\), and \({F}_{7}=\{C\}\). Thus, a protein sequence over the 20 amino acid symbols can be reduced to a sequence over 7 symbols, i.e., \(Q={p}_{1}{p}_{2}\cdots {p}_{l}\cdots {p}_{L}\), \({p}_{l}\in \{{F}_{1},{F}_{2},\cdots ,{F}_{7}\}\). The one-hot encoding matrix P of a protein sequence \(Q\) can then be formulated as:

$$P\left(i,j\right)=\left\{\begin{array}{ll}1 & \text{if }{p}_{j}\cdots {p}_{j+h-1}\text{ is the }i\text{th element of the }{7}^{h}\text{-mer }{F}_{f}\text{ symbols}\\ 0 & \text{otherwise}\end{array}\right.,\quad j\in \left[1,2,\cdots ,{L}_{pro}\right]$$
(2)
where \(i\in \left[1,2,\cdots ,{7}^{h}\right]\) is the index of the \({7}^{h}\)-mer \({F}_{f}\) symbols and \(h\) denotes the high-order degree. In this paper, we set \(h=3\), so each protein sequence was converted into a \({7}^{h}\times max{L}_{pro}\) numerical matrix (a minimal encoding sketch is given below).
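The sketch below illustrates the high-order one-hot idea for an lncRNA sequence with k = 4: each position j is mapped to a 4^k-dimensional indicator vector for the k-mer starting at j. The helper function name and the handling of positions near the end of the sequence are illustrative assumptions.

```python
from itertools import product

import numpy as np

BASES = "ACGU"
K = 4  # order of the k-mer encoding used for lncRNAs in this study

# Index every possible k-mer (4**K of them) for the one-hot rows.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(BASES, repeat=K))}

def lncrna_one_hot(seq: str, max_len: int = 2500, k: int = K) -> np.ndarray:
    """Return a (4**k, max_len) high-order one-hot matrix for an lncRNA sequence.

    Positions whose k-mer runs past the end of the sequence (or contains an
    unknown base) are left as all-zero columns, mirroring zero-padding.
    """
    matrix = np.zeros((len(KMER_INDEX), max_len), dtype=np.int8)
    seq = seq.upper().replace("T", "U")[:max_len]
    for j in range(len(seq) - k + 1):
        row = KMER_INDEX.get(seq[j:j + k])
        if row is not None:
            matrix[row, j] = 1
    return matrix

# Example usage with a toy sequence:
m = lncrna_one_hot("AUGGCUAAGCUAGCUA")
print(m.shape)          # (256, 2500)
print(m[:, 0].sum())    # 1: exactly one active 4-mer at position 0
```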
Protein feature CTDD

The composition–transition–distribution (CTD) method was used to extract protein sequence features, in addition to the K-mer and one-hot encoding methods described above. The distribution (CTDD) descriptor captures how amino acids with specific structural or physicochemical properties are distributed along a protein or peptide sequence [35]. Starting from the beginning of the protein sequence, we recorded, for each amino acid group, the positions of the first occurrence and of the residues at 25%, 50%, 75% and 100% of that group's occurrences. These positions were then divided by the length of the whole sequence, yielding a numerical representation suitable for machine learning (see the sketch below).
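Below is a minimal sketch of the distribution (D) part of CTD for a single amino acid group; the percentage points follow the description above, while the function name, the rounding rule and the normalization are assumptions made for illustration.

```python
def ctdd_for_group(seq: str, group: set, fractions=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Relative positions (divided by sequence length) of the first residue and of
    the residues at 25/50/75/100% of the occurrences of `group` in `seq`."""
    positions = [i + 1 for i, aa in enumerate(seq) if aa in group]  # 1-based positions
    if not positions:
        return [0.0] * len(fractions)
    n = len(positions)
    descriptors = []
    for frac in fractions:
        # Index of the occurrence covering this fraction of the group's residues.
        idx = max(int(round(frac * n)) - 1, 0)
        descriptors.append(positions[idx] / len(seq))
    return descriptors

# Example: distribution of the group {A, G, V} in a toy sequence.
print(ctdd_for_group("MAGVKKAGLVVA", {"A", "G", "V"}))
```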
lncRNA feature Z-curve

For the lncRNA sequences, we also used Z-curve parameters describing the frequencies of phase-specific trinucleotides (Z_curve_144bit) to extract computable features from each lncRNA sequence. The Z_curve_144bit descriptor transforms the frequencies of the nucleotides (A, U, G, C) in an lncRNA sequence into a three-dimensional space using the Z-transformation technique [36]. In this part of the study, we extracted \(3\times 3\times 4\times 4=144\) feature descriptors with the Z_curve_144bit method. This descriptor is calculated as follows:

$$\left\{\begin{array}{l}{x}_{XY}^{k}=\left[{p}^{k}\left(XYA\right)+{p}^{k}\left(XYG\right)\right]-\left[{p}^{k}\left(XYC\right)+{p}^{k}\left(XYU\right)\right]\\ {y}_{XY}^{k}=\left[{p}^{k}\left(XYA\right)+{p}^{k}\left(XYC\right)\right]-\left[{p}^{k}\left(XYG\right)+{p}^{k}\left(XYU\right)\right]\\ {z}_{XY}^{k}=\left[{p}^{k}\left(XYA\right)+{p}^{k}\left(XYU\right)\right]-\left[{p}^{k}\left(XYG\right)+{p}^{k}\left(XYC\right)\right]\end{array}\right.,\quad X,Y=A,C,G,U;\ k=1,2,3$$

(3)

where \({x}_{XY}^{k}\), \({y}_{XY}^{k}\) and \({z}_{XY}^{k}\) are the coordinates of a point in three-dimensional space; \({p}^{k}\left(XYA\right)\), \({p}^{k}\left(XYG\right)\), \({p}^{k}\left(XYC\right)\) and \({p}^{k}\left(XYU\right)\) denote the frequencies of the trinucleotides \(XYA\), \(XYG\), \(XYC\) and \(XYU\), with \(X,Y=A,C,G,U\); and \(k=1,2,3\) indicates that the nucleotides are situated at the 1st, 2nd or 3rd codon position. A minimal sketch of this descriptor is given below.
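The following is a minimal sketch of the Z_curve_144bit computation following Eq. (3): phase-specific trinucleotide frequencies p^k(XYZ) are collected for the three reading phases and combined into 144 x/y/z coordinates. The function name, the phase convention and the frequency normalization are assumptions for illustration.

```python
from collections import Counter
from itertools import product

import numpy as np

BASES = "ACGU"

def z_curve_144bit(seq: str) -> np.ndarray:
    """Return the 144-dimensional Z-curve descriptor (3 phases x 16 XY pairs x 3 coords)."""
    seq = seq.upper().replace("T", "U")
    features = []
    for k in (1, 2, 3):  # codon phase (an assumed phase convention)
        # Trinucleotides whose first base starts at offset k-1, stepping by codon.
        tris = [seq[i:i + 3] for i in range(k - 1, len(seq) - 2, 3)]
        total = max(len(tris), 1)
        counts = Counter(tris)
        p = {t: counts[t] / total for t in ("".join(q) for q in product(BASES, repeat=3))}
        for X, Y in product(BASES, repeat=2):
            x = (p[X + Y + "A"] + p[X + Y + "G"]) - (p[X + Y + "C"] + p[X + Y + "U"])
            y = (p[X + Y + "A"] + p[X + Y + "C"]) - (p[X + Y + "G"] + p[X + Y + "U"])
            z = (p[X + Y + "A"] + p[X + Y + "U"]) - (p[X + Y + "G"] + p[X + Y + "C"])
            features.extend([x, y, z])
    return np.asarray(features)

print(z_curve_144bit("AUGGCUAAGCUAGCUAAUG").shape)  # (144,)
```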
Model construction

CBIL-VHPLI was implemented with the Keras application programming interface (API) using TensorFlow as the backend. CBIL-VHPLI is composed of a Conv1D layer, a max pooling layer, a bidirectional LSTM (BiLSTM) layer and a final dense layer. First, features were extracted from RPI18072 through four different pipelines to obtain four different types of features. Next, the input lncRNA/protein sequence feature matrices were convolved by the Conv1D layer, and the data were downscaled by the max pooling layer. The resulting feature vectors were passed to a bidirectional LSTM layer, with the ReLU function used as the activation function of the convolutional network. The BiLSTM layer outputs a one-dimensional feature vector, after which the output features are fused. Finally, the prediction results are produced by the MLP layer (Fig. 2).

Figure 2. Flowchart of the CBIL-VHPLI model.

Conv1D Layer

The Conv1D layer is a one-dimensional convolutional layer that convolves the input sequence with filters to learn local patterns or features. In this study, the Conv1D layer serves as a feature extractor whose inputs are the sequence codes, i.e., matrix R and matrix P described in the one-hot encoding section, together with the protein sequence descriptors extracted by CTDD and the lncRNA sequence descriptors extracted by the Z-curve method, each processed separately. Each Conv1D layer has 64 filters, so it learns 64 different patterns or features from the input sequence. For lncRNA sequences, the convolution kernel size is set to 3, meaning that each filter spans three consecutive values of the input sequence; for protein sequences, the kernel size is set to 5. The padding parameter is set to "same", so the input sequence is zero-padded at both ends and the output sequence has the same length as the input. The activation function used in the Conv1D layer is the rectified linear unit (ReLU), a commonly used activation function in neural networks.

MaxPooling1D Layer

The output of the convolutional layer is fed to a max pooling layer with a pool size of 2. The MaxPooling1D layer down-samples the input sequence by taking the maximum value in each window of the specified size; with a pool size of 2, the maximum of each pair of adjacent values is taken to form the output sequence. The MaxPooling1D layer has no padding parameter, so the output sequence is shorter than the input sequence by a factor equal to the pool size.

Bidirectional LSTM Layer

Through the convolutional filters and the max pooling layer, the CNN module learns the sequence information of the lncRNAs and proteins, and all channels of each subunit are then combined into a new feature vector. To further exploit the sequence information, we adopted a BiLSTM neural network. The BiLSTM is based on a recurrent neural network (RNN) and adds memory units to each hidden-layer unit, making the memory of the time series controllable and giving the network a long-term memory function [37,38,39]. The bidirectional LSTM layer has 32 units, i.e., 32 LSTM cells in each direction. The default activation function for LSTM cells in Keras is the hyperbolic tangent (tanh), and the default recurrent activation function is the sigmoid function [40]. The bidirectional LSTM layer uses the default merge mode, which concatenates the outputs of the forward and backward LSTM layers along the last dimension.

Dense Layer

The output of the bidirectional LSTM layer is fed to a fully connected dense layer with 64 units. The activation function used in this layer is ReLU.

Output Layer

The final layer of the model is a dense layer with a single unit and a sigmoid activation function. It produces a binary classification output indicating the probability that the input pair belongs to the interacting class; a threshold can be applied to this output to make the final binary classification decision. The output can be expressed as follows:

$$y=\sigma \left({W}_{n+1}\,{f}_{n-1}\left({W}_{n}{h}_{bi}\right)\right)$$
(4)
where \(y\) is the output, \(\sigma\) is the sigmoid activation function, \({W}_{n+1}\) is the weight matrix of the final dense layer, \({W}_{n}\) is the weight matrix of the preceding dense layer, \({f}_{n-1}\) is the activation function of that dense layer (here, the rectified linear unit, ReLU), and \({h}_{bi}\) is the output of the BiLSTM layer.

During training, the model is fed batches of input sequences together with their corresponding labels. The model computes the loss between the predicted outputs and the true labels and then updates its weights via backpropagation. The loss function used in the CBIL-VHPLI model is binary cross-entropy, and the optimizer is Adam. For each dataset, we held out 20% of the data as a test set, and after pretraining we performed 5CV on three independent test sets (RPI2241, RPI1807 and RPI488) to comprehensively evaluate the model's performance. The batch size was 256 and the number of training epochs was 100. The learning rate of the CNN was 0.001, and the loss function was the binary cross-entropy loss. A minimal sketch of the model architecture, pretraining and fine-tuning setup is given at the end of this section.

Fine-tuning

Transfer learning can alleviate target-data scarcity by applying knowledge learned from a data-rich source task to a target task with only a small amount of data [41]. After transfer learning on human lncRNA-viral protein interaction pairs, the fine-tuned model was compared with several existing algorithms: IPMiner, RPISeq-RF and RPITER. During fine-tuning, 80% of the samples in the fine-tuning dataset were used as the transfer learning training set, and the remaining 20% were randomly selected as the validation set. To address the possibility that an imbalance between the fine-tuning dataset and the training set could affect model performance, we used an upsampling technique to enlarge the test set so that the ratio of positive to negative samples was more balanced. Specifically, we randomly sampled RNA and protein sequences without pairwise interaction information from the test set and treated these pairs as noninteracting (negative) data. In this way, we increased the size of the test set and ensured that the proportions of positive and negative samples in the test set were similar to those in the training set.
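The following Keras sketch illustrates the overall setup described above: two CNN-BiLSTM branches for the one-hot encoded lncRNA and protein sequences plus two dense inputs for the Z-curve and CTDD features, merged and trained with Adam and binary cross-entropy, then fine-tuned on the small viral-host dataset. The input shapes of the hand-crafted feature branches, the exact layer wiring and the fine-tuning hyperparameters (reduced learning rate, epoch count) are assumptions made for illustration, not the exact published configuration.

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import (Bidirectional, Concatenate, Conv1D, Dense,
                                     Input, LSTM, MaxPooling1D)
from tensorflow.keras.optimizers import Adam

def seq_branch(length, channels, kernel_size):
    """Conv1D -> MaxPooling1D -> BiLSTM branch for one sequence input."""
    inp = Input(shape=(length, channels))
    x = Conv1D(64, kernel_size, padding="same", activation="relu")(inp)
    x = MaxPooling1D(pool_size=2)(x)
    x = Bidirectional(LSTM(32))(x)   # tanh/sigmoid defaults, forward/backward outputs concatenated
    return inp, x

# Sequence branches: 4**4 = 256 channels for lncRNA 4-mers, 7**3 = 343 for protein 3-mers.
lnc_in, lnc_feat = seq_branch(2500, 256, kernel_size=3)
pro_in, pro_feat = seq_branch(1000, 343, kernel_size=5)

# Hand-crafted feature inputs (dimensions assumed: 144 Z-curve values, 35 CTDD values).
zcurve_in = Input(shape=(144,))
ctdd_in = Input(shape=(35,))

merged = Concatenate()([lnc_feat, pro_feat, zcurve_in, ctdd_in])
hidden = Dense(64, activation="relu")(merged)
output = Dense(1, activation="sigmoid")(hidden)

model = Model([lnc_in, pro_in, zcurve_in, ctdd_in], output)
model.compile(optimizer=Adam(learning_rate=0.001), loss="binary_crossentropy",
              metrics=["accuracy"])

# Pretraining on RPI18072 (X_* and y arrays prepared by the steps above):
# model.fit([X_lnc, X_pro, X_zcurve, X_ctdd], y, batch_size=256, epochs=100,
#           validation_split=0.2)

# Fine-tuning on vhRPI286: reuse the pretrained weights and continue training
# with a smaller, assumed learning rate on the viral-host pairs:
# model.compile(optimizer=Adam(learning_rate=1e-4), loss="binary_crossentropy",
#               metrics=["accuracy"])
# model.fit([Xv_lnc, Xv_pro, Xv_zcurve, Xv_ctdd], yv, batch_size=32, epochs=20,
#           validation_split=0.2)
```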
