Predicting RNA sequence-structure likelihood via structure-aware deep learning | BMC Bioinformatics

In this section, we analyze the performance of NU-ResNet and NUMO-ResNet and compare them with state-of-the-art approaches: ENTRNA, presented in [12], and the equilibrium probability from the ensemble of RNA structures, proposed in [32]. Section “Data sets” characterizes the data sets utilized in this research. Section “Models comparison” compares the performance of NU-ResNet and NUMO-ResNet with the data-driven approach, ENTRNA, and the model-driven approach, the equilibrium probability from the ensemble of RNA secondary structures. Section “Analysis of NU-ResNet and NUMO-ResNet” presents the analysis of NU-ResNet and NUMO-ResNet, including the comparison between the two models, their convergence behavior, and their robustness. Section “Performance of NU-ResNet and NUMO-ResNet across independent RNA families” reports the performance of NU-ResNet, NUMO-ResNet, ENTRNA, and the equilibrium probability across independent RNA families.

Data sets

The samples utilized in this research are extracted from the Protein Data Bank (PDB), in particular from the RNA STRAND database [43]. This is a commonly used data set in the literature [9, 12, 20]. When generating the data set for our analysis, we only consider RNAs validated by X-ray or NMR, thus ensuring the availability of the ground truth for each sequence. Synthetic RNAs and RNAs with pseudoknots are not considered.

Both our deep learning models require positive and negative samples for training. While the positive samples are the RNA sequence-structure pairs in the data set, we use the Positive-Unlabeled (PU) learning method [12, 44] to generate multiple negative samples for the same RNA structure. Specifically, we use RNAinverse [45] and incaRNAtion [46] to generate 101 negative sequence candidates for each RNA structure. Not all the generated sequences are accepted as negative samples.
Similar to the approach in [12, 47], we accept negative samples that satisfy three requirements:

Repetition constraint: any sub-sequence of an RNA sequence can have at most r consecutive identical nucleotides. In this analysis, we set \(r=6\);

Base-pair constraint: the only allowed base pairs in an RNA sequence are AU, CG, and GU;

Nucleotide-frequency constraint: the most or the least frequently occurring nucleotide within the sequence must be either G or C.
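A screening routine for these three requirements might look like the following sketch. The candidate's secondary structure is assumed to be given in dot-bracket notation, and the function name and helper logic are illustrative, not part of the original pipeline:

```python
import re
from collections import Counter

ALLOWED_PAIRS = {frozenset("AU"), frozenset("CG"), frozenset("GU")}

def satisfies_constraints(seq, dot_bracket, r=6):
    """Screen a negative-sample candidate against the three constraints.

    seq: candidate RNA sequence over A, C, G, U.
    dot_bracket: target secondary structure (pseudoknot-free).
    """
    # 1. Repetition constraint: at most r consecutive identical nucleotides,
    #    so a run of r + 1 identical characters is a violation.
    if re.search(r"(.)\1{%d,}" % r, seq):
        return False
    # 2. Base-pair constraint: only AU, CG, and GU pairs are allowed.
    stack = []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            if frozenset({seq[j], seq[i]}) not in ALLOWED_PAIRS:
                return False
    # 3. Frequency constraint: the most or least frequent nucleotide is G or C.
    counts = Counter(seq)
    most = max(counts, key=counts.get)
    least = min(counts, key=counts.get)
    return most in "GC" or least in "GC"
```

For example, a hairpin candidate whose closing stem contains a GA pair fails the second check, while a sequence with seven consecutive G's fails the first.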

These three constraints help us select negative samples that are “reliable”. Upon screening the candidates against the constraints, we calculate the five features proposed in [12] (Normalized Sequence Segment Entropy with Segment Size 3, GC Percentage, Ensemble Diversity, Expected Accuracy, and Pseudoknot-free RNA normalized free energy) and compute the Euclidean distance between each negative sample candidate and the corresponding positive sample. The negative sample candidate with the largest distance from the positive sample is finally selected and included in the data set. As a result, each RNA structure is associated with one positive and one negative sequence.

The positive samples, i.e., the ground-truth pairs within the data set, may contain base pairs other than AU, CG, and GU. To keep consistency with the negative samples, we only consider the AU, CG, and GU pairs in the positive samples. Otherwise, the presence of base pairs other than AU, CG, and GU would become a main feature for classifying positive and negative samples, which is not what we intend.

The longest RNA sequence within the data set utilized in this research has 408 nucleotides, and the shortest has 12 nucleotides. To unify the size of the inputs of NU-ResNet and NUMO-ResNet, we pad both the 3D RNA matrix and the nucleotide localized information matrix with 0. We choose 410 as the maximum length \({\mathcal {L}}\), resulting in 3D RNA matrices of size \(\left[ 410 \times 410 \times 4\right]\) and nucleotide localized information matrices of size \(\left[ 410 \times 18\right]\).
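The selection of the final negative sample, i.e., the candidate farthest from the positive sample in the five-feature space, can be sketched as follows; the feature vectors shown are toy values, not the actual ENTRNA features:

```python
import math

def select_negative(positive_features, candidate_features):
    """Return the index of the candidate with the largest Euclidean
    distance from the positive sample in feature space.

    positive_features: list of floats (one value per feature).
    candidate_features: list of feature lists, one per negative candidate.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return max(range(len(candidate_features)),
               key=lambda i: dist(positive_features, candidate_features[i]))

# Toy 5-dimensional feature vectors (illustrative values only)
pos = [0.8, 0.55, 2.1, 0.7, -0.3]
cands = [[0.7, 0.5, 2.0, 0.6, -0.2],
         [0.1, 0.9, 5.0, 0.2, 0.4],
         [0.75, 0.52, 2.2, 0.68, -0.25]]
best = select_negative(pos, cands)  # index of the farthest candidate
```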
When utilizing the CDF of the normal distribution to rescale free energy to the \(\left[ 0, 1\right]\) interval in the nucleotide localized information matrix, we set \(\mu = 0\) and \(\sigma = 5\).

In this work, the RNA sequence-structure pairs are randomly split into training (TrDS, 81% of the inputs), validation (VDS, 9%), and testing (TeDS, 10%) data sets. This data split is the same as proposed in PreRBP-TL [48]. The TrDS, VDS, and TeDS contain 259, 29, and 32 RNAs, respectively; including the negative samples results in 518, 58, and 64 samples, respectively.

In the following, we analyze the performance of both our models with the largest validation accuracy and the lowest validation loss. For these models, we also show the 10-fold CV performance on the combined TrDS, VDS, and TeDS data sets.

Models comparison

In order to evaluate the performance of NU-ResNet and NUMO-ResNet, we compare them with a data-driven approach, ENTRNA [12], and a model-driven approach, the equilibrium probability proposed in [32]. The data-driven approach uses machine learning to develop the model: the RNA sequence-secondary structure pairs are encoded using feature engineering, and the parameters of the model are learned during training. The model-driven approach develops the model based on physical principles. Specifically, ENTRNA evaluates an RNA sequence-secondary structure pair based on its features, while the equilibrium probability approach evaluates the pair based on its free energy. Section “Models comparison with data-driven approach” presents the comparison of NU-ResNet and NUMO-ResNet with the data-driven approach, ENTRNA.
Section “Models comparison with model-driven approach” presents the comparison of NU-ResNet and NUMO-ResNet with the model-driven approach, the equilibrium probability.

Models comparison with data-driven approach

Performance metrics

Since NU-ResNet, NUMO-ResNet, and ENTRNA are trained as binary classification models, we adopt several metrics to comprehensively analyze the trained architectures: the area under the receiver operating characteristic curve (AUCROC), the Matthews correlation coefficient (MCC), accuracy, precision, recall, and specificity. The AUCROC is threshold invariant, which allows it to evaluate a classification model comprehensively. In addition, AUCROC has been proven to be equivalent to the probability that a randomly chosen positive sample is ranked higher than a randomly chosen negative sample by the classification model [49]. Here, \(MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\), \(\text {Accuracy} = \frac{TP+TN}{TP+FN+TN+FP}\), \(\text {Precision} = \frac{TP}{TP+FP}\), \(\text {Recall} = \frac{TP}{TP+FN}\), and \(\text {Specificity} = \frac{TN}{TN+FP}\). Within these formulas, TP, TN, FP, and FN refer to the numbers of true positives, true negatives, false positives, and false negatives, respectively. AUCROC, accuracy, precision, recall, and specificity are all defined in the range \(\left[ 0,1\right]\); the MCC is defined in the range \(\left[ -1,1\right]\). For all of the metrics, higher values indicate better performance.

NU-ResNet and NUMO-ResNet compared to ENTRNA

As previously mentioned, we record the models with the best validation accuracy and the best validation loss resulting from training. The corresponding sets of model parameters are referred to as \(\varvec{\vartheta }^{*}_{a}\) and \(\varvec{\vartheta }^{*}_{\ell }\), respectively. The performance of the resulting NU-ResNet and NUMO-ResNet models on the TeDS is shown in Table 2.
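For reference, the metrics defined above can be computed directly from the confusion counts; a minimal sketch, with counts that are purely illustrative (not taken from Table 2):

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute the classification metrics used in this section
    from the confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": mcc,
    }

# Illustrative counts for a 64-sample test set (32 positive, 32 negative)
m = metrics(tp=30, tn=28, fp=4, fn=2)
```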
We retrain ENTRNA, the state-of-the-art RNA sequence-secondary structure pair evaluation model, on the TrDS and test it on the TeDS. The performance of ENTRNA on the TeDS is also shown in Table 2: all four of our models outperform ENTRNA on the TeDS.

In Table 2, we observe that NU-ResNet with \(\varvec{\vartheta }^{*}_{a}\) outperforms ENTRNA on all metrics except recall, where they achieve the same performance. NU-ResNet with \(\varvec{\vartheta }^{*}_{\ell }\) outperforms ENTRNA on 5 out of 6 metrics (i.e., accuracy, AUCROC, MCC, precision, and specificity). In addition, NUMO-ResNet with \(\varvec{\vartheta }^{*}_{\ell }\) outperforms ENTRNA on all metrics, and NUMO-ResNet with \(\varvec{\vartheta }^{*}_{a}\) outperforms ENTRNA on 5 out of 6 metrics (i.e., accuracy, AUCROC, MCC, precision, and specificity). Overall, the performance of NU-ResNet and NUMO-ResNet is superior to that of ENTRNA on everything except recall. The low precision, low specificity, and high recall of ENTRNA indicate that the model is inclined to classify samples as positive.

From the comparisons among NU-ResNet, NUMO-ResNet, and ENTRNA, we draw the following two conclusions.

The overall classification performance of both NU-ResNet and NUMO-ResNet is superior to that of ENTRNA on the TeDS.

The experiments indicate the effectiveness of the methods utilized by NU-ResNet and NUMO-ResNet to encode the RNA sequence-structure pair.

Table 2 Models performance on TeDS

Models comparison with model-driven approach

Equilibrium probability

According to the method proposed in [32], the equilibrium probability is defined with respect to the set of all pseudoknot-free RNA structures for a given RNA sequence [32]. Specifically, the authors define the equilibrium probability as$$\begin{aligned} p(str_{i})=\frac{\exp (-\frac{E(str_{i})}{RT})}{\sum _{j=1}^{N}\exp (-\frac{E(str_{j})}{RT})}, \end{aligned}$$where \(str_{i}\) is the i-th RNA structure, \(E(str_{i})\) is the free energy of \(str_{i}\), and N is the number of RNA structures in the ensemble. Finally, R is the gas constant and T is the thermodynamic temperature [22]. From this formula, we observe that, for a given sequence, an RNA structure with lower free energy has a higher equilibrium probability.

In this analysis, we adopt the ViennaRNA package [10] to calculate the equilibrium probability of the RNA sequence-structure pairs. We first use mfe() to obtain the MFE structure for a given RNA sequence and the corresponding free energy of this structure. Then, we use exp_params_rescale(), setting its parameter to the MFE value, to rescale the Boltzmann factors used in computing the partition function. Finally, we use pf() and pr_structure() to calculate the partition function and the associated equilibrium probability for the given RNA structure, respectively.

NU-ResNet and NUMO-ResNet compared to Equilibrium Probability

We calculate the ensemble-based equilibrium probability for all data in the TeDS, which contains 32 RNAs in total. For 13 out of these 32 RNAs, the equilibrium probability of the corresponding positive sample is 0. This is because their free energies are much greater than the free energies of the associated structures predicted by RNAfold [10], which approximates the MFE.
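The equilibrium probability formula above can be illustrated on a toy ensemble. The R and T values below are our assumptions (kcal/mol units at 37 °C, ViennaRNA's default temperature), and the explicit enumeration is only for illustration; in practice the paper relies on ViennaRNA's pf() and pr_structure():

```python
import math

R = 1.98717e-3  # gas constant in kcal/(mol*K) (assumed units)
T = 310.15      # thermodynamic temperature in K (37 C)

def equilibrium_probabilities(energies):
    """Boltzmann equilibrium probability of each structure in a toy ensemble.

    energies: free energies E(str_i) of the structures, in kcal/mol.
    """
    weights = [math.exp(-e / (R * T)) for e in energies]
    z = sum(weights)  # partition function of the ensemble
    return [w / z for w in weights]

# A structure a few kcal/mol above the minimum free energy receives a
# near-zero probability, which is why positive samples far above the
# RNAfold (approximate MFE) energy score essentially zero.
probs = equilibrium_probabilities([-10.0, -9.5, -5.0])
```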
Among the 32 positive samples in the TeDS, 22 have free energies greater than the free energies of the associated RNAfold predicted structures, and only 7 (21.88%) have structures identical to the corresponding RNAfold predicted structures. In terms of the free energy comparison between the positive and negative sample of each RNA in the TeDS, for 28 out of 32 RNAs the positive sample's free energy is lower than the negative sample's. In other words, directly classifying positive and negative samples by free energy yields 87.5% accuracy, which is still lower than the accuracy of NU-ResNet or NUMO-ResNet. After removing all the zero and extremely small equilibrium probability values, 7 RNAs within the TeDS have a positive sample whose equilibrium probability is lower than that of the associated negative sample, which means 21.88% of the samples are evaluated incorrectly. For the 7 RNAs whose positive samples have free energies equal to the free energy of the RNAfold predicted structure, the equilibrium probability classifies all 7 positive samples correctly and 3 out of 7 negative samples correctly when the threshold is 0.5. When the threshold is set to 0.6, the equilibrium probability classifies all 7 positive samples and all 7 negative samples correctly. Hence, the equilibrium probability performs well on RNAs whose ground truth structures coincide with the RNAfold predicted structures. However, for RNAs whose ground truth structures differ from the RNAfold predicted structures, the equilibrium probability shows a clear limitation. The data-driven approach is one direction to overcome this limitation.
Therefore, our data-driven approaches, NU-ResNet and NUMO-ResNet, are a good complement to the RNA sequence-secondary structure evaluation research field.

The significant difference in performance between the four proposed models and the equilibrium probability mainly stems from the different mechanisms underlying the two types of approaches: NU-ResNet and NUMO-ResNet are data-driven, whereas the equilibrium probability is physics-based. Among these 32 RNAs, there are two positive samples whose equilibrium probabilities are greater than 1. This is because these two positive samples have an AC pair in their structures, which makes their free energies lower than the corresponding ensemble free energies. NU-ResNet and NUMO-ResNet neglect base pairs other than AU, CG, and GU and limit the output score to the range from 0 to 1. The equilibrium probability approach also considers only the AU, CG, and GU pairs when building the ensemble; however, ground truth RNAs with base pairs other than AU, CG, and GU can have free energies lower than that of the ensemble, which causes the corresponding equilibrium probability to be greater than 1. The advantage of data-driven approaches is that they learn knowledge from the data, which can benefit the domain by exploiting what is learned from big data.

From the comparisons among NU-ResNet, NUMO-ResNet, and the equilibrium probability, we draw the following two conclusions.

The data-driven approaches, NU-ResNet and NUMO-ResNet, can learn knowledge directly from the data source, and this model-learned knowledge is able to benefit the RNA evaluation domain.

The data-driven approaches, NU-ResNet and NUMO-ResNet, are a good complement to the RNA sequence-secondary structure pair evaluation field, because purely using free energy to evaluate RNA sequence-secondary structure pairs has limited classification performance. Good classification performance can benefit the RNA secondary structure prediction and RNA inverse folding fields.

Analysis of NU-ResNet and NUMO-ResNet

Comparison between proposed models

Table 2 shows that NUMO-ResNet with \(\varvec{\vartheta }^{*}_{\ell }\) outperforms NU-ResNet with \(\varvec{\vartheta }^{*}_{a}\) and \(\varvec{\vartheta }^{*}_{\ell }\) on all 6 metrics. In AUCROC, NUMO-ResNet with \(\varvec{\vartheta }^{*}_{a}\) is superior to NU-ResNet with \(\varvec{\vartheta }^{*}_{a}\) and equal to NU-ResNet with \(\varvec{\vartheta }^{*}_{\ell }\).

These results follow our expectation, because NUMO-ResNet incorporates more features than NU-ResNet. Intuitively, NUMO-ResNet should perform at least as well as NU-ResNet, because NUMO-ResNet receives the whole input that NU-ResNet has. The results also show the advantage of the motif-based features extracted by NUMO-ResNet over the input employed by NU-ResNet, which incorporates only sequence and structure information.

Convergence behavior of NU-ResNet training compared to NUMO-ResNet training

Here, we provide insights into the training process of the proposed models. In particular, we analyze the validation loss and accuracy as a function of the training effort (i.e., number of epochs). Figure 5a shows that the NU-ResNet validation loss presents larger fluctuations than that of NUMO-ResNet. Figure 5b confirms this observation for accuracy: NU-ResNet has larger fluctuations in validation accuracy than NUMO-ResNet. This finding suggests that the motif-based features extracted by NUMO-ResNet have a positive effect on the model when learning the RNA data, since its validation loss and validation accuracy are more stable than those of NU-ResNet.

Fig. 5 a: The training and validation loss. b: The training and validation accuracy

Models robustness analysis

Since we utilize a weighted sampler to sample the data during training, which introduces randomness, the performance of the trained models on testing data may be affected by this randomness.
To verify the robustness of the trained models, we perform a 10-fold CV on both NU-ResNet and NUMO-ResNet. Similar to the previous analysis, for each iteration of the validation routine, we consider two models, one with the best validation accuracy and one with the best validation loss. Table 3 reports the 10-fold CV results as the average performance over the 10 iterations. These results confirm that NU-ResNet and NUMO-ResNet are capable of tackling different groups of RNAs across the data set used in this research.

Table 3 10-fold cross validation results from NU-ResNet and NUMO-ResNet

Performance of NU-ResNet and NUMO-ResNet across independent RNA families

Inspired by the findings introduced in [50], we conduct experiments to analyze the performance of NU-ResNet and NUMO-ResNet across independent RNA families. Specifically, we train and validate NU-ResNet and NUMO-ResNet only on Transfer RNA and Ribosomal RNA, because these are the two largest RNA families in the data set utilized in this research. The training and validation data contain 118 and 14 RNAs, respectively; considering the negative samples, there are 236 and 28 samples in the training and validation data, respectively. The ratio between training and validation data is consistent with the ratio between TrDS and VDS in section “Data sets”. We then test the trained models on each remaining RNA family individually. In addition to the RNA families within the PDB data set we utilize in this research, we also include the Riboswitch data from [51] as an independent testing data set. The statistics of each RNA family are summarized in Table 4. We also retrain and test ENTRNA on the same data sets for comparison among the data-driven approaches; the results are shown in Tables 5 and 6. In addition, we test the equilibrium probability to compare NU-ResNet and NUMO-ResNet with the model-driven approach.
Table 4 RNA length range grouped by family

Data-driven models performance across RNA families

We select from the PDB data set the Transfer RNA, 5S Ribosomal RNA, 16S Ribosomal RNA, 23S Ribosomal RNA, and other Ribosomal RNA to form the training data set. We use the same hyperparameter settings introduced in section “Methods” to retrain NU-ResNet and NUMO-ResNet. Then we test the resulting NU-ResNet and NUMO-ResNet models “out of sample” on Group I intron, Group II intron, SRP RNA, Viral and Phage, Small Nuclear RNA, Ribonuclease P RNA, Internal Ribosome Entry Site, Hairpin Ribozyme, Hammerhead Ribozyme, Riboswitch, other Ribozyme, and other RNA individually. We also retrain and test ENTRNA using the same data sets. Because the model complexity of NUMO-ResNet is higher than that of NU-ResNet, more training data are expected for NUMO-ResNet than for NU-ResNet. However, in order to avoid overlapping RNA families between training and testing data, we need to exclude the other RNA family from the training data set, which reduces the size of the training data compared to the TrDS.

In Table 5, there are 11 testing RNA families in total. NU-ResNet has better or equal performance compared to ENTRNA in 9, 9, and 10 testing RNA families based on accuracy, AUCROC, and MCC, respectively.
NUMO-ResNet has better or equal performance compared to ENTRNA in 7, 7, and 7 testing RNA families based on accuracy, AUCROC, and MCC, respectively.

Table 5 Models performance across RNA families on Accuracy, AUCROC, and MCC

Table 6 shows that both NU-ResNet and NUMO-ResNet have balanced performance in precision, recall, and specificity across all testing RNA families except the Group I intron, which implies that NU-ResNet and NUMO-ResNet are not biased when tested on most of these new RNA families. In contrast, ENTRNA shows biased performance on Group I intron, Hairpin Ribozyme, and Hammerhead Ribozyme. In terms of handling both positive and negative samples across different RNA families, NU-ResNet and NUMO-ResNet show a more balanced capability than ENTRNA.

Table 6 Models performance across RNA families on Precision, Recall, and Specificity

Tables 5 and 6 show that both the NU-ResNet and NUMO-ResNet models outperform the competitors on the SRP RNA, Ribonuclease P RNA, Internal Ribosome Entry Site, and Hammerhead Ribozyme families. The results for the aggregate data sets “C” and “C\(+\)” in Table 5 show that NU-ResNet and NUMO-ResNet outperform ENTRNA across all metrics. Although the “C” data set and the training data share no RNA families, NU-ResNet still achieves an AUCROC of 0.9458, which is consistent with its AUCROC in Table 2. This shows the generalizability of NU-ResNet. On the other hand, NUMO-ResNet shows a decrease in performance. We believe this difference is not due to lower generalizability of the model, but rather to the reduced size of the training data set compared to the TrDS and the larger number of parameters required by NUMO-ResNet compared to NU-ResNet.

By testing NU-ResNet, NUMO-ResNet, and ENTRNA across different RNA families, we obtain the following conclusions.

The overall testing performance of NU-ResNet and NUMO-ResNet across different RNA families is superior to that of ENTRNA.

The experiments show that NU-ResNet performs admirably when handling data from new RNA families.

Equilibrium probability performance across RNA families

In order to compare NU-ResNet and NUMO-ResNet with the equilibrium probability on different RNA families, we test the equilibrium probability on each data set listed in Table 5. The performance of the equilibrium probability across RNA families is shown in Table 7. In total, for 80.73% of the RNAs in Table 7, the positive sample has a free energy greater than that of the corresponding RNAfold predicted structure.

Table 7 shows in column \(FE_{pos}<FE_{neg}\) the percentage of RNAs whose positive samples have lower free energy than the negative samples within each RNA family. For example, all the data within the Group II intron and Ribonuclease P RNA families exhibit this property; as a result, a classifier that uses free energy to distinguish positive from negative samples (as the equilibrium probability does) will achieve good performance on them. On the other hand, families such as Group I intron and Hairpin Ribozyme exhibit this property for only 33.33% and 36.36% of the cases, respectively. Considering only the RNAs that have negative samples in the “C” and “C\(+\)” data sets, we observe that 75.26% and 79.26%, respectively, exhibit positive samples with lower free energy than the negative sample, resulting in an accuracy of 75.26% and 79.26% for the equilibrium probability method. This performance is superior to both ENTRNA and NUMO-ResNet, while NU-ResNet still shows the best results.

Table 7 The performance of equilibrium probability across RNA families

From the testing performance of NU-ResNet, NUMO-ResNet, and the equilibrium probability across different RNA families, we draw the following conclusions.

Using free energy to evaluate RNA sequence-secondary structure pairs yields different performance in different RNA families, whereas NU-ResNet and NUMO-ResNet perform more consistently across RNA families.

Leveraging the knowledge learnt from data-driven approaches can benefit the classification performance of the models across independent RNA families.
