BEROLECMI: a novel prediction method to infer circRNA-miRNA interaction from the role definition of molecular attributes and biological networks

Evaluation criteria

In this study, we adopt five-fold cross-validation (five-fold CV) to evaluate the performance of the proposed method. Five-fold CV randomly divides the CMI data into five subsets; in each round, four subsets are used as the training set and the remaining subset as the test set, and five rounds are performed so that every subset receives a predicted score. We comprehensively evaluate the predictive performance of the proposed model with multiple evaluation criteria, including Accuracy (Acc.), Precision (Prec.), Recall, and F1-score, defined as:

$$Acc. = \frac{TP + TN}{TP + TN + FP + FN}$$
(12)
$$Prec. = \frac{TP}{TP + FP}$$
(13)
$$Recall = \frac{TP}{TP + FN}$$
(14)
$$F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
(15)
Among them, TP and TN denote the numbers of positive and negative samples correctly predicted by the model, while FP and FN denote the numbers of negative and positive samples incorrectly predicted. In addition, we report the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve.

In this work, we evaluate model performance through five-fold CV on three datasets commonly used in the field of CMI prediction. The CMI-9905 dataset was compiled by Wang et al. [28] and contains 9905 interactions between 2346 circRNAs and 962 miRNAs; it consists of high-confidence CMIs and serves as the benchmark dataset in this study. The CMI-9589 dataset comes from the circBank database [39], from which we select 9589 high-confidence interactions between 2115 circRNAs and 821 miRNAs as training data. The CMI-753 dataset is collected from the circR2Cancer database [40]; through strict screening and processing of the latest version of the data, we obtain 753 experimentally supported interactions between 515 circRNAs and 469 miRNAs, which we use for the case study.

In addition, we construct negative samples to balance the datasets based on the principle of sequence complementarity. Because miRNAs bind to miRNA response elements (MREs), endogenous RNAs that share a common MRE regulate each other's expression by competing for the same miRNA; this is known as the competing endogenous RNA hypothesis [41]. If a circRNA contains the MRE of a miRNA, it is a potential target of that miRNA, and vice versa. In this study, negative samples are therefore defined as circRNA-miRNA pairs that do not share common MREs, and we adopt dataset-specific construction procedures. For the CMI-9905 and CMI-9589 datasets, we enumerate all possible circRNA-miRNA pairs, remove the pairs with a confidence score greater than 0 in the circBank database, and randomly select the same number of pairs as the positive samples to serve as negatives during model training. For the CMI-753 dataset, which is built from real cases, the known interactions are reported by experiments or the literature, so we select pairs that have not been reported in existing studies as negative samples. This strategy effectively avoids taking potential CMIs as negative samples and ensures the reliability of the reported model performance.
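To make the protocol concrete, the sketch below outlines the negative-pair sampling and five-fold CV evaluation described above in Python. It is a minimal illustration rather than the authors' released implementation: the feature matrix X and label vector y are assumed to be pre-computed from the attribute, self-similarity, and role-definition features, the circBank confidence scores are assumed to be available as a dictionary keyed by (circRNA, miRNA) pairs, and LightGBM is used as the classifier (the strategy ultimately adopted; see "Optimal classification strategy" below).

```python
import random
from itertools import product

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)


def sample_negative_pairs(circRNAs, miRNAs, positive_pairs, circbank_scores):
    """Draw balanced negatives: candidates are all circRNA-miRNA combinations
    that are neither known positives nor scored (> 0) in circBank."""
    excluded = set(positive_pairs) | {p for p, s in circbank_scores.items() if s > 0}
    candidates = [p for p in product(circRNAs, miRNAs) if p not in excluded]
    return random.sample(candidates, len(positive_pairs))


def five_fold_cv(X, y, seed=0):
    """Five-fold CV reporting Acc., Prec., Recall, F1 (Eqs. 12-15) plus AUC and AUPR."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    fold_scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = LGBMClassifier(n_estimators=500)
        clf.fit(X[train_idx], y[train_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]      # predicted interaction score
        pred = (prob >= 0.5).astype(int)                 # hard label at a 0.5 threshold
        fold_scores.append([
            accuracy_score(y[test_idx], pred),
            precision_score(y[test_idx], pred),
            recall_score(y[test_idx], pred),
            f1_score(y[test_idx], pred),
            roc_auc_score(y[test_idx], prob),
            average_precision_score(y[test_idx], prob),
        ])
    return np.asarray(fold_scores).mean(axis=0)          # averaged over the five folds
```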
Performance evaluation

In this section, we use CMI-9905 as the benchmark dataset for performance evaluation. The results of the model under five-fold CV are recorded in Table 1.

Table 1 The prediction results of BEROLECMI on the benchmark dataset

Table 1 shows that, under five-fold CV on the benchmark dataset, the average values of the six evaluation criteria of BEROLECMI are 0.8395, 0.8427, 0.8396, 0.8392, 0.9104, and 0.9086, respectively, indicating that the proposed model can efficiently complete the CMI prediction task. The ROC and PR curves of BEROLECMI are shown in Fig. 5.

Fig. 5 The ROC curve (A) and PR curve (B) of BEROLECMI

Performance on different datasets

To assess the generalization ability of BEROLECMI in CMI prediction, we also perform prediction tasks on the other datasets commonly used in the field of CMI prediction (CMI-9589 and CMI-753). According to our statistics, more than 80% of CMI prediction models use the datasets adopted in this work as benchmark data. The experimental results on these datasets are shown in Table 2.
Table 2 The prediction results of BEROLECMI on commonly used datasets

Table 2 shows that, across the commonly used datasets in the field of CMI prediction, the AUROC of BEROLECMI exceeds 90% on the CMI-9589 dataset and 75% on the CMI-753 dataset, indicating that the proposed model can effectively complete the CMI prediction task on commonly used datasets and is a promising candidate tool for CMI prediction.

The validity of the model feature extraction

In this section, we verify the effectiveness of each part of the feature extraction in BEROLECMI through independent experiments. Specifically, we divide BEROLECMI into three modules: sequence feature extraction (BE-A), self-similarity feature extraction (BE-B), and structured embedding (BE-C). We then use each module separately for feature extraction and perform the prediction task to evaluate the effectiveness of each module. The experimental results are recorded in Table 3. To facilitate comparison, the data in Table 3 are also visualized as histograms in Fig. 6.
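Before turning to the results, this ablation can be illustrated with a short sketch that evaluates each feature block in isolation and then their concatenation, reusing the five_fold_cv helper from the earlier sketch. The block dimensions below are arbitrary random placeholders, not the actual feature sizes of BEROLECMI.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 1000

# Placeholder feature blocks for each circRNA-miRNA pair; in the real model these
# come from the sequence (BE-A), self-similarity (BE-B), and structured-embedding
# (BE-C) feature extraction described in the Methods section.
modules = {
    "BE-A": rng.normal(size=(n_pairs, 64)),
    "BE-B": rng.normal(size=(n_pairs, 32)),
    "BE-C": rng.normal(size=(n_pairs, 128)),
}
y = rng.integers(0, 2, size=n_pairs)

# Each module alone, then the full concatenation (the complete BEROLECMI feature set).
for name, block in modules.items():
    print(name, five_fold_cv(block, y))
print("BE-A + BE-B + BE-C", five_fold_cv(np.hstack(list(modules.values())), y))
```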
Table 3 Prediction results of different modules of BEROLECMI

Fig. 6 Performance comparison of different modules

Table 3 shows that every feature module of BEROLECMI can complete the CMI prediction task on its own, which demonstrates the effectiveness of the feature extraction strategy of the proposed method. Among the three modules, the sequence feature extraction module yields the lowest prediction results, suggesting that sequence features serve as a useful complement to the model features; the self-similarity features and the structured embeddings achieve high predictive results, which indicates that combining the functional similarity assumption with role-defined structural embeddings can effectively improve the predictive performance of the model. By organically integrating the three feature extraction modules, we achieve the highest model prediction performance, verifying the effectiveness of the model construction.

Optimal classification strategy

In this study, we conduct prediction tasks with different classifiers to determine the best classification strategy for the proposed method. In prediction tasks on the CMI-9905 dataset, we use LightGBM [42], Random Forest (RF) [43], Logistic Regression (LR) [44], Support Vector Machine (SVM) [45], and Linear Regression (LinR) [46] for CMI prediction, and the best classification strategy is selected by comparing their performance. The prediction results are shown in Fig. 7.

Fig. 7 Comparison of prediction results of different classifiers (A: AUC results; B: AUPR results)

Fig. 7 shows that the model using the LightGBM classifier achieves the highest predictive performance. LightGBM (Light Gradient Boosting Machine) [34] is an ensemble learning classifier based on Gradient Boosting Decision Trees (GBDT) that performs well in prediction tasks. It builds a powerful prediction model by iteratively training multiple weak learners: the loss function is optimized repeatedly, each iteration fits a new decision tree to the residuals of the previous model, and the individual trees are combined to produce the final prediction. LightGBM offers high efficiency, low memory consumption, good accuracy, and support for large-scale datasets. Based on this comparison, we choose LightGBM as the final classification strategy of the model.

Compared with the existing models

To evaluate the advantages of BEROLECMI in CMI prediction, we compare the proposed model with other models in the CMI prediction field on the three commonly used datasets.

Lan et al. proposed the NECMA model, which combines circRNA-miRNA associations, circRNA Gaussian kernel similarity, and miRNA Gaussian kernel similarity to construct a heterogeneous network, uses the matrix-factorization-based NetMF algorithm to extract hidden features from the heterogeneous network, and finally obtains the circRNA-miRNA association probability through weighted neighborhood-regularized logistic matrix factorization and inner products [47]. Qian et al. proposed the CMIVGSD model, which uses a singular value decomposition algorithm and a variational autoencoder to extract linear and nonlinear features from the circRNA-miRNA interaction network to predict unknown circRNA-miRNA interactions [32]. Wang et al. proposed the KGDCMI model, which combines RNA sequence features and CMI network behavior features to predict unknown CMIs.
Yu et al. proposed SGCNCMI, the first comprehensive circRNA prediction model, which uses a graph neural network based on a contribution mechanism to aggregate multi-modal molecular information in biological networks and can simultaneously predict circRNA-miRNA interactions, circRNA-gene interactions, and circRNA-cancer associations [30]. Guo et al. proposed the WSCD model, which processes RNA sequences with the word2vec algorithm from natural language processing, extracts behavior features from the CMI network with the SDNE algorithm, and finally predicts CMIs with a deep neural network [29]. He et al. proposed the GCNCMI model, which uses graph convolutional neural networks to aggregate node information to predict potential circRNA-miRNA interactions. Wang et al. proposed the JSNDCMI model, which for the first time combines denoising methods with local topological structure information in the CMI network for molecular feature extraction to predict unknown CMIs [34]. Yao et al. proposed the IIMCCCMA model, which combines matrix factorization with an improved inductive matrix completion algorithm to predict unknown CMIs [48]. Wang et al. proposed BioDGW-CMI, which combines BERT and wavelet diffusion to extract sequence and association-network structure information of RNA molecules to predict potential CMIs [35]. These models have achieved exciting results in CMI prediction, and we compare BEROLECMI with them to demonstrate the superior performance of the proposed model.

The comparison data are recorded in Table 4. It is worth noting that all comparisons in this study use the same data and validation methods as the corresponding models, and the number of compared models exceeds 70% of all models in the field of CMI prediction.
Table 4 Model performance comparison with different CMI prediction models. Data for the CMI-753 dataset come from the work of Yao et al. [48].

Table 4 shows that BEROLECMI surpasses all known models in the performance comparison on all three datasets. In addition, we conduct a paired t-test on the CMI-9905 dataset against known advanced models in the CMI field to evaluate the statistical difference between the proposed model and the state-of-the-art (SOTA) models. The results are recorded in Table 5.
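Such a paired t-test can be computed directly from the per-fold scores of the two models on the same CV splits; a minimal sketch is shown below, where the per-fold AUC values are placeholders rather than the published results.

```python
from scipy.stats import ttest_rel

# Per-fold AUCs obtained on the same five CV splits (placeholder numbers, not the paper's results).
berolecmi_auc = [0.912, 0.908, 0.911, 0.909, 0.915]
baseline_auc  = [0.897, 0.890, 0.899, 0.893, 0.896]

t_stat, p_value = ttest_rel(berolecmi_auc, baseline_auc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A p value below 0.05 indicates a statistically significant difference
# between the two models under the paired test.
```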
Table 5 Paired t-test results of BEROLECMI and other models under five-fold cross-validation

Table 5 shows that, in the comparison of the proposed method with the SOTA models, all P values fall below the 0.05 significance level, which means that the difference between the proposed model and the comparison models is statistically significant and that its better predictive performance is statistically supported. These results indicate that the proposed method is currently among the most competitive in the field of CMI prediction.

Case study

To verify the practicability of BEROLECMI, we conduct a case study based on the CMI-753 dataset. The data in this dataset are all manually collected from existing literature and studies, and all are supported by experiments. In the case study, we perform a prediction task on 15 pairs of CMIs to simulate the performance of the proposed model on unknown CMIs. Specifically, we remove the 15 interacting pairs selected for the case study from CMI-753 and then use the remaining 738 known pairs of CMIs for model training to predict the held-out pairs. The prediction results are recorded in Table 6.
Table 6 The prediction results of the case study

Table 6 shows that, among the 15 pairs of CMIs used for prediction, 14 pairs are successfully predicted, which means that BEROLECMI can effectively predict CMIs in real cases and is expected to be a powerful tool for pre-selecting candidates for wet-lab experiments.
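A hold-out experiment of this kind can be sketched as follows, reusing the helpers from the earlier sketches. The pair_features function, the pair lists, and the RNA name sets are hypothetical placeholders standing in for the CMI-753 data and the feature pipeline of BEROLECMI.

```python
import numpy as np
from lightgbm import LGBMClassifier


def case_study(known_pairs, held_out_pairs, circRNAs, miRNAs, pair_features):
    """Train on all known CMIs except the held-out case-study pairs, then
    score the held-out pairs (hypothetical sketch, not the released code)."""
    train_pos = [p for p in known_pairs if p not in held_out_pairs]

    # For CMI-753, negatives are simply unreported pairs, so no circBank filter is applied.
    train_neg = sample_negative_pairs(circRNAs, miRNAs, train_pos, circbank_scores={})

    X = np.vstack([pair_features(p) for p in train_pos + train_neg])
    y = np.array([1] * len(train_pos) + [0] * len(train_neg))
    clf = LGBMClassifier(n_estimators=500).fit(X, y)

    # A predicted interaction probability >= 0.5 counts as a successful prediction.
    return {p: clf.predict_proba(pair_features(p).reshape(1, -1))[0, 1]
            for p in held_out_pairs}
```

Training on the 738 remaining pairs and scoring the 15 held-out pairs in this way mimics the setting reported in Table 6.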
