TriFusion enables accurate prediction of miRNA-disease association by a tri-channel fusion neural network

Overview of TriFusion

The main framework of TriFusion comprises four parts: (1) feature extraction for miRNAs and diseases; (2) encoding of high-level representations for miRNAs and diseases via a tri-channel feature encoder; (3) fusion of features for miRNAs and diseases via a feature fusion encoder; and (4) prediction of miRNA-disease associations.

Since similar miRNAs (diseases) often have close associative properties, we first construct multiple types of similarity matrices for miRNAs (diseases). For diseases, both semantic similarity and Gaussian similarity are used to measure the similarity of two diseases. Semantic similarity is defined based on the hierarchical relations of the diseases, and Gaussian similarity is defined based on their interactions with miRNAs. For miRNAs, the similarity of two miRNAs is described by three types: sequence similarity, functional similarity, and Gaussian similarity. Sequence similarity is defined based on the similarity of their sequences, functional similarity is defined based on the similarity of their functions, and Gaussian similarity for miRNAs is defined through their interactions with diseases. The extracted similarity matrices serve as the original feature matrices for miRNAs and diseases.

To comprehensively learn the association patterns between miRNAs and diseases, TriFusion develops a tri-channel feature encoder to encode the representations of miRNAs and diseases at different levels, including low-order graph encoding, high-order hypergraph encoding, and miRNA-disease interaction encoding. The direct relationships of an miRNA (disease) with its neighboring miRNAs (diseases) can effectively characterize the miRNA-disease association patterns. The low-order graph encoding channel of the tri-channel module is designed to calculate the representations of miRNAs (diseases) by message passing between miRNAs (diseases) and their multi-order neighbors.
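As a rough illustration of the message passing between nodes and their multi-order neighbors described above, the following NumPy sketch propagates features over a similarity graph for a few hops. It is not the TriFusion implementation: the function names, the similarity threshold used to build edges, the symmetric normalization, and the averaging over hops are all assumptions made for the example, and learnable weights are omitted.

```python
import numpy as np

def normalize_adjacency(sim, threshold=0.5):
    """Build a symmetric-normalized adjacency with self-loops.

    Assumption: an edge exists where similarity >= threshold.
    """
    adj = (sim >= threshold).astype(float)
    np.fill_diagonal(adj, 1.0)          # self-loops keep each node's own features
    deg = adj.sum(axis=1)               # >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def multi_order_encode(sim, features, num_hops=2, threshold=0.5):
    """Average the features propagated over 0..num_hops hops (no learned weights)."""
    a_hat = normalize_adjacency(sim, threshold)
    hop = features
    layers = [hop]
    for _ in range(num_hops):
        hop = a_hat @ hop               # one round of neighbor aggregation
        layers.append(hop)
    return np.mean(layers, axis=0)
```

In this sketch a node with no neighbors above the threshold keeps its input features unchanged, while connected nodes are smoothed toward their neighborhood; a trained model would interleave such propagation with learned linear maps and nonlinearities.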
The high-level relationships between two miRNAs (diseases) hidden in their common neighbors can also effectively describe the association patterns. The high-order hypergraph encoding channel learns the representations by a hypergraph convolution on the constructed hypergraph of miRNAs (diseases). The relationships between target miRNAs and diseases contain inherent association information to measure their association patterns. The miRNA-disease interaction encoding channel can effectively capture the association representations by encoding the degrees and neighbor similarities of the nodes in the constructed miRNA-disease heterogeneous graph.

The three representations learned by the tri-channel encoder describe the miRNA-disease association patterns at different levels, so they should be carefully fused to generate a complete representation. To achieve this, we design a feature fusion encoder that consists of a biased Transformer encoder with an embedded residual connection, followed by a multi-layer graph convolution. The final classification is conducted by fusing the representations of an miRNA and a disease through a Hadamard product and then deriving an miRNA-disease association probability via a multi-layer MLP.

Experimental settings

To evaluate each method, we conduct 5-fold cross-validation tests on the HMDD v3.2 database in three different ways, each serving a different purpose.

Random zero cross-validation

All known miRNA-disease associations are considered as positive samples, which are randomly divided into five non-overlapping subsets. During each iteration of cross-validation, one subset is chosen as the test set, complemented by an equal number of randomly selected negative samples. The remaining positive and negative samples serve as the training set.
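The random zero cross-validation split described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `random_zero_cv_folds`, the seed handling, the choice to draw one balanced pool of negatives up front and divide it across folds, and the zeroing of held-out positives in the training matrix are assumptions, since the text does not spell out these details.

```python
import numpy as np

def random_zero_cv_folds(assoc, n_splits=5, seed=0):
    """Yield (train_assoc, train_pairs, train_labels, test_pairs, test_labels).

    assoc is a binary miRNA x disease matrix (1 = known association).
    Assumes there are at least as many zero entries as one entries.
    """
    rng = np.random.default_rng(seed)
    pos = np.argwhere(assoc == 1)
    neg = np.argwhere(assoc == 0)
    pos = pos[rng.permutation(len(pos))]
    # Assumption: sample as many negatives as positives, without replacement.
    neg = neg[rng.choice(len(neg), size=len(pos), replace=False)]
    pos_folds = np.array_split(pos, n_splits)
    neg_folds = np.array_split(neg, n_splits)
    for k in range(n_splits):
        test_pairs = np.vstack([pos_folds[k], neg_folds[k]])
        test_labels = np.r_[np.ones(len(pos_folds[k])), np.zeros(len(neg_folds[k]))]
        train_pos = np.vstack([f for i, f in enumerate(pos_folds) if i != k])
        train_neg = np.vstack([f for i, f in enumerate(neg_folds) if i != k])
        train_pairs = np.vstack([train_pos, train_neg])
        train_labels = np.r_[np.ones(len(train_pos)), np.zeros(len(train_neg))]
        # Hide the held-out positives so they cannot leak into training features.
        train_assoc = assoc.copy()
        train_assoc[pos_folds[k][:, 0], pos_folds[k][:, 1]] = 0
        yield train_assoc, train_pairs, train_labels, test_pairs, test_labels
```

Each test fold produced this way contains equal numbers of positive and negative pairs, and every known association appears in exactly one test fold.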
This process, known as random zero cross-validation, evaluates the capacity of a model to identify undetected miRNA-disease associations.

Random multi-column zero cross-validation

Given the miRNA-disease association matrix, the test set is generated by randomly selecting and zeroing out 1/5 of the columns in this matrix, with the training set based on the remaining 4/5 of the columns. In addition, an equivalent number of randomly selected negative samples is added for balance. This process aims to test the effectiveness of a model in discovering the associations between known miRNAs and new diseases.

Random multi-row zero cross-validation

Similar to the above, the test set is generated by randomly selecting and zeroing out 1/5 of the rows in this matrix, with the training set based on the remaining 4/5 of the rows. This process aims to test the effectiveness of a model in discovering the associations between new miRNAs and known diseases.

State-of-the-art methods including MINIMDA [19], MD-former [34], DAEMDA [36], AGAEMD [37], AMHMDA [38], and ELMDA [39] are collected for comparison with TriFusion. In this study, six common evaluation metrics are used to evaluate the performance of a model, namely area under the ROC curve (AUC), area under the PR curve (AUPR), accuracy (ACC), F1 score, precision, and recall (see Supplementary Note 1 for detailed definitions of the metrics).

TriFusion shows the best performance

We compare the performance of TriFusion with the above six leading miRNA-disease association prediction methods on the same test set under the three types of cross-validation. According to the evaluation results, TriFusion outperforms all the compared methods across all the tests.

Random zero cross-validation

The comparison results of Random Zero Cross-Validation are shown in Fig. 2 (see Supplementary Table 1 for detailed results). Among the compared methods, ELMDA and AGAEMD are machine learning-based models, while the others are based on deep learning.
We find that deep learning methods show better performance than machine learning models, with both AUC and AUPR exceeding 94% (see Supplementary Fig. 1). Specifically, MINIMDA, which applies improved graph convolution to encode node information, achieves a very high AUC value of 94.97%, lower only than that of TriFusion. MD-former, which extracts features from heterogeneous graphs through random walks, obtains the second-highest AUPR value of 94.75%. Among these models, only TriFusion achieves both AUC and AUPR exceeding 95% (with its AUC and AUPR being 95.41% and 95.25%, respectively). Compared to these models, the relative increases in AUC and AUPR of TriFusion range from 0.47% to 3.97% and from 0.53% to 4.30%, respectively. Moreover, the recall of TriFusion exceeds 90%, an improvement of 2.01% over the second-best method. To further illustrate the significance of the improvement, we select MD-former, the model with the second-best overall performance, and TriFusion, run each 10 times, and perform an independent-samples t-test. The p-values for the tests based on AUC and AUPR are all smaller than 1e-10, indicating the significance of the improvement made by TriFusion (see Supplementary Table 2 for details).

Fig. 2: Comparison of TriFusion with other methods under three types of validations. This figure displays the values of AUC and AUPR of all the compared methods under three types of cross-validation conditions: Random Zero Cross-Validation, Random Multi-Column Zero Cross-Validation, and Random Multi-Row Zero Cross-Validation.

Random multi-column zero cross-validation

The comparison results of Random Multi-Column Zero Cross-Validation are shown in Fig. 2 (see Supplementary Table 3 for detailed results). It is observed that most deep learning models again show much better performance than machine learning-based methods, with the AUC and AUPR values reaching over 90%.
It is worth noting that, compared to other models, the AUC improvement of TriFusion ranges from 1.02% to 8.92%, and its AUPR improvement ranges from 1.20% to 8.39%, which demonstrates that TriFusion can better predict the associations between known miRNAs and new diseases.

Random multi-row zero cross-validation

Performance evaluation is also conducted by Random Multi-Row Zero Cross-Validation, and the results are shown in Fig. 2 (see Supplementary Table 3 for detailed results). We find that TriFusion consistently performs better than all the other compared methods, with both AUC and AUPR exceeding 94%. Specifically, its AUC reaches 94.30%, with its improvement over the other methods ranging from 1.10% to 7.73%, and its AUPR reaches 94.01%, with an improvement ranging from 1.73% to 7.74%. This indicates that TriFusion is better at predicting associations between new miRNAs and known diseases.

Ablation study

To measure the impact of the tri-channel feature encoder, each channel of the tri-channel feature encoder, and the feature fusion encoder, we conduct ablation experiments by removing certain encoding modules from the TriFusion model. Only one component is removed or altered in each experiment.

Impact of the tri-channel feature encoder

To examine the influence of this encoder, we bypass it and feed the extracted similarity data of miRNAs and diseases through a fully connected layer directly into the feature fusion encoder, which results in a significant decrease in performance (Fig. 3). This indicates that the tri-channel approach is able to extract effective multi-level miRNA-disease association information, which contributes substantially to accurate association prediction.

Fig. 3: Results of the ablation experiments. This figure illustrates the results of several ablation experiments.
These two figures show the performance of TriFusion with several modules removed.

Impact of each encoding channel

To further explore the impact of each channel, we conduct three experiments by respectively removing the graph convolution module, the hypergraph convolution module, and the miRNA-disease interaction encoding module. Results show that the performance clearly declines in all three experiments (see Fig. 3). It is worth noting that the impact of removing any single channel is much smaller than that of removing the whole tri-channel feature encoder (see Fig. 3), which indicates that any two channels among the three can capture most association features, and the application of all three channels achieves the best feature representations.

Impact of the feature fusion encoder

The feature fusion encoder contains two parts: the biased Transformer and the GCN. First, we simply add the three different kinds of features obtained by the tri-channel encoder and input the features directly into the classification module, which results in a great decline in performance (see Fig. 3). Next, to individually test the role of the biased Transformer module, we input the representations obtained from the tri-channel feature encoder directly into the GCN part for prediction, again resulting in a great decrease (see Fig. 3). This indicates that the biased Transformer encoder plays a crucial role in learning the complete representations of miRNAs and diseases. To further test the contribution of the GCN module, we remove it by inputting the fused representations directly into the classification module, and results show that the performance of TriFusion also declines (Fig. 3).

Impact of the number of GCN layers

To assess the impact of the number of GCN layers within the feature fusion encoder on the overall predictive performance of the model, we carry out experiments with 2, 4, 6, 8, and 10 GCN layers. The experimental results, as shown in Fig.
4, indicate that the model performs best when the GCN has 6 layers.

Fig. 4: Results of the ablation experiments. These two figures show the AUC and AUPR values for different numbers of GCN layers within the feature fusion encoder.

Interpretation of the TriFusion model

To better understand how TriFusion captures the miRNA-disease association patterns, we interpret it in two ways. Firstly, we extract all the learned representations from the test set at successive training stages and visualize their 2-dimensional projections via the t-SNE tool (Fig. 5). The projections show that TriFusion gradually learns the association patterns, and the separation between associations and non-associations becomes increasingly clear as training proceeds. Secondly, to verify what and how each module of TriFusion is learning, we respectively visualize the 2-dimensional projections of the representations learned from the tri-channel feature encoder, each of the three channels, and the feature fusion encoder. The visualization results show that each module learns the miRNA-disease association patterns in a different manner. Notably, in the interaction encoding channel, the associations and non-associations do not appear well separated. However, over 80% of the samples lie near the center, and these are well classified.

Fig. 5: Interpretation experiments of TriFusion. A The three figures illustrate the TriFusion training process, with blue points indicating positive samples and red points indicating negative samples.
B The four figures display the visualization results of the learned representations from each of the three channels in the tri-channel feature encoder as well as the Transformer module in the feature fusion encoder, where green points represent positive samples and orange points represent negative samples.

Case studies

In this section, we conduct case studies on three types of cancer (ovarian cancer, breast cancer, and prostate cancer) to demonstrate the prediction capability of TriFusion. We use all known positive associations in HMDD v3.2, a total of 12,446 positive associations, as the positive training set. From the remaining unknown samples, we randomly select an equal number of samples as negatives and add them to the training set. After training, we obtain an 853 × 591 association prediction matrix, where the score at (i, j) represents the predicted association value between miRNA i and disease j. We then index the k-th column corresponding to the target disease, remove all known positive associations in the k-th column, and select the 50 highest-scoring entries from the remaining ones. After that, we verify the associations of these top 50 predicted miRNAs against two other miRNA–disease association datasets, dbDEMC [40] and HMDD v4.0 [41] (Fig. 6).

Fig. 6: Validation results for the top 50 miRNAs associated with three types of cancers (ovarian cancer, breast cancer, and prostate cancer) predicted by TriFusion. Green lines indicate that the corresponding associations have been validated, while red lines denote associations that have not yet been validated.

Ovarian cancer poses a serious risk to women’s health. However, its early detection is quite difficult because there are currently no clear early symptoms and no screening methods proven to be effective. Fortunately, in ovarian cancer patients, the presence of miR-148b is as high as 92.21%, which makes it a key indicator for detecting the disease early [42].
In this case, all the top 50 miRNAs associated with ovarian cancer predicted by TriFusion are confirmed in the dbDEMC database, with the detailed verification listed in Supplementary Table 4.

Breast cancer is among the most common cancers in women, accounting for approximately 25% of all cancer cases in females and presenting a significant threat to life. Recent studies indicate that in patients with breast cancer, the levels of certain miRNAs such as hsa-miR-126 and hsa-miR-10b are reduced in their tissues [43]. This provides a new method for the early detection of this type of cancer. In this case, except for hsa-miR-181a-1 and hsa-miR-153-1, which lack supporting data, the datasets have validated all of the top 50 miRNAs associated with breast cancer predicted by TriFusion. For specific verification details, refer to Supplementary Table 5.

Prostate cancer is a leading type of cancer and the second leading cause of cancer-related deaths in men. It is especially prevalent in those over seventy, ranking as the third most common urological tumor. Current studies highlight a clear link between the serum miRNA expression patterns in prostate cancer and the severity of the tumor. Notably, variations are observed in 156 miRNAs: miR-16 and miR-141 levels are decreased in patients with prostatic hyperplasia and throughout various prostate cancer stages, whereas miR-34 levels are found to increase under the same conditions [44]. In this case, all the top 50 miRNAs predicted to be associated with prostate cancer by TriFusion, except for hsa-miR-181a-1, hsa-miR-138-1, and hsa-miR-337, which have no supporting data, are again confirmed in the datasets. Specific verification can be found in Supplementary Table 6.

In summary, TriFusion demonstrates excellent performance in the above case studies.
Specifically, the top 30 predicted miRNAs for each of the three diseases are all validated, and TriFusion achieves a prediction accuracy of 96.7% for the top 50 miRNAs. These findings highlight the effectiveness of TriFusion in predicting miRNA-disease associations and its great potential in identifying new biomarkers and therapeutic targets.
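The candidate-selection step of the case studies (take the column of the target disease, discard known associations, keep the highest-scoring miRNAs) can be sketched in a few lines of NumPy. The function names `top_k_candidates` and `validated_fraction` are hypothetical and introduced only for this illustration; the source does not describe its implementation.

```python
import numpy as np

def top_k_candidates(scores, known, disease_idx, k=50):
    """Return row indices of the k highest-scoring miRNAs not already known.

    scores: predicted miRNA x disease matrix; known: binary matrix of the same shape.
    """
    col = scores[:, disease_idx].astype(float).copy()
    mask = known[:, disease_idx] == 1
    col[mask] = -np.inf                       # exclude known associations
    order = np.argsort(-col, kind="stable")   # descending score, ties keep row order
    n_candidates = int((~mask).sum())
    return order[:min(k, n_candidates)]

def validated_fraction(candidates, validated):
    """Fraction of candidate miRNA indices found in an external validated set."""
    candidates = list(candidates)
    if not candidates:
        return 0.0
    return sum(int(c) in validated for c in candidates) / len(candidates)
```

With the counts reported above (50 of 50 validated for ovarian cancer, 48 of 50 for breast cancer, and 47 of 50 for prostate cancer), the pooled fraction is 145/150 ≈ 96.7%, which matches the stated accuracy.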
