Influence of hyperparameters and training strategies

To investigate the effect of the oversampling rate \(\alpha\) on model performance, we performed cross-validation on the training set using various oversampling rates. Our experiments involved nine different \(\alpha\) values across five classifiers. In general, each classifier exhibited varied validation performance across \(\alpha\) values, and a fixed \(\alpha\) value also produced diverse results across classifiers. As shown in Fig. 2, the highest accuracy (0.9757) was achieved with an \(\alpha\) of 0.4 paired with the Logistic Regression (LR) classifier. Conversely, the lowest accuracy (0.8771) occurred with an \(\alpha\) of 0.1 paired with the Gradient Boosting (GB) classifier, a difference of approximately 14.3%. Notably, Random Forest (RF) was the most affected by variations in \(\alpha\), with a performance difference of 9.49%. In contrast, Gaussian Naïve Bayes (GNB) was the least affected, showing a difference of only 1.27% between its best (0.9539) and worst (0.9412) results. The performance drops for the other classifiers relative to their best results were 4.48% for Support Vector Machine (SVM), 6.34% for GB, and 1.82% for LR. This suggests that either too few synthetic samples (\(\alpha\) close to 1) or too many (\(\alpha\) close to 0) can degrade performance. Based on these validation results, unless otherwise specified, we set \(\alpha\) to 0.4 for the following analyses.

Fig. 2 Impact of the oversampling rate \(\alpha\). The horizontal axis is the oversampling rate, the vertical axis is the validation accuracy, and the lines indicate the different classifiers.

Ablation study

The data block construction and the oversampling process are the two major components of our approach. To study their effectiveness, we created two variants of the scheme: one without the data block structure and another without the oversampling process.
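The \(\alpha\) cross-validation sweep described above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the oversampling step is a placeholder (random duplication of minority samples) standing in for the WGAN-GP generator, the mapping of \(\alpha\) to the number of synthetic samples is an assumption consistent with the text (\(\alpha\) near 1 adds few samples), and the data, \(\alpha\) grid, and classifier set are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def oversample(X, y, alpha, rng):
    """Placeholder for the WGAN-GP generator: duplicate minority samples
    until the minority class reaches (1 - alpha) of the majority size,
    so alpha near 1 adds few synthetic samples and alpha near 0 adds many."""
    pos, neg = X[y == 1], X[y == 0]
    target = int(len(neg) * (1 - alpha))
    if target > len(pos):
        extra = pos[rng.integers(0, len(pos), target - len(pos))]
        X = np.vstack([X, extra])
        y = np.concatenate([y, np.ones(len(extra), dtype=int)])
    return X, y

rng = np.random.default_rng(0)
# Imbalanced toy data: ~90% negatives, ~10% positives.
X, y = make_classification(n_samples=600, weights=[0.9], random_state=0)

results = {}
for alpha in [0.1, 0.2, 0.4, 0.8]:
    # For brevity this sketch oversamples before the CV split; a rigorous
    # run would oversample inside each training fold only.
    Xb, yb = oversample(X, y, alpha, rng)
    for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                      ("GNB", GaussianNB())]:
        results[(alpha, name)] = cross_val_score(clf, Xb, yb, cv=5).mean()

best_alpha, best_clf = max(results, key=results.get)
print(f"best: alpha={best_alpha}, clf={best_clf}, "
      f"acc={results[(best_alpha, best_clf)]:.4f}")
```

The same grid-search loop extends directly to nine \(\alpha\) values and five classifiers, as in the experiments reported here.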
These two variants were used to train the five machine learning models. We then evaluated their performance using four metrics to assess the impact of the data block construction and oversampling on overall model performance. Figure 3 depicts the results of the ablation study.

Fig. 3 Overview of the ablation study.

In general, our complete scheme exhibits superior predictive performance across all five classifiers, whereas removing the data block structure or proper oversampling degrades performance to varying extents. Among the classifiers, GNB is the least sensitive to these components, with reductions of 1.76% (MCC), 0.7% (F1 score), and 3.1% (recall) when the data block construction is removed, and 7.5% (MCC), 3.4% (F1 score), and 2.7% (recall) when WGAN-GP oversampling is missing. In contrast, the other four classifiers are far more sensitive to the absence of the data block construction. For instance, GB experiences drops of 26.9% in accuracy, 60.2% in recall, 44.6% in F1 score, and 42.3% in MCC. Similarly, LR shows decreases of 16.9% (accuracy), 40.9% (recall), 24.3% (F1 score), and 27.0% (MCC); RF faces reductions of 19.4% (accuracy), 45.8% (recall), 28.9% (F1 score), and 30.7% (MCC); and SVM shows declines of 26.5% (accuracy), 59.8% (recall), 44.3% (F1 score), and 41.7% (MCC). When the oversampling process is absent, the situation is more moderate: MCC drops by 5.3% for GB, 5.9% for SVM, 10.2% for RF, and 6.4% for LR, with similar declines in the other metrics. These results demonstrate that merely generating a large number of synthetic samples is ineffective for highly imbalanced classification problems. In contrast, our proposed scheme, which integrates the benefits of both oversampling and undersampling, achieves better performance.
By incorporating data blocks, it significantly reduces the need for synthetic samples, preserves the original sample distribution, and effectively expands the dataset. The data block construction also facilitates the use of multiple weak classifiers, enabling ensemble techniques that capture diverse patterns in the data, which reduces individual model bias and improves generalization.

A total of 61 weights were saved based on the preset parameters. The additional CTST results for these 61 weights are displayed in Fig. 4, sorted by variance. After 200 repetitions of the experiment, slight differences were observed in the mean values of all generator test results. The majority of the generator test results still adhere to the preset hyperparameter criterion, falling between 0.498 and 0.502, as denoted by the green dots in Fig. 4. A few generators produced test results that deviate from the preset criterion yet remain within the acceptable range for downstream tasks (0.4 to 0.498 and 0.502 to 0.6), indicated by the orange dots; these generators also show increased variance. Finally, a few generators, represented by the red dots, exhibit test results deviating significantly from the 0.5 criterion, with an absolute difference exceeding 0.1, alongside higher variance. The samples generated by these generators could introduce bias into the classifier's training. These results affirm the importance of incorporating the additional CTST to refine the selection process.

Fig. 4 CTST results over the 61 saved model weights, sorted by variance.

WGAN-GP successfully generated high-quality positive samples

Overall, the samples synthesized by WGAN-GP exhibited strong alignment with the original positive samples. We incorporated a supplementary CTST after each training epoch. As shown in Fig.
5, at the beginning of generator training, the generated samples diverged significantly from the distribution of positive samples, resulting in a CTST accuracy of 1.0; the CTST classifier could fully discriminate between the generated and positive samples. Around the \(200^{th}\) epoch, the generator began capturing the distinctive feature distribution of the positive samples. Consequently, an overlap emerged in the feature space between the generated and positive samples, reducing the CTST accuracy to approximately 0.7. As training progressed, by the \(500^{th}\) epoch, this overlapping region expanded further, and the CTST accuracy diminished to about 0.6. After over 800 additional training iterations, WGAN-GP successfully internalized the genetic feature distribution of authentic positive samples: the CTST accuracy decreased to 0.5 and fluctuated around this value, signifying the generator's ability to match the positive sample distribution. Additionally, the training loss curve indicated that the generator attained a local optimum as the CTST accuracy reached 0.5.

Furthermore, independent CTST evaluations of each saved generator model revealed that more than half of the generators consistently maintain a classifier accuracy of around 0.5. This result highlights the effectiveness of WGAN-GP in producing high-quality synthetic samples, as illustrated in Fig. 5.

Fig. 5 Distribution changes between positive and synthetic samples: (a) epoch 0; (b) epoch 200; (c) epoch 500; (d) epoch 800.

While WGAN-GP partially addressed the instability of GAN training by incorporating gradient penalties, proper neural network design and hyperparameter tuning remained crucial for successful training. In addition to visualizing samples at various epochs, observing changes in loss during GAN training proved effective27. For WGAN-GP, the generator's loss reflected the disparity between generated and real samples.
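The classifier two-sample test (CTST) used above can be sketched minimally as follows: train a classifier to separate real from synthetic samples; a held-out accuracy near 0.5 means the two distributions are indistinguishable, while accuracy near 1.0 means they are easily told apart. Gaussian toy data stands in here for the gene-expression features, and the logistic-regression test classifier is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def ctst_accuracy(real, fake, seed=0):
    """Classifier two-sample test: held-out accuracy of a classifier
    trained to distinguish real samples (label 1) from fake ones (label 0)."""
    X = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return clf.score(Xte, yte)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 10))
good_fake = rng.normal(0.0, 1.0, size=(500, 10))  # same distribution
bad_fake = rng.normal(3.0, 1.0, size=(500, 10))   # clearly shifted

print("good generator CTST:", ctst_accuracy(real, good_fake))  # near 0.5
print("bad generator CTST:", ctst_accuracy(real, bad_fake))    # near 1.0
```

In the experiments above, the same statistic is computed after each training epoch, so its decline from 1.0 toward 0.5 traces the generator's convergence.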
As shown in Fig. 6, the generator initially exhibited poor fitting ability, and its loss rose early in training. After adversarial training against the discriminator, the generator started capturing the underlying sample distribution, causing the loss to decrease. As training progressed, the generator's loss began to fluctuate, indicating the intermediate stage. Around 600–800 epochs, the generator and discriminator reached a Nash equilibrium in the minimax game, as indicated by the consistent reduction and eventual stabilization of the loss. This suggests the generative adversarial network had reached an approximately optimal state.

Fig. 6 WGAN-GP training loss over 5000 epochs. The blue line indicates the discriminator loss and the orange line the generator loss.

Performance comparison

As shown in Table 2, our approach consistently exhibits superior prediction performance across most classifier combinations. For instance, when paired with the SVM classifier, our method improves accuracy by 16.01%, recall by 38.81%, F1 by 23.34%, and MCC by 25.14% compared to the second-best approach. Similarly, the combination with the GB classifier achieves respective improvements of 13.77%, 33.91%, 19.35%, and 21.87%. When combined with the RF classifier, our approach surpasses the second-best model by 16.83%, 40.58%, 24.26%, and 26.73%, and for the LR classifier it outperforms the runner-up by 11.69%, 30.39%, 15.92%, and 18.89%. Across most performance metrics, our scheme attains top-tier results, securing the highest accuracy (0.9320), F1 (0.9354), and MCC (0.9327) when combined with LR. However, with the GNB classifier, the RandomOversample scheme achieves the best performance, with an average accuracy of 0.9183, a recall of 1.0, an F1 of 0.9261, and an MCC of 0.8505. This suggests that the RandomOversample method is more inclined to identify positive samples.
Nonetheless, our scheme remains competitive, producing accuracy, recall, F1, and MCC scores that closely approach those of the leading scheme, with only minor differences.

Notably, some approaches encountered a significant problem, misclassifying all test set samples as negative (see ref. 32 and Table 3). This caused their average accuracy, recall, F1, and MCC values to drop to 0.5, 0, 0, and 0, respectively, rendering these approaches unusable. Among the three compared schemes, the combination of SMOTE with SVM exhibits such invalid values, and RandomOversample encounters the same issue when combined with SVM. While BorderlineSMOTE mitigated the problem, it still underperformed compared to our approach. In contrast, our method demonstrated better data compatibility and was more effective at handling highly imbalanced gene expression data.

Table 2 Accuracy, recall, F1, and MCC for the four approaches.

To investigate the differences in predictive performance between samples synthesized by WGAN-GP and those from other schemes, we substituted WGAN-GP with three oversampling schemes, creating three new variants of our approach. These variants balanced the data blocks rather than the entire dataset before training, following the same methodology as our proposed approach. A scenario with no oversampling was used as the baseline for comparison.

Table 3 shows the test results for the various oversampling schemes combined with data block construction. Generally, samples generated by WGAN-GP outperform the four compared schemes across most metrics and classifiers, achieving the highest overall scores in accuracy, recall, F1, and MCC. Specifically, with the SVM classifier, WGAN-GP attained the highest accuracy (0.9232), recall (0.9342), F1 (0.9230), and MCC (0.8487); with the GB classifier, it achieved the highest accuracy (0.9269), F1 (0.9266), and MCC (0.8562).
When using the RF classifier, WGAN-GP achieved the highest accuracy (0.9315), recall (0.9321), F1 (0.9322), and MCC (0.8650); with the LR classifier, it reached the highest accuracy (0.9320), precision (0.9327), F1 (0.9328), and MCC (0.8665); and with the GNB classifier, it achieved the highest recall (0.9785), F1 (0.9178), and MCC (0.8324). Among these results, the WGAN-GP with LR combination attained the highest overall accuracy, F1, and MCC scores, while the combination with GNB achieved the highest recall. The highest overall precision was obtained by the combination of SMOTE with GB.

Table 3 Comparison of model performance with data block construction.

In addition to common oversampling methods, we also compared other, more sophisticated oversampling schemes. ModCGAN39,40 takes a different approach to ensuring the quality of synthetic samples: after the GAN is trained, an MLP classifier is trained to assess whether the generated samples meet the required quality. Similarly, ACGAN41 builds on CGAN by forcing the discriminator to reconstruct CGAN's auxiliary information, thereby regulating the quality of the synthetic samples. We integrated these methods with data blocks to compare the impact of synthetic sample quality on predicting RNA methylation-related genes.

Although ModCGAN and ACGAN perform similarly to our method across the five metrics and five classifiers, our method shows a 1–5% improvement on most metrics. The key to ModCGAN lies in using an external classifier to distinguish between real samples and noise, thereby regulating the generator's outputs. However, ModCGAN assumes that noise follows only a joint distribution of uniform and Gaussian distributions. For large datasets, as in this study, the larger set of negative samples may overlap with the Gaussian or uniform distribution, leading the external classifier to exclude some qualified synthetic samples.
Although this study does not use GANs to generate majority class samples, a small number of positive samples might also overlap with the predefined noise distribution. It is therefore more reasonable to use the distinguishability between synthetic and real samples as the standard for assessing synthetic sample quality.

It is worth noting that the WGAN-GP oversampling method, when combined with different classifiers, achieves the highest MCC and F1 scores for each respective classifier. Since the MCC metric is commonly used to evaluate binary classification models, particularly on imbalanced datasets, this suggests that WGAN-GP generates more diverse samples, thereby reducing the classifier's bias towards the majority class.

GO terms and KEGG pathway enrichment analyses

With \(\alpha\) set to 0.4, each classifier retained 116 models. The average score of these models for each gene was taken as the classifier's final score. High-confidence genes identified by each classifier were then used for GO term enrichment analysis. High-confidence genes are defined as the top 1% of genes ranked by confidence; given the number of genes in our dataset, each classifier typically yields 270 high-confidence genes. However, since GNB tends to overestimate confidence for many genes, we considered the top 1500 genes in GNB's confidence ranking as high-confidence (beyond rank 1500, the other classifiers typically assign confidence levels below 0.5). Figure 7 compares the top 20 GO terms ranked by each classifier. Overall, the predictions from the five classifiers exhibit strong consistency. GO term enrichment analysis indicates that the high-confidence genes are significantly enriched for RNA methylation functions. The terms with the highest enrichment include methylation, RNA modification, RNA splicing, and RNA localization.
Other terms related to rRNA, mRNA, tRNA, and ncRNA are also present in the enrichment analysis, with over 200 genes enriched for these GO terms. This confirms that the predicted genes are closely associated with the RNA methylation process.

Supplementary Figure 2 shows the KEGG pathway enrichment analysis of the top 1% confidence genes predicted by the five classifiers. Despite GNB's tendency to assign high confidence to most genes, the pathway enrichment analyses of the five models showed strong similarity. All five enrichment reports highlighted pathways strongly related to RNA methylation, such as Spliceosome, mRNA surveillance pathway, RNA polymerase, and RNA degradation. The enrichment counts ranged from 20 to 40, and all P-values were below 0.05, indicating high confidence in these items. Notably, the KEGG enrichment report from the GB classifier erroneously included a pathway related to Coronavirus disease (COVID-19), which did not appear for the other classifiers. The count and P-value for this item were much lower than those of the RNA methylation-related pathways, further indicating that the models consistently predict genes involved in RNA methylation.

Fig. 7 GO term enrichment analyses of high-confidence genes.

Functional insights into the new predictions

To gain deeper insight into the role of the newly predicted genes in the RNA methylation process, we selected the top 1% confidence genes (predicted by SVM) and constructed a PPI network using the STRING database42. This network included 188 newly predicted genes and 76 known RNA methylation-related genes, forming a densely interactive PPI network comprising 4093 edges. Using the Louvain43 community detection algorithm, we identified six tightly connected communities within the PPI network, as shown in Supplementary Figure 3.

Community 0 consists of 71 genes, including 48 known RNA methylation-related genes and 23 newly predicted genes.
Functional analysis indicates significant enrichment for terms such as tRNA N2-guanine methylation (GO:0002940, P = 3.53e−06), rRNA 2′-O-methylation (GO:0000451, P = 3.53e−06), tRNA N1-guanine methylation (GO:0002939, P = 0.00025), tRNA (guanine-N7)-methylation (GO:0106004, P = 0.0180), and rRNA (guanine-N7)-methylation (GO:0070476, P = 0.0180).

Community 1 consists of 18 genes, including only one known RNA methylation-related gene, FDXACB1. Nevertheless, functional enrichment analysis shows significant enrichment for terms such as histidyl-tRNA aminoacylation (GO:0006427, P = 0.0018), phenylalanyl-tRNA aminoacylation (GO:0006432, P = 0.0044), tRNA aminoacylation for mitochondrial protein translation (GO:0070127, P = 0.0154), aminoacyl-tRNA ligase activity (GO:0004812, P = 1.19e−13), and RNA adenylyltransferase activity (GO:1990817, P = 0.0046).

Community 2 consists of 22 genes, including 20 newly predicted genes; HSD17B10 and RSAD1 in this community are known RNA methylation-related genes. Functional enrichment analysis suggests significant enrichment for terms such as tRNA 5′-leader removal (GO:0001682, P = 1.49e−16), tRNA 5′-end processing (GO:0099116, P = 1.41e−18), tRNA splicing via endonucleolytic cleavage and ligation (GO:0006388, P = 1.48e−06), and tRNA-specific ribonuclease activity (GO:0004549).

Community 3 consists of 80 genes, including 7 known RNA methylation-related genes (CMTR1, CMTR2, LARP7, MEPCE, RNGTT, RNMT, TGS1) and 73 newly predicted genes.
GO enrichment analysis shows significant enrichment for terms such as cap1 mRNA methylation (GO:0097309, P = 0.0280), 7-methylguanosine RNA capping (GO:0009452, P = 1.02e−08), RNA 5′-cap (guanine-N7)-methylation (GO:0106005, P = 0.433), 7-methylguanosine mRNA capping (GO:0006370, P = 5.86e−07), mRNA (nucleoside-2′-O-)-methyltransferase activity (GO:0004483, P = 0.0264), mRNA methyltransferase activity (GO:0008174, P = 0.0099), snRNA binding (GO:0017069, P = 1.69e−06), and O-methyltransferase activity (GO:0008171, P = 0.0479).

Community 4 consists of 16 genes, including 12 known RNA methylation-related genes and 4 newly predicted genes. Functional enrichment analysis reveals significant enrichment for terms such as snRNA (adenine-N6)-methylation (GO:0120049, P = 0.0014), S-adenosylmethionine biosynthetic process (GO:0006556, P = 0.0033), mRNA methylation (GO:0080009, P = 9.06e−17), mRNA (2′-O-methyladenosine-N6-)-methyltransferase activity (GO:0016422, P = 6.17e−06), RNA (adenine-N6-)-methyltransferase activity (GO:0008988, P = 6.17e−06), mRNA methyltransferase activity (GO:0008174, P = 5.16e−07), and RNA N6-methyladenosine methyltransferase complex (GO:0036396, P = 6.55e−16).

Community 5 consists of 57 genes, including 6 known RNA methylation-related genes (CEBPZ, DIMT1, EMG1, FBL, FBLL1, TFB2M) and 51 newly predicted genes.
Functional enrichment analysis indicates significant enrichment for terms such as box H/ACA RNA 3′-end processing (GO:0000495, P = 0.0074), U5 snRNA 3′-end processing (GO:0034476, P = 0.0119), U1 snRNA 3′-end processing (GO:0034473, P = 0.0119), histone-glutamine methyltransferase activity (GO:1990259, P = 0.0073), rRNA (adenine-N6,N6-)-dimethyltransferase activity (GO:0000179, P = 0.0158), snoRNA binding (GO:0030515, P = 9.05e−11), 3′-5′-exoribonuclease activity (GO:0000175, P = 8.02e−09), rRNA methyltransferase activity (GO:0008649, P = 4.28e−06), catalytic activity acting on RNA (GO:0140098, P = 6.20e−20), and N-methyltransferase activity (GO:0008170, P = 0.0242).

Presuming that the top 20 false-positive genes by confidence ranking represent new predictions of the model, existing research reveals that these genes play critical roles in various RNA- or methylation-related biological processes. DDX56 (score = 0.979), a member of the DEAD-box RNA helicase family, plays a crucial role in several RNA-related biological processes44. METTL18 (score = 0.971) has been reported to be closely related to METTL1745; METTL18 is a methyltransferase, while METTL17 regulates mitochondrial ribosomal RNA modification45. PUS3 (score = 0.969) is involved in RNA pseudouridylation in humans46 and has also been studied in relation to methylation and sepsis47. PRPF4B (score = 0.967), a member of the Clk/Sty kinase family, encodes the pre-mRNA processing factor 4 kinase (PRP4K)48. PAPOLG (score = 0.965), a poly(A) polymerase (PAP) family member, plays a key role in mRNA stability and translational modifications49. EXOSC2 (score = 0.962) provides catalytic activity to the RNA exosome50. PPIG (score = 0.955) interacts directly with the phosphorylated RNA Pol II carboxy-terminal domain (CTD) via its RS domains both in vivo and in vitro51. TRUB2 (score = 0.953) is involved in mitochondrial mRNA pseudouridylation, regulating 16S rRNA and mitochondrial translation52.
DHX38 (score = 0.946) encodes the RNA helicase PRP16, essential for disrupting the U2-U6 helix I between the first and second catalytic steps of splicing53. SF3B1 (score = 0.945) plays a central role in the RNA splicing function of SF3b54. UTP23 (score = 0.945) is a trans-acting factor involved in the early assembly of 18S rRNA55. CHTOP (score = 0.944) binds competitively with the arginine methyltransferases PRMT1 and PRMT5, promoting the asymmetric or symmetric methylation of arginine residues56. DDX47 (score = 0.944) plays a role in pre-rRNA processing57. RRP9 (score = 0.942) is involved in the maturation of ribosomal RNA, ensuring proper ribosome formation and function. HARS2 (score = 0.941) encodes the mitochondrial histidyl-tRNA synthetase (mt-HisRS)58. UPF2 (score = 0.940) plays a role in mRNA degradation59. DDX46 (score = 0.938) recruits ALKBH5, an eraser of the RNA modification N6-methyladenosine (m6A), through its DEAD helicase domain to demethylate m6A-modified antiviral transcripts60.

Notably, recent research61 reported that the paralog of CSTF2T (score = 0.941), CSTF2, acts as a mediator of m6A deposition, thereby regulating mRNA m6A modification. This suggests that CSTF2T may play a similar role in the m6A methylation process.

Influence of highly informative features

In this study, we used a dataset (the full set) filtered for zero values and variance. This dataset consists of 26,936 genes and 1517 features, including 62 GO annotation, 1110 GTEx, 39 TISSUES, 162 PathCommons, 108 HPA, 1 InterPro, and 35 BioGPS features. While GO annotation and InterPro features are informative, they may bias the model's predictions towards pre-existing methylation-related annotations. To investigate their impact, we removed all GO and InterPro features from the dataset, resulting in a reduced set with 1454 features.
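The construction of the reduced set can be sketched as a column filter over the feature matrix. The column-naming convention below (a source prefix such as `GO:` or `InterPro:`) is an assumption for illustration; only the per-source feature counts come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix whose column names carry a source prefix;
# the counts per source match the 1517-feature full set described above.
rng = np.random.default_rng(0)
cols = ([f"GO:{i}" for i in range(62)]
        + [f"GTEx:{i}" for i in range(1110)]
        + [f"TISSUES:{i}" for i in range(39)]
        + [f"PathCommons:{i}" for i in range(162)]
        + [f"HPA:{i}" for i in range(108)]
        + ["InterPro:0"]
        + [f"BioGPS:{i}" for i in range(35)])
full = pd.DataFrame(rng.normal(size=(10, len(cols))), columns=cols)

# Drop every GO and InterPro column to obtain the reduced set.
drop = [c for c in full.columns if c.startswith(("GO:", "InterPro:"))]
reduced = full.drop(columns=drop)
print(full.shape[1], "->", reduced.shape[1])  # 1517 -> 1454
```

Removing the 62 GO and 1 InterPro columns from the 1517-feature full set leaves exactly the 1454 features of the reduced set.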
We retrained the classifiers (starting from the generator's training) using the reduced set and calculated scores for all genes across the different models. The probability score distribution curves are shown in Supplementary Figure 4. The prediction trends across all five classifiers were very similar: the classifiers tended to assign low confidence to most genes. In the low-confidence interval (score < 0.5), the peak for the reduced set without GO and InterPro features shifted slightly to the right. This indicates that the reduced set slightly increased confidence in the low-confidence interval, and thus that the GO and InterPro features helped prevent the classifiers from blindly assigning high confidence to most genes. As shown in Supplementary Figure 5 and Supplementary Table 1, although the test results declined after training on the reduced set, GO enrichment analysis still revealed a strong association with RNA methylation. This suggests that, while the reduced set may lower the evaluation metrics, it does not strongly bias the model's predictions.

Additionally, we conducted a feature importance analysis based on the LR model. Supplementary Figure 6 shows the importance rankings of the top 50 features in both the full set and the reduced set. In the full set, GO (26/50) and InterPro (1/50) features accounted for over half (27/50) of the top positions, and their importance scores were significantly higher than those of the GTEx and HPA features, indicating that the model was more heavily influenced by GO and InterPro features than by other types. In the top 50 ranking of the reduced set (without GO and InterPro features), GTEx and HPA features appeared most frequently. This is likely because GTEx and HPA provide crucial insights into gene expression across tissues, which can help identify tissue-specific patterns of RNA methylation.
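One common way to derive LR-based feature importances, sketched here under the assumption that importance is taken as the absolute value of each fitted coefficient on standardized inputs (the paper does not specify its exact procedure), is the following; the data and the two "truly predictive" features are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, d = 400, 20
X = rng.normal(size=(n, d))
# Make features 0 and 5 truly predictive so they should rank highly.
y = (2.0 * X[:, 0] - 1.5 * X[:, 5] + 0.3 * rng.normal(size=n) > 0).astype(int)

# Standardize so coefficient magnitudes are comparable across features.
Xs = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(Xs, y)

# Importance = |coefficient| on standardized features; take the top 5.
importance = np.abs(clf.coef_[0])
top5 = np.argsort(importance)[::-1][:5]
print("top-5 features by |coef|:", top5.tolist())
```

Applied to the full and reduced sets, the same ranking step would surface the top 50 features per set, as reported in Supplementary Figure 6.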