Enhancing RNA-seq analysis by addressing all co-existing biases


RNA sequencing (RNA-seq) helps scientists study which genes are active in cells by reading the RNA molecules they produce. However, RNA-seq data often contains biases: systematic distortions that arise from sample preparation or sequencing technology. These biases skew the results and make it harder for researchers to interpret gene activity accurately. To address this problem, scientists at the Chinese Academy of Sciences have developed a framework called Minimum Free Energy–based Gaussian Self-Benchmarking (MFE-GSB), which uses a mathematical model to detect and correct biases and so improve the quality of RNA-seq data.
In RNA-seq analysis, sequences are divided into k-mers, short stretches of k nucleotides (the study works with 50-mers). Ideally, these k-mers would be evenly represented in the data, meaning every stretch of RNA would appear in proportion to how often it is found in the original sample. In reality, the k-mers are not evenly distributed. Some appear too often, while others are underrepresented, due to factors like sample preparation or sequencing errors. These imbalances are what we call biases. Added up across a transcript, these small distortions can change the final interpretation of the data. This is where MFE-GSB comes in.
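The idea of over- and underrepresented k-mers can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy sequences, and the choice to scale the even-coverage model to the observed total are all assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer found in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def representation_ratios(transcript, reads, k):
    """Compare observed k-mer counts with an even-coverage model.

    Assumption: under even coverage, each k-mer position in the
    transcript is sampled equally often, so the model count of a k-mer
    is the number of positions at which it occurs in the transcript.
    """
    model = kmer_counts([transcript], k)
    observed = kmer_counts(reads, k)
    # Scale the model so that its total equals the observed total.
    scale = sum(observed.values()) / sum(model.values())
    # Ratio > 1: overrepresented; ratio < 1: underrepresented.
    return {kmer: observed[kmer] / (scale * n) for kmer, n in model.items()}


# Toy data (hypothetical): one short transcript and four reads.
toy_transcript = "ACGTACGT"
toy_reads = ["ACGT", "ACGT", "ACGT", "CGTA"]
toy_ratios = representation_ratios(toy_transcript, toy_reads, k=4)
```

In this toy case `ACGT` and `CGTA` come out overrepresented, while `GTAC` and `TACG` are never observed and get a ratio of 0.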
MFE-GSB corrects these biases by comparing two things: a reference model in which k-mers are uniformly distributed (what the ideal data should look like) and the observed RNA-seq data, with its real-world biases and uneven k-mer counts. The goal is to bring the observed data as close to the reference model as possible. To do this, k-mers are grouped by their minimum free energy (MFE), the free energy of the most stable structure a sequence is predicted to fold into, and the aggregate counts per MFE value are described with a Gaussian function, the familiar bell curve that describes how values are distributed around a mean. In the MFE-GSB framework, the mean and standard deviation are calculated from the ideal k-mer model, and these values are used to adjust the real RNA-seq data so that it matches the expected distribution. Because the adjustment is then applied down to individual k-mers, biases are corrected at the finest level, making the data more reliable for further analysis.
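The correction step can be sketched as follows, grouping k-mers by MFE value as in the figure overview. This is a simplified illustration under stated assumptions, not the published implementation: the Gaussian parameters are estimated here from count-weighted moments rather than by curve fitting, the function names are hypothetical, and the numbers are made up.

```python
import math


def gaussian_parameters(mfe_values, model_counts):
    """Estimate mean and SD of the MFE distribution from the ideal model.

    Simplification: count-weighted moments stand in for a fitted curve.
    """
    total = sum(model_counts)
    mean = sum(m * c for m, c in zip(mfe_values, model_counts)) / total
    variance = sum(
        c * (m - mean) ** 2 for m, c in zip(mfe_values, model_counts)
    ) / total
    return mean, math.sqrt(variance)


def gaussian_targets(mfe_values, mean, sd, total_count):
    """Expected count per MFE category under the Gaussian benchmark,
    scaled so that the targets sum to the observed total."""
    shape = [math.exp(-((m - mean) ** 2) / (2 * sd ** 2)) for m in mfe_values]
    scale = total_count / sum(shape)
    return [scale * s for s in shape]


def calibration_factors(observed_counts, targets):
    """Multiplier that moves each observed category count onto its target."""
    return [t / o if o > 0 else 0.0 for o, t in zip(observed_counts, targets)]


# Toy data (hypothetical): five MFE categories in kcal/mol.
mfe_values = [-4.0, -3.0, -2.0, -1.0, 0.0]
model_counts = [1, 4, 6, 4, 1]      # aggregate counts, even-coverage model
observed_counts = [2, 3, 6, 5, 1]   # aggregate counts, sequencing data

mean, sd = gaussian_parameters(mfe_values, model_counts)
targets = gaussian_targets(mfe_values, mean, sd, sum(observed_counts))
factors = calibration_factors(observed_counts, targets)
calibrated = [o * f for o, f in zip(observed_counts, factors)]
```

In a sketch like this, every k-mer would be multiplied by the factor of the MFE category it belongs to, which is how a category-level fit turns into a correction for individual k-mers.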
Overview of an MFE-GSB approach for adjusting natural transcript bias in sequencing data analysis

(a) The MFE-GSB method refines k-mer modeling counts from natural transcripts by sorting them by MFE and fitting them to a Gaussian model. This approach sets the core parameters used to adjust 50-mer sequencing counts, revealing inherent biases through the discrepancies between predicted and actual counts. (b) Sequencing counts, GC content, and MFE values for 50-mers starting from the 5′ end of the human USF2-201 transcript (ENST00000222305). The panel also shows the distribution of the 50-mers under the assumption of even coverage. (c) Linear regression on the 50-mer sequences from USF2 reveals an inverse linear relationship between the 39 distinct GC-content values and the 241 distinct MFE values. (d) The aggregate counts for the 241 unique MFE values, drawn from the even 50-mer distribution, were fitted with a Gaussian distribution function, yielding the mean, SD, and coefficient of determination (R^2). (e) The sequencing counts of 50-mers, categorized by their MFE values in the actual sequencing data, were aligned to a Gaussian distribution function defined by the parameters from (d). The calibration required for each MFE category is indicated by directional arrows. (f) Comparison of the original sequencing counts with the GC-content-based and MFE-based calibrated counts for individual 50-mers across the transcript.
The researchers validated the MFE-GSB framework on two types of data: engineered RNA constructs (designed RNA samples) and human tissue samples (real biological data). The results showed that MFE-GSB is highly effective across both simple and complex datasets, making it a powerful tool for RNA-seq studies.
By correcting biases in RNA-seq data, MFE-GSB helps researchers obtain more accurate and reliable results, which is essential for studies in fields such as cancer research, developmental biology, and drug discovery. Because the data are adjusted at the single k-mer level, scientists can interpret their results with greater confidence. The framework gives researchers a robust tool for improving data accuracy and advancing our understanding of complex biological processes.
Availability – The code used for data analysis is available at https://github.com/QiangSu/MFE-GSB.
