BigSur – leveraging gene correlations in single cell transcriptomic data

Single-cell RNA sequencing (scRNAseq) lets researchers measure gene expression in thousands of individual cells at once, showing which genes are active in each cell and how strongly. Analyzing these data is difficult, however. The measurements contain a great deal of “noise” (random fluctuations and measurement error) that makes it hard to separate biologically meaningful signal from background chatter.
The Challenge with scRNAseq Data
Three sources of variation are mixed together in scRNAseq data. The first is cell heterogeneity: genuine differences between cells in their gene expression profiles. This is the signal researchers are actually interested in. The second is transcriptional noise, the random background static that arises as cells temporarily turn genes on and off at different rates. The third is sampling error, or Poisson noise, which arises because collecting and sequencing RNA captures only a fraction of the RNA molecules in each cell.
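To make the three sources concrete, here is a minimal simulation sketch for a single gene. It is an illustration only, not code from BigSur: the function name `simulate_gene_counts`, the two-group design, the lognormal form of the transcriptional noise, and the default parameter values are all assumptions chosen for the example.

```python
import numpy as np

def simulate_gene_counts(n_cells=2000, group_means=(0.5, 5.0), cv=0.5, seed=0):
    """Simulate counts for one gene with three layers of variation.

    All names and defaults are hypothetical, chosen for illustration.
    """
    rng = np.random.default_rng(seed)

    # 1. Cell heterogeneity: two cell groups with different true mean expression.
    group = rng.integers(0, 2, size=n_cells)
    true_mean = np.where(group == 0, group_means[0], group_means[1])

    # 2. Transcriptional noise: a lognormal multiplier with mean 1 and
    #    coefficient of variation `cv` (an assumed noise model).
    sigma2 = np.log1p(cv ** 2)
    noise = rng.lognormal(mean=-sigma2 / 2, sigma=np.sqrt(sigma2), size=n_cells)

    # 3. Sampling error: Poisson draws around the noisy expression level.
    counts = rng.poisson(true_mean * noise)
    return group, counts
```

Calling `simulate_gene_counts()` returns the group label and the observed count for each cell. Comparing the counts within one group against the counts across both groups gives a feel for how much of the spread is noise and how much is real heterogeneity.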
Researchers have been developing tools to filter out this noise so they can focus on the real, meaningful differences between cells. However, many current methods rely on rules or “thresholds” that are chosen somewhat arbitrarily, which can lead to inaccurate results. These methods also often require “normalization,” a step that adjusts the data for differences in how much RNA was captured and sequenced from each cell. This step can distort the data, especially where gene activity is very low.
A New Method: BigSur
BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads) is a new method developed by scientists at the University of California, Irvine to address these challenges. Instead of normalizing the data, BigSur works with the unnormalized counts. It models scRNAseq data as a combination of three components: the real differences between cells (cell heterogeneity), the random fluctuations in gene activity (transcriptional noise), and errors from the sampling process (Poisson noise). From this model, BigSur calculates p-values, which measure how likely a result at least as extreme as the one observed would be if chance alone were at work, without introducing the distortions caused by normalization.
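The idea of scoring raw counts against a noise model can be sketched as follows. This is a simplified illustration, not BigSur's implementation: it assumes each gene's expected count in a cell is proportional to that cell's total count, and that the null variance is mu * (1 + cv^2 * mu), meaning Poisson noise plus transcriptional noise with coefficient of variation `cv`. The function names and the default `cv` are hypothetical, and the published method applies further corrections that are not shown here.

```python
import numpy as np

def pearson_residuals(counts, cv=0.5):
    """Residuals of raw counts (cells x genes) under an assumed null model."""
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)

    # Expected count if a gene's share of each cell's RNA were constant.
    mu = cell_totals * gene_totals / counts.sum()

    # Assumed null variance: Poisson term plus transcriptional-noise term.
    var = mu * (1.0 + cv ** 2 * mu)

    # Genes or cells with no counts get a residual of zero.
    return np.divide(counts - mu, np.sqrt(var),
                     out=np.zeros_like(mu), where=var > 0)

def dispersion_score(counts, cv=0.5):
    """Mean squared residual per gene; close to 1 when the null model fits."""
    r = pearson_residuals(counts, cv)
    return (r ** 2).sum(axis=0) / (r.shape[0] - 1)
```

Under these assumptions, a gene whose score is well above 1 varies more than noise alone would explain, which makes it a candidate for reflecting real differences between cells. No per-cell rescaling or log transform is applied to the counts at any point.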
BigSur also improves two downstream tasks: selecting the informative genes (features) used to cluster cells into groups by their gene expression patterns, and finding connections between genes, known as gene-gene correlations. These correlations help scientists figure out how genes work together in biological processes.
Testing BigSur
The researchers behind BigSur tested it on simulated data, which is created to mimic real-life scRNAseq experiments but with known outcomes. This allowed them to see whether BigSur could correctly identify the relationships between genes, even when those relationships were weak or subtle. The results were promising: BigSur was able to find significant gene-gene correlations that other methods might have missed.
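Testing against data with a known answer can be illustrated with a generic permutation test. This is a swapped-in, textbook technique used here only to show what a p-value for a gene-gene correlation means; it is not how BigSur computes its p-values. The function name and defaults are hypothetical.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]

    # Shuffling y breaks any real link between the genes, so the shuffled
    # correlations show what chance alone produces.
    hits = 0
    for _ in range(n_perm):
        shuffled = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(shuffled) >= abs(observed):
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

To use it, simulate two genes whose counts either share a hidden driver or are drawn independently, then check whether the p-value separates the two cases. A permutation test like this becomes slow when applied to every pair among thousands of genes.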
Then, the researchers applied BigSur to real data from human melanoma cells. Using BigSur, they were able to identify thousands of gene-gene correlations. When they grouped these correlated genes together, they found that the clusters aligned with known cellular components and biological processes. Even more exciting, they discovered new, potentially important relationships between genes that hadn’t been noticed before.
Figure: Grouping into “metacells” creates false positive correlations

(A–C) Analysis of synthetic, uncorrelated data. (A) Pipelines for data processing. Each box denotes a step at which a Pearson correlation coefficient (PCC or PCC′) was calculated, with the color of the box corresponding to the colors used in the following plots. (B) Histograms of correlation coefficients obtained from the data sets in panel (A). Arrows show the thresholds above which observed correlations were judged statistically significant. (C) Numbers of correlations in panel (B) that were judged to be significant (p < 0.02). (D–F) Analysis of melanoma cell line data. (D) Pipeline for data processing. The colors of each box correspond to the colors in panels (E, F). (E) Number of correlations judged significant in the data sets in (D). Darker shading denotes negative correlations; lighter shading denotes positive correlations. (F) Enrichment for known protein–protein interactions among the correlations shown in (E). The value of n in each case gives the absolute number of protein–protein interactions. Of the 12,129 interactions detected by BigSur, 11,373 were also detected using grouping of log-normalized data and 10,773 were detected using grouping of modified corrected Pearson residuals (10,324 were shared among all three).
Why BigSur is Important
BigSur’s approach opens up new possibilities for scRNAseq research. By providing a statistically grounded method for identifying gene-gene correlations, it allows scientists to gain deeper insights into gene regulatory networks—how genes control and influence each other in cells. This could lead to better understanding of diseases like cancer, where changes in gene regulation often play a key role.
In short, BigSur helps scientists cut through the noise in scRNAseq data, making it easier to identify important gene interactions. With this tool, researchers can explore new frontiers in cell biology, uncovering how cells function and paving the way for future discoveries in areas like personalized medicine and gene therapy.
Availability – R and Mathematica implementations are available under the Lander lab profile on GitHub (https://github.com/landerlabcode/).
