Leveraging gene correlations in single cell transcriptomic data | BMC Bioinformatics

The significance of gene–gene correlationsThe statistical significance of correlations is rarely discussed because, for many common kinds of data—those that are continuous and at least approximately normal in distribution—the magnitude of correlation and its significance are related in a simple way that depends only on the number of measurements, and not the data distributions. Owing to Fisher [42], for any Pearson correlation coefficient (PCC), the p value (probability of observing |PCC|≥ x by chance) may be estimated as \(Erfc\left[ {\sqrt {\left( {n – 3} \right)/2}\ arctanh\left( {\left| x \right|} \right)} \right]\), where n is the number of samples, and Erfc is the complement of the Error function (we refer to this expression henceforth as the “Fisher formula”).Because scRNAseq data are both discrete and generally not normally distributed, p values obtained using the Fisher formula cannot be accepted as accurate, but just how far off will they be? Fig. 1 explores that question through simulation. Assuming Poisson-distributed data, and gene expression vectors of 500 cells in length, the formula does quite poorly for vectors of mean < 1, i.e., where the expected proportion of zeros exceeds 37%, assigning p values that are too low for positive correlations, and too high for negative ones (Fig. 1A). For distributions with mean ≥ 1 (fewer than 37% zeros on average), the formula does reasonably well down to p values as low as 10−4 but deviates progressively thereafter. This degree of accuracy would be a problem for any genome-wide analysis of correlations: To analyze pair-wise correlations among m genes one must test m(m − 1)/2 hypotheses. With values of m often ≥ 12,000, this amounts to > 7 × 108 simultaneous tests, such that statistical significance of any single observation could potentially require p as low as 10−9, a value for which we may estimate, by extrapolation, that the Fisher formula is highly inaccurate even for genes with mean expression = 1.Fig. 1Relationship between Pearson correlation coefficient, vector sparsity and p value. The panels compare p values determined empirically by correlating 50,000 pairs of random, independent vectors of length 500 with p values predicted by the Fisher formula. A Data were independent random variates from Poisson distributions with means as indicated. The dashed line shows the output of the Fisher formula. B Even for vectors drawn from a distribution with mean = 1, the Fisher formula significantly mis-estimated p values smaller than 10−4. C Poor performance of the Fisher formula is worsened when data are drawn from a Poisson-log-normal distribution, rather than a Poisson distribution (in this case the underlying log-normal distribution had a coefficient of variation of 0.5). D–E Data were simulated under the scenario that gene expression is the same in every cell, but due to differences in sequencing depth, observed gene expression varies according to the depths shown in the inset to panel (D). In panel (D), true gene expression was adjusted so that observed gene expression after normalization would have a mean of 1, and the Pearson correlation coefficients (PCCs) obtained by correlating randomly chosen vectors are shown. Panel D plots empirically derived p values as a function of PCC, whereas panel E displays histograms of PCCs. Compared with data that do not require normalization, associated p values from normalized data are even more removed from the predictions of the Fisher formulaThe simulations in Fig. 1A, B assume Poisson-distributed data but, as is often pointed out, scRNAseq data are usually over-dispersed relative to the Poisson distribution (more on this below). As shown in Fig. 1C, adding in such additional variance causes simulated data to deviate even further from the Fisher formula.Further problems arise when considering that scRNAseq data always come from collections of cells with widely varying total numbers of UMI. Depending on the platform, such “sequencing depth” can vary over orders of magnitude, which is why normalization is usually considered a necessary early step in data analysis. Without normalization, it is obvious that many spurious gene–gene correlations would be detected, as any difference in sequencing depth between cells would, if not corrected for, induce positive correlation across all expressed genes.As one might expect, normalizing individual reads by scaling them to each cell’s sequencing depth eliminates this bias, restoring the expected value of PCC under the null hypothesis to zero. Yet normalization does not restore the distribution of PCCs to what it would have been had all cells been sequenced equally. The consequences can be dramatic, as shown in Fig. 1D, E, where we simulate a case in which “true” gene expression is the same in each of 500 cells, but observed gene expression is a Poisson random variate from a mean that was scaled by a factor chosen from a distribution of cell-specific sequencing depths similar to what one might observe in a typical scRNAseq experiment, using the 10X Chromium platform (inset, Fig. 1D). Despite mean gene expression being ~ 1, the relationship between PCC and p value more closely resembles the case (in the absence of sequencing depth variation) where the mean is 0.01 (Fig. 1A). This is surprising, given that the average fraction of zeros in the normalized vectors was only 0.68 (which, for Poisson-distributed data, would be expected to occur at a mean expression level of 0.39). In short, for gene expression data resembling what is typically obtained in scRNAseq, the Fisher formula is highly unsuitable, for most pairs of genes, for estimating the significance of correlations.An analytical model for the distribution of correlationsThe data in Fig. 1 indicate that the relationship between PCCs and p values is highly sensitive both to data structure and procedures intended to “correct” for technical variation. Because of this, we were concerned that a suitable null model for the distribution of correlations might be difficult to estimate empirically, especially when the true biological variation in most datasets is unknown. We therefore turned to constructing a null model analytically, attempting to account for known sources of variation (beside meaningful biological variation). The three sources considered were (1) variation introduced by imperfect normalization; (2) technical variation due to random sampling of transcripts during library preparation and sequencing; and (3) variation due to stochasticity of gene expression. The last of these cannot properly be called technical variation—fluctuating gene expression is a biological phenomenon—but like technical noise, gene expression fluctuation is usually an unwanted source of variation, and its effects need somehow to be suppressed.With regard to the first source, we follow [43] in correcting not the gene expression data points themselves, but rather their Pearson residuals. Traditionally, the Pearson residual \(P_{ij}\) for cell i and gene j, is defined as$$P_{ij} = \frac{{x_{ij} – \mu_{j} }}{{\sqrt {\mu_{j} } }}$$where \(x_{ij}\) is the gene expression value for cell i and gene j, and \(\mu_{j}\) is the mean expression of gene j averaged over all cells. As the Pearson residual is mean-centered, its expectation value, \(E[P_{ij} ]\), is zero; thus, the average of Pearson residuals for a large number of \(x_{ij}\) drawn from a single distribution should approach zero. The average of squares of Pearson residuals can be seen, by inspection, to approach the variance divided by the mean of the distribution from which the \(x_{ij}\) derive. Variance divided by mean is also known as the Fano factor and is often used to assess whether data are consistent with a Poisson distribution since, for any Poisson distribution regardless of mean, \(E[P_{ij}^{2} ] = 1\).Pearson residuals may also be used to construct PCCs, which are commonly defined in terms of variances and covariances, but may be equivalently expressed as:$${\text{PCC}}_{a \times b} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} P_{ia} P_{ib} }}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} P_{ia}^{2} \mathop \sum \nolimits_{i = 1}^{n} P_{ib}^{2} } }} = \frac{1}{{\left( {n – 1} \right)\sqrt {\phi_{a} \phi_{b} } }}\mathop \sum \limits_{i = 1}^{n} P_{ia} P_{ib}$$where \({\text{PCC}}_{a \times b}\) is the Pearson correlation coefficient between genes a and b, n the number of cells, and \(\phi_{a}\) and \(\phi_{b}\) represent the Fano factors for genes a and b, respectively, i.e.$$\phi_{j} = \frac{1}{n – 1}\mathop \sum \limits_{i = 1}^{n} P_{ij}^{2}$$Since \(E[P_{ij} ]\) = 0 for any gene, it follows that \(E[{\text{PCC}}_{a \times b} ] = 0\), as long as all the expression values for gene a are drawn from a single distribution, and those for b are independently drawn from a single distribution.However, in scRNAseq, unequal sequencing depth means that the expression values for any given gene are generally not drawn from a common distribution, but rather from one that is different for each cell. Interestingly, dividing \(x_{ij}\) in a Pearson residual by an appropriate scaling constant—i.e. normalizing the data—will restore \(E[P_{ij} ]\) to 0, but will not restore the higher moments of \(P_{ij}\), e.g., \(E[P_{ij}^{2} ] \ne 1.\) To capture the correct second moment, one must scale the value of \(\mu_{j}\) inside each Pearson residual, rather than scaling \(x_{ij}\). As [43] have pointed out, we can define a separate \(\mu_{ij}\) for each cell and gene:$$\mu_{{{\text{ij}}}} = \frac{{\sum\nolimits_{j} {x_{{{\text{ij}}}} } \sum\nolimits_{i} {x_{{{\text{ij}}}} } }}{{\sum\nolimits_{ij} {x_{{{\text{ij}}}} } }}$$and consequently, define a corrected Pearson residual as $$P_{ij} = \frac{{x_{{{\text{ij}}}} – \mu_{{{\text{ij}}}} }}{{\sqrt {\mu_{{{\text{ij}}}} } }}.$$Although this transformation does not recover moments of the Pearson residual beyond the first two, it provides a principled alternative to traditional normalization. Moreover, by permitting calculation of a corrected Fano factor that has the appropriate expectation value under the assumption of Poisson distributed data, it can be used to test that assumption, in real data.Source #2 refers to technical variation due to random, independent sampling of discrete numbers of transcripts. Although the sampling process in scRNAseq involves several discrete steps, including cell lysis, library preparation and DNA sequencing, several groups have argued that, at least at modest to low expression levels, simple Poisson “noise” can reasonably model the variation derived from these processes [35,36,37,38]. We accept this assumption here, but note that, in the following derivations, the Poisson distribution could just as easily be replaced by another known distribution, if it were adequately justified.Finally, source #3 refers to the fact that “equivalent” cells are usually only equivalent in a time-averaged sense, i.e., transcript numbers will fluctuate around some mean value. Both theory and observation support the conclusion that these fluctuations can be large [30, 44,45,46]. The actual magnitude seems to differ for different categories of genes, but data from single-molecule transcript counting [44] suggest that, for most genes, the distribution of transcripts typically is approximately log-normal (consistent with the theoretical work of [47]), with a coefficient of variation in the range of 0.2–0.6.Thus, even if library synthesis and sequencing performed identically across cells, one should not expect to observe Poisson-distributed reads. Under the reasonable assumption that gene expression fluctuations and sampling are independent, the variance of the combined process should be the sum of the variances of the composing processes. Since both processes have the same mean, we can re-state this thusly: the Fano factor of the combined process should be the sum of the Fano factors of the composing processes. One can then use this fact to further adjust the Pearson residual, and subsequently the Fano factor. Specifically, we define a “modified corrected Pearson residual” as:$$P_{{{\text{ij}}}}^{\prime } = \frac{{x_{ij} – \mu_{ij} }}{{\sqrt {\mu_{ij} \left( {1 + c_{j}^{2} \mu_{ij} } \right)} }}$$
(1)
where c represents the coefficient of variation of gene expression for gene j. Accordingly, the expectation value for \(\left( {P_{{{\text{ij}}}}^{\prime } } \right)^{2}\) becomes$$\frac{{\left\langle {\frac{{(x_{ij} – \mu_{ij} )^{2} }}{{\mu_{{{\text{ij}}}} }}} \right\rangle }}{{\left( {1 + c_{j}^{2} \mu_{{{\text{ij}}}} } \right)}}$$which resembles a Fano factor divided by \(\left( {1 + c_{j}^{2} \mu_{{{\text{ij}}}} } \right)\). Since \(c^{2} \mu_{{{\text{ij}}}}\) can alternatively be written as \(\frac{{\sigma_{{{\text{ij}}}} }}{{\mu_{{{\text{ij}}}} }}\), where s is the standard deviation of gene expression fluctuations, the term \(\left( {1 + c_{j}^{2} \mu_{{{\text{ij}}}} } \right)\) is simply the sum of the Fano factors for Poisson sampling (unity) and gene expression fluctuation (\(c_{j}^{2} \mu_{{{\text{ij}}}}\)). Dividing by that sum essentially removes the additional variance due to gene expression noise from the expectation value of \(\left( {P_{{{\text{ij}}}}^{\prime } } \right)^{2}\), restoring that value to 1.In this way, one may define a “modified corrected Fano factor” \(\phi^{\prime }\) equal to the expectation value of \(\left( {P_{{{\text{ij}}}}^{\prime } } \right)^{2}\). Likewise, we may use \(\phi^{\prime }\) to define a modified corrected Pearson correlation coefficient, PCC′:$${\text{PCC}}_{a \times b}^{\prime } = \frac{1}{{\left( {n – 1} \right)\sqrt {\phi_{a}^{\prime } \phi_{b}^{\prime } } }}\mathop \sum \limits_{i = 1}^{n} P_{ia}^{\prime } P_{ib}^{\prime } .$$
(2)
Notice that \(\phi^{\prime }\) provides a measure of the degree to which a gene’s expression is more variable than expected by chance, and PCC′ provides a metric by which gene-pairs are judged more positively or negatively correlated than expected by chance, correcting in both cases for unequal sequencing across cells (without normalizing data) and the expected noisiness of gene expression.To use these statistics in practice, one needs to know not only their expectation values but their full distributions under the null hypothesis. Constructing those analytically requires not only the coefficient of variation of gene expression noise, c, but the full distribution of that noise, which in the absence of information to the contrary, we will take to be log-normal [47], noting that any other distribution could as easily be substituted in the following discussion.We thus treat \(x_{ij}\) as a Poisson random variable from a distribution whose mean is a log-normal random variable with a coefficient of variation of c (we refer to this compound distribution as Poisson-log-normal). Although an analytical form for the probability mass function of the Poisson-log-normal distribution is not known, we may derive analytical forms for an arbitrary number of its moments, as a function of µ (the mean) and c (see Methods). Thus, one can calculate for every cell i and gene j, given the observed values of µij, the moments of the expected distributions of the \(P_{{{\text{ij}}}}^{\prime }\) under the null hypothesis, and subsequently those for the \(\left( {P_{{{\text{ij}}}}^{\prime } } \right)^{2}\). From there one can calculate the moments of any number of products and sums of \(P_{{{\text{ij}}}}^{\prime }\) and \(\left( {P_{{{\text{ij}}}}^{\prime } } \right)^{2}\), such that, eventually, the moments of \(\phi^{\prime }\) and PCC′ under the null hypothesis are ultimately obtained (see Appendix). Given a finite number of moments, one can estimate the tails of the distributions of these statistics (see Methods), allowing one to calculate the probability of extreme values of \(\phi^{\prime }\) and PCC′ arising by chance (p values).This method, which we refer to as BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), provides an approach for discovering genes that are significantly variable across cells (based on \(\phi^{\prime }\)), and gene pairs that are significantly positively or negatively correlated (based on PCC′), automatically accounting for the widely varying distributions of these statistics as a function of gene expression level and vector length (number of cells). The one free parameter in the method, c, is relatively constrained, as its average value (over all genes) can be estimated from a plot of \(\phi^{\prime }\) versus mean expression (see below). In this manner, one can avoid the use of arbitrary thresholds or cutoffs in deciding which genes are significantly “highly” variable (e.g., for dimensionality reduction and cell classification) and which genes are significantly positively and negatively correlated (e.g., to discover gene expression modules and construct regulatory networks).Performance on simulated dataIn Fig. 2 we simulate gene expression data for 1000 genes and 999 cells, under the null model described above (i.e., complete independence), varying “true” mean expression widely and uniformly over the genes, such that the most highly expressed genes average 3467 transcripts/cell and the most lowly expressed 0.0351 per cell. “Observed” gene expression values are then obtained by randomly sampling from a Poisson log-normal distribution with c = 0.5, in which the gene-specific mean is first scaled in each cell according to a pre-defined distribution of sequencing depth factors (chosen to mimic typical ranges of sequencing depth when using the 10X Chromium platform). The result is a set of gene expression vectors of length 999, with means varying between 0.001 and 231.Fig. 2Comparing uncorrected and modified corrected Fano factors and correlation coefficients. Random, independent, uncorrelated gene expression data was generated for 1000 genes in 999 cells, under the assumption that observations are random Poisson variates from a per-cell expression level that is itself a random variate of a log-normal distribution, scaled by a sequencing depth factor that is different for each cell (see methods). A Uncorrected (\(\phi\)) or modified corrected (\(\phi ^{\prime}\)) Fano factors are plotted as a function of mean expression level for each gene. Uncorrected factors were calculated either without normalization, or with default normalization (scaling observations by sequencing depth factors, learned by summing the gene expression in each cell). Uncorrected Fano factors were also calculated using SCTransform [48] as an alternative to default normalization. Modified corrected Fano factors were obtained by applying BigSur to unnormalized data, using a coefficient of variation parameter of c = 0.5. B Modified corrected Fano factors (\(\phi ^{\prime}\)) were calculated as in A, but using different values of c. The data suggest that an optimal choice of c can usually be found by examining a plot of \(\phi^{\prime}\) versus mean expression. C Empirical p values associated with uncorrected (PCC) or modified corrected (PCC′) Pearson correlation coefficients were calculated for pairwise combinations of genes in bins of different mean gene expression level (µ); examples are shown for four representative bins (both genes derived from the same bin). With increasing gene expression levels, the p value versus PCC relationship begins to approach the Fisher formula (dashed curve), but it does so much sooner for PCC′ than PCCAs shown in Fig. 2A, for most genes with mean expression greater than ~ 0.1 reads/cell, uncorrected Fano factors exceed 1, and rise linearly with expression level. Normalizing the data—scaling expression in each cell to the relative sequencing depth of that cell—reduces high-expression skewing somewhat, but also elevates the Fano factors for genes with low expression, driving them closer to 2. These results, in which the majority of genes display Fano factors greater than 1, which rise further for highly expressed genes, agree with the pattern most commonly seen in actual scRNAseq data. Values of the Fano factor for lowly expressed genes may be restored to near 1 by normalizing using SCTransform, an algorithm designed to correct for some of the variance-inflating aspects of normalization-by-scaling [48], but the presence of high Fano factors among the highly expressed genes persists.In contrast, if we calculate the modified corrected Fano factor, \(\phi^{\prime }\), for each gene, using c = 0.5, we see that values are now centered around 1 at all expression levels (Fig. 2A). Note that choosing different values of c produces consistent positive or negative skewing at mean gene expression values above 1 (Fig. 2B). Under the assumption that most genes in real data should not vary significantly across cells, one may therefore estimate the optimal choice of c for any data set by simply finding the value that minimizes such high-expression skewing of \(\phi^{\prime }\).Because it uses the moments of the Pearson residuals to calculate p values, BigSur assigns statistical significance to every gene’s \(\phi^{\prime }\). As expected, given that the data in Fig. 2A were random samples, no values of \(\phi^{\prime }\) were found to be statistically significant at a Bonferroni-corrected p value threshold of 0.05, or a Benjamini-Hochberg [49] false discovery rate of 0.05. Indeed, the lowest uncorrected p value associated with any of the 1000 genes in Fig. 2A was 0.001.Similarly, when the same synthetic data are analyzed for gene–gene correlations, one may compare the PCC values produced directly from normalized expression data with the PCC′ values produced (from unnormalized data) by BigSur. As expected, both procedures return a distribution of values with zero mean, but PCC values are more broadly distributed than PCC′. Figure 2C shows the frequency at which various values of the correlation coefficient arise when vectors with different mean gene expression are correlated with each other. Although significant skewing from the Fisher formula is apparent, especially at low values of gene expression, it is much greater for PCC than PCC′. Indeed, for moderately expressed genes (e.g., 1–10 unique molecular identifiers [UMI] per cell), only PCC′ returns values whose distribution is relatively insensitive to expression level.Performance on real dataTo characterize the performance of BigSur on real data, we used the droplet-based sequencing data of Torre et al. [50], obtained from a clonal isolate of a human melanoma cell line grown in culture. In this data set, the number of cells is large (8640), data were validated on a second sequencing platform as well as by single molecule FISH, and the broad distribution of sequencing depths was typical of droplet-based sequencing.We deliberately chose a clonal cell line because tissues always contain multiple cell types, i.e., groups of cells that express large sets of genes in cell-type specific ways. In such heterogeneous samples, genes that are associated with cell type identity will necessarily be strongly and densely correlated with each other; making the identification of correlations, in a sense, too easy—i.e., not a particularly good test of a method’s performance—and not particularly informative (one may expect to identify as correlated more or less the same genes one would find by clustering cells by gene expression and testing for differential expression between clusters).In contrast, the use of (ostensibly) homogeneous cells forces BigSur to operate on more subtle connections—for example, those involving fluctuations in function-specific gene regulatory networks—that cell clustering and differential expression would not easily detect.Accordingly, scRNAseq data from these cells were subjected to minimal pre-processing prior to analysis (see Methods), such that expression values were analyzed for 14,933 genes. Total numbers of UMI per cell varied greatly, ranging from 67 to 90,494, with 90% of cells containing between 666 and 9004 UMI.First, we compared (uncorrected) PCCs for all gene–gene pairs, calculated using default-normalized expression data (i.e., data scaled to total UMI per cell), with PCC′ values returned by BigSur (Fig. 3A, B). In panel B, frequencies are scaled logarithmically, to better display the distribution of large absolute values. BigSur associated a false discovery rate (FDR) of 2% to p values less than 1.15 × 10−4, at which threshold it detected 639,789 correlations, 350,466 of which were positive. For uncorrected PCCs, the same p value cutoff would translate, using the Fisher formula, to |PCC|> 0.041, which is displayed as a dotted line in Fig. 3B. Using this threshold, 1,484,156 correlations would be considered significant. Comparison of histogram shapes shows that use of uncorrected PCCs particularly inflates positive correlations, especially large ones, and under-counts negative ones, which is consistent with the results obtained using simulated data (Fig. 2C).Fig. 3Statistical significance of pairwise gene correlations in data from a clonal cell line. A, B Using scRNAseq data from a human melanoma cell line [50] (8640 cells × 14,933 genes), pairwise values of PCC were calculated from normalized data, and PCC′ from raw data. Histograms display the frequency of observed values (the logarithmic axis in B emphasizes low-frequency events). Notice in B how positive skewing, also seen in simulated data (Fig. 2), is less for PCC′ than PCC. Dashed lines in B show thresholds at which Fisher formula-derived p values would fall below 1.1 × 10.−4. C, D Scatterplots showing p values assigned by BigSur to pairs of genes within two representative sets of bins of gene expression (for all pairwise combinations see Figs. S1, S2/Additional files 3, 4). The abscissa shows PCC (panel C) and PCC′ (panel D). The ordinate gives the negative log10 of p values determined by BigSur, i.e., larger values mean greater statistical significance. Orange and gray shading indicate gene pairs judged significant by BigSur (FDR < 0.02). Blue and orange show gene pairs that would have been judged statistically significant by applying the Fisher formula to the PCC or PCC′, using the same p value threshold as used by BigSur. The blue region contains gene pairs judged significant by the Fisher formula only, while the unshaded region shows gene pairs not significant by either method. Numbers in the lower right corner are the total numbers of possible correlations (blue), statistically significant correlations according to the Fisher formula (green), and statistically significant correlations according to BigSur (red). E, F ROC curves assessing whether the overall performance of the Fisher formula—applied either to PCC or PCC′—can be adequately improved either by using a more stringent p value cutoff (E) or limiting pairwise gene-correlations to those involving only genes with mean expression above a threshold level, µ (F)To see how the discovery of correlations varied with gene expression level, we divided genes into 6 bins of different mean expression, and separately analyzed correlations between genes in all 21 possible combinations of bins. The full data are presented in Figs. S1 and S2 (Additional files 3, 4), with two representative panels in Fig. 3C, D. Each point represents a gene–gene pair. The value on the abscissa is either the uncorrected PCC (Figs. 3C and S1) or modified corrected PCC′ (Figs. 3D and S2), and the value on the ordinate is the − log10 p value, as determined by BigSur (i.e., the larger the number, the lower the p value). Shaded territories mark those data points that were judged statistically significant (FDR < 0.02) either by BigSur (orange and gray), or when p values were calculated by the Fisher formula (blue and orange; for further details see figure legend). The data confirm that the Fisher formula returns an excess of correlations, compared to BigSur, albeit less severely for PCC′ than for PCC. Examination of the full dataset (Figs. S1, S2) shows that many more truly significant correlations are found among highly expressed genes; and the Fisher formula performs worst when at least one of the pairs in a correlation is a lowly-expressed gene. Indeed, as gene expression becomes high (e.g., mean value > 1 UMI per cell for both genes in a pair), the distribution of p values calculated by BigSur for PCC′ begins to approximate the Fisher formula reasonably well (Fig. S2), with deviation apparent only for very low p values (− log10p > 10). This observation validates the accuracy of the method used by BigSur to recover p value distributions from the first five moments of the expected distributions of modified, corrected Pearson residuals (see “Methods” section).Assuming, for the sake of illustration, that the correlations judged significant by BigSur represent ground truth, we may then calculate levels of true- and false-positivity when p values are calculated by feeding either PCC or PCC′ into the Fisher formula. This enabled us to ask whether applying a more stringent p value cutoff, or thresholding gene expression (i.e., excluding genes below a certain expression level), might enable this simpler, formulaic approach to achieve an acceptable level of sensitivity and specificity. As shown by the receiver-operating characteristic (ROC) curves in Fig. 3E, the performance of uncorrected PCCs is exceedingly poor regardless of p value threshold, with false positives exceeding true positives at all values. PCC′ does better, but to control FDR to < 10%, one still loses the ability to detect > 85% of true positives.Arbitrarily thresholding gene expression performs somewhat better (Fig. 3F). For uncorrected PCCs, one must exclude all genes with average expression < 0.065 UMI/cell to control the FDR at 5%, which for this dataset means discarding 67% of all gene expression data, and in return recovering only 33% of true positives. PCC′ does much better here: we may recover 81% of true positives at an FDR of 5% by discarding the ~ 52% of genes with the lowest expression. While the exact numbers are likely to vary for different data sets, these observations suggest that, if one is willing to sacrifice the power to identify a substantial minority of correlations, feeding PCC′ (but not PCC) into the Fisher formula can represent an acceptable and computationally fast alternative to p value identification by BigSur.Clustering using correlationsAlthough BigSur found 639,789 statistically significant correlations in this dataset (about 0.57% of all possible pairwise correlations) the vast majority had quite small values of PCC′ (Fig. 3A, B), indicating that most statistically significant correlations were weak. To obtain a measure of correlation strength that could be compared across samples of different lengths (numbers of cells), we expressed each correlation in terms of an “equivalent” PCC, which is simply the PCC value that, for continuous, normally-distributed data of the same length, would have produced the same p value (by the Fisher formula). As shown in Fig. S3 (Additional file 5), only 4335 gene pairs displayed equivalent PCCs greater than 0.2 or less than − 0.2.Yet, despite the weak strength of most correlations, there are good reasons to believe them to be biologically relevant. One of the simplest comes from examining the frequency at which correlations arise among paralogous genes and genes that encode proteins that physically interact. It is known that gene paralogs are often co-regulated [51] leading us to expect paralog pairs to be enriched among bona fide correlations. It is also reasonable to expect that transcripts encoding proteins that interact will be co-expressed at least some of the time. As it happens, among the correlations identified by BigSur we observe ~ 12 fold enrichment for paralogs and ~ 7.5 fold enrichment for genes encoding physically-interacting proteins (see “Methods” section).To divide the full set of correlations into potentially interpretable groups, we used a random-walk algorithm [52] to identify subnetworks more highly connected internally than to other genes; we refer to these as gene communities. Most communities were of modest size, with 94 of the 96 largest containing between 4 and 392 genes each. However, the largest two contained 2160 and 1570 genes respectively, were very densely connected internally, and strongly anti-correlated with each other (Fig. 4A). These factors strongly suggest that these cells, despite being a clonal line, are heterogenous, falling into (at least) two distinct groups. Interestingly, the largest gene community contained virtually the entire set of mitochondrially-encoded mitochondrial genes, and the second largest contained virtually all protein-coding ribosomal genes (for both cytoplasmic and mitochondrial ribosomes). Using the top 50 most-highly connected genes (those with the largest positive equivalent PCCs) in each of these communities as features, we performed Leiden clustering on the modified corrected Pearson residuals of all 8640 cells, and easily subdivided them into three clusters of 5812, 1180 and 1638 cells, which we labeled as clusters 1,2 and 3, respectively (Fig. 4B; see “Methods” section).Fig. 4Clustering cells based on mitochondrial and ribosomal communities. A Correlations among the two largest gene communities detected by BigSur are shown. Vertices are genes, and edges—green and red—represent significant positive and negative correlations, respectively. Blue vertices represent members of the mitochondrially encoded gene community and brown vertices the ribosomal protein gene community (Labels have been omitted due to the large numbers of gene involved). B Using the top 50 most highly positively connected vertices in the two communities as features, cells were subjected to PCA and Leiden clustering; the three clusters recovered are displayed by UMAP. C, D After a second round of clustering of Cluster 1, the resulting four cell groups were analyzed for the distribution of expression of ribosomal protein-coding and mitochondrially-encoded genes, as a function of total UMI in each cell. The results show that the four clusters form distinct groups based on their relative abundances of ribosomal and mitochondrial genes. E Results of applying BigSur to each cluster. Note the large decrease in statistically significant correlations—especially negative correlations—in any of the clusters when compared with results obtained using all of the cells together (> 600,000 total correlations). This is consistent with heterogeneity in the original sample, causing large blocks of genes to correlate and anti-correlateWe then analyzed each cluster independently by BigSur, re-calculating PCC′ values and statistical significance for each group of cells. Surprisingly, in each cluster BigSur again found two large, strongly anti-correlating communities, one of which contained the mitochondrial-encoded genes and the other the ribosomal protein genes. This led us to further subcluster cluster 1, again using the 50 most-highly connected genes in each of these two communities as features and subdivided it into two groups of 4226 (cluster 1.1) and 1582 cells (cluster 1.2).Variable expression of mitochondrially-encoded genes is a common finding in scRNAseq. Their presence at high levels (e.g. > 25% of total UMI) is usually interpreted as an indication of a “low quality” cell—potentially one in which the plasma membrane has ruptured and cytoplasm has been lost—or perhaps a cell in the process of apoptosis [53]. Closer examination of the cell clusters identified in this dataset suggests these phenomena are likely only part of the explanation. In Fig. 4C, we plot mitochondrial-encoded UMI versus total UMI for the entire set of cells, coloring them according to the four cell clusters mentioned above. Several distinct behaviors were noted. Cell cluster 2 forms a coherent group with high mitochondrial expression that is linearly proportional to total UMI. Throughout this group, the percentage of total transcripts that is mitochondrial remains in a narrow band, with mean of 34% and coefficient of variation (CV) of 0.41. Cluster 3 displays low total UMI, and a mitochondrial fraction centered around a mean of 21% (CV = 0.43). Cluster 1.2 has high mitochondrial expression, but even higher total UMI, such that the mitochondrial fraction averages 9.7% (CV = 0.45), while cluster 1.1 has both low total UMI and low mitochondrial UMI (average mitochondrial fraction 8.2%, CV = 0.47).In Fig. 4D, we also plot total UMI derived from ribosomal protein genes against total UMI. Of all the clusters, cluster 2 best fits the expectation for “damaged” cells: Mitochondrial and ribosomal UMI rise proportionately with total UMI (consistent with randomly varying sequencing depths), but the proportion of mitochondrial UMI is almost three times higher, the proportion of ribosomal UMI almost three times lower, and total UMI about 2.5 times lower than in cluster 1.2, the only other cluster that displays a wide range of sequencing depths. Yet it is curious that the mitochondrial proportions in cluster 2 are so narrowly distributed around a mean; one might expect variable degrees of cytoplasm leakage following cell damage to produce a broad distribution beginning near cluster 1.2 and gradually tapering off. The absence of such behavior suggests that such cells are not merely variably damaged during preparation, but represent a distinct, possibly pre-existing state, perhaps associated with some form of cell stress or death (although we see no significant enrichment of gene expression associated with apoptosis [54] in cluster 2).Even if we remove cluster 2 from further analysis, clusters 1.1, 1.2 and 3 also differ between each other in relative proportions of mitochondrially-encoded, ribosomal and total transcripts; however, the way in which they do so is not suggestive of any simple mechanism of cell damage (and overall mitochondrial content is not in the range that would necessarily lead to exclusion from downstream scRNAseq analyses). As mentioned above, using the genes in the mitochondrially-encoded- and ribosomal-enriched communities as features, analysis of gene correlations using BigSur separately on each of the four clusters revealed anti-correlating communities of mitochondrially-encoded and ribosomal genes in every case (Fig. 5A–D; Table 2). In fact, even after another round of subclustering of the subclusters derived from cluster 1 (cluster 1.1 and cluster 1.2) using these genes as features, BigSur still identified distinct, anti-correlating mitochondrially-encoded and ribosomal communities (not shown). These data strongly suggest that, notwithstanding effects of cell damage or cell death, there exists among healthy cells continuous, anti-correlated variation in both the mitochondrially-encoded and ribosomal protein genes, suggestive of some biologically meaningful regulatory relationship (discussed further below).Fig. 5Mitochondrial communities are a source of strong anti-correlations. Mitochondrially-encoded and ribosomal protein gene communities were identified in all clusters. Anti-correlations identified specifically between mitochondrially-encoded genes and ribosomal protein genes within the different clusters are shown in A (cluster 1.1), B (cluster 1.2), C (cluster 2) and D (cluster 3). Panel E shows additional anticorrelations in cluster 1.2 involving mitochondrially-encoded genes and glycolysis genes For ease of readability, mitochondrial genes have been highlighted in blue. Red links refer to negative correlations; green to positiveAnalysis of gene communitiesFigure 4E summarizes the results of using BigSur to identify significant correlations (FDR < 0.02) in each of the cell clusters described above (1.1, 1.2, 2 and 3). Because statistical power to detect correlations falls with number of cells analyzed, one might expect to see the fewest significant correlations in the smallest clusters, but this was not the case. Instead, the clusters with the largest number of significant correlations were those with the highest average sequencing depth (Fig. 4C), suggesting that data sparsity has an especially strong influence on correlation detection.Schadt and colleagues have recently argued [55] that negative correlations may be considered a strong indicator of cell heterogeneity. These authors argue that minimization of the proportion of negative gene–gene correlations provides a principled metric for determining when to cease sub-clustering cells. Consistent with this view, we observed that, as cells were successively subclustered, the proportion of negative correlations fell from 45% to about 20% (Fig. 4E). The clusters with the lowest proportion of negative correlations were clusters 1.2, 2 and 3. As cluster 2 had very high levels of mitochondrially-encoded genes (33.7% of UMI), and cluster 3 had the lowest average UMI/cell (1273) of any cluster, we focused our analysis on cluster 1.2, which had both low levels of mitochondrial genes (9.7% of UMI) and high average UMI/cell (6252)..Within cluster 1.2, we found that enrichment for correlated paralog pairs and genes whose products interact physically was much higher than when all 8640 cells had been considered together. Specifically, among the positive correlations, paralog pairs were enriched 55-fold, and links supported by protein–protein interactions 32-fold, over what would have been expected by chance. Indeed, 15% of all positive correlations in cluster 1.2 corresponded to links supported by known protein–protein interactions.Table 1, Additional file 2: Table S1 and Figs. S4–S7 (Additional files 6, 7, 8, 9) summarize results for 13 of the largest gene communities detected in cluster 1.2, which collectively account for 1291 of the 2519 genes that showed any significant positive correlations. Table 1 groups the genes of each community into functional categories according to annotations found in the most recent release of the MSigDB database [56, 57]. Annotations identified using the Database for Annotation, Visualization and Integrated Discovery (DAVID [58]) are also shown in Additional file 2: Table S1. In 11 of the 13 communities (accounting for 1217 of the 1291 genes), over 50% of the genes could be associated with just one or a few functional annotations (Table 1).Table 1 Gene communities identified in cell subcluster 1.2For example, communities A and B consist primarily of genes related to the cell cycle, with more than 62% of the genes in community A known to be preferentially associated with the G2 and M phases of the cell cycle. About 71% of community B is associated with the cell cycle, the majority of these being associated with G1 and S phase. These communities were also correlated with each other, but the community-finding algorithm easily subdivided them.Most other communities are at best weakly correlated with either of the cell-cycle communities. In community C, 34% of genes encode proteins involved in the unfolded protein response and/or endoplasmic reticulum (ER) stress, and other ER proteins account for another 22% of genes. ER stress is commonly observed in cancer, and the unfolded protein response is specifically and strongly activated in melanoma cells [59,60,61].Community D contains almost all genes encoding protein components of cytoplasmic ribosomes, plus a large set of genes that regulate ribosome biogenesis or function. Overall, ribosome-related genes make up 57% of this community. Community E combines genes involved in cellular respiration, glycolysis, and oxidative and hypoxic stress, which collectively account for 78% of this community.Fifty genes in community F, nearly half the total, are associated with melanin synthesis and melanosome biogenesis, and include traditional melanocyte markers such as MLANA, PMEL, RAB27A and TYR. Another subset in community F, consisting of 32 genes, shares annotations related to markers of myeloid cell types (monocytes, macrophages, basophils, etc.) but largely consists of ubiquitously expressed genes involved in processes such as lipid biosynthesis. At least one of these genes, TRIM63, is known to be strongly associated with melanoma [62], and is a validated target of MITF [63], the primary transcription factor controlling expression of pigmentation genes.Community G contains multiple genes involved in RNA processing, including seven genes encoding members of the ubiquitously expressed heterogeneous nuclear ribonucleoprotein (HNRNP) family, as well as genes encoding protein chaperones of the HSP10, HSP40 HSP60, HSP90, and CCT families. Community H contains most of the mitochondrially-encoded genes, as well as genes involved in protein ubiquitination and the hypoxia stress response.Community I combines four genes encoding S100-family calcium binding proteins, as well as various genes associated with endosome function and lipoprotein metabolism. Unlike other communities, in this community most genes do not fall into large groups with traditional annotations. However, many of them are genes known to be strongly associated with melanoma. These include APOC1, APOD, CDKN2A, CTHRC1, FXYD3, INPP5F, MT2A, S100A4, SEMA3B, TMSB10 and VGF [64,65,66,67], suggesting this community is detecting genes co-regulated by drivers of a melanoma-specific cell state.Community J combines genes associated with the response to hypoxia with genes involved in Golgi body and lysosome function. Nearly all the genes in community K are associated with cholesterol/sterol biosynthesis and homeostasis; the only exception is PRRX1, which serves as the transcriptional co-regulator of serum response factor. In oligodendrocytes at least, it has been shown that PRRX1 is necessary for the expression of cholesterol biosynthesis genes [68].Community L consists mainly of genes annotated as related to interferon signaling; these genes are also the primary drivers of the cellular anti-viral response. Finally, community M contains several genes associated with growth factor signaling.Figures S4–S7 (Additional files 6, 7, 8, 9) display the gene–gene linkages within these communities graphically. Here, genes are shown as light blue disks, except for transcription factors which appear as yellow squares. The areas of the disks and squares are proportional to the relative mean expression of the genes (absolute scaling differs between panels, as it was adjusted to enhance the readability of each figure). Green lines denote significant positive correlations (no negative correlations were observed within any of the communities shown). In several cases, a smaller version of each graph, in which links supported by protein–protein interaction data have been overlayed with brown lines, is displayed as an inset; for some communities, only the graphs containing these highlights are shown. Examination of Figs. S4–S7 shows that genes associated with a single functional annotation, or that encode directly-interacting proteins, are often especially densely interconnected, causing them to cluster together (e.g., genes encoding directly physically interacting proteins in A and B, unfolded protein response genes in C, ribosomal subunit genes in D, glycolysis genes in E, S100 genes in I, etc.).Whereas the clustering of genes into communities had been carried out based on positive gene–gene correlations, plotting pairs of communities together enabled the evaluation of negative correlations between them, the vast majority of which involved the mitochondrially-encoded genes which, as mentioned previously, strongly anti-correlated with ribosomal protein genes (Fig. 5). Mitochondrial genes also anti-correlated with some of the genes in communities A, F, E and G; for example, anticorrelations between mitochondrial genes and genes encoding glycolytic enzymes are shown in Fig. 5E.Figure 5 and Table 2 also document that the anti-correlation between mitochondrial and ribosomal genes is as apparent in cell clusters 1.1, 2 and 3 as it is in cluster 1.2. This is interesting insofar as the features used to subdivide cells into these clusters included most of mitochondrially-encoded and ribosomal genes. What this suggests is that there are not two separable populations of cells overexpressing mitochondrial versus ribosomal genes. Instead, the data suggest that the observed anti-correlations are not themselves sufficiently correlated with each other to drive cell clustering. In other words, BigSur seems to able to detect local gene–gene relationships that could not have been identified by comparing differential expression of genes in any possible grouping of cells.Table 2 Mitochondrially-encoded and ribosomal protein gene communities identified following unsupervised clustering of each of the four cell clusters, 1.1, 1.2, 2 and 3Estimating the accuracy of BigSurThe association of a small number of annotation categories with nearly all of the gene communities that BigSur found, in an unsupervised manner, in scRNAseq data strongly suggest that many of the correlations detected by BigSur are true positives. Nevertheless, in nearly all communities some genes did not fall into obvious categories (Table 1). Do these just reflect the incompleteness of existing annotations, or did BigSur produce false positive results beyond the ~ 2% expected from the FDR cutoff that was used?Rarely in biology does one have ground truth information with which to settle this question, but one can still perform informative tests. For example, one can consider a subset of the genes that form a functional category and ask whether BigSur correctly identifies other members of the category as being correlated with them. An example is shown in Fig. 6A, where we started with a 27-gene panel, the “Reactome Cholesterol Biosynthesis” gene set from MSigDB (ACAT2, ARV1, CYP51A1, DHCR24, DHCR7, EBP, FDFT1, FDPS, GGPS1, HMGCR, HMGCS1, HSD17B7, IDI1, IDI2, LBR, LSS, MSMO1, MVD, MVK, NSDHL, PPAPDC2, PMVK, SC5D, SQLE, SREBF1, SREBF2, TM7SF2). That panel includes many, but not all, genes known to be involved in cholesterol metabolism. The graph displays all statistically significant positive correlations involving any of these genes that were detected in cell cluster 1.2 (members of the gene panel are indicated with large font, blue lettering).Fig. 6Comparing the ability of BigSur and uncorrected PCCs to identify biologically significant correlations. Using the data from cell cluster 1.2, BigSur identified positive correlations and their p values for all genes, converting the p values to equivalent PCCs. In addition, the same starting data were also normalized and uncorrected PCCs obtained. A–D compare the correlations identified by the two methods. A Genes identified by BigSur as correlating with the Reactome Cholesterol Metabolism gene set. All significant gene–gene correlations detected by BigSur involving genes in this 27-gene set (the names of which are shown at right) and all other genes in the genome are shown graphically. Genes belonging to the set are highlighted in blue; additional known cholesterol metabolism-related genes are highlighted with rectangles. Green lines show significant positive correlations. Dashing surrounds a group of cell cycle genes that correlate with the dual-function gene LBR. Expression levels (mean UMI per cell after normalization) for each of the genes in the set are shown at right; genes highlighted in yellow are those that displayed any significant correlations. B Correlation communities for the same gene set, panel identified using uncorrected PCCs, visualized as in panel (A). Note that the target genes are scattered among 5 disconnected communities and are connected with genes with no obvious relationship to cholesterol metabolism. C Positive correlations involving any of the genes belonging to the Reactome Cholesterol Metabolism gene set were divided into “within-community” links and “out-of-community” links. With BigSur (filled symbols), the ratio of within-community to out-of-community links is much higher than with uncorrected PCCs (open symbols), suggesting the latter produce much less enrichment for functionally relevant connections. D Analyses similar to that in panel C were performed for eight additional MSigDB “Hallmark” gene sets. Plotted are the proportions of correlations that are within-group over a range of PCC thresholds. Numbers of genes in each panel are shown in parentheses. E Analyses similar to that in panel (D) were performed with 150-gene sets chosen at random after first removing genes with mean expression below 0.052 (so that median expression for random genes was similar to that of the functional panel genesAll detected members of the gene set formed a single network, with many internal positive linkages. Closely connected to this network were eight additional genes (outlined in red) that are not part of the Reactome Cholesterol Biosynthesis set, but belong either to the “Hallmark” cholesterol homeostasis set from MSigDB (ACSS2, ACTG1, LDLR, SCD, STARD4, TMEM97); are designated as core cholesterol metabolism genes in recent literature (ACLY [69]; C14or1/ERG28 [70]); or encode sterol binding proteins that serve as feedback controllers of cholesterol biosynthesis (INSIG1 [71]). Highlighted in orange are two additional genes that regulate lipid biosynthesis more generally, LPIN1 and PRRX1.Four additional genes are highlighted in gray, as it is possible they are also related to cholesterol metabolism: AKR1A1, CALR, PRDX1, and PMEL. AKR1A1 encodes an enzyme of the aldo/keto reductase superfamily, which reduces a wide variety of carbonyl-containing compounds to their corresponding alcohols; its orthologues AKR1C1 and AKR1D1 are well known to play a role in the metabolism of steroid hormones but AKR1A1 is thought to act on non-steroid compounds (perhaps this assumption should be revisited). CALR, encodes calreticulin, a protein that promotes folding, oligomeric assembly and quality control in the endoplasmic reticulum. As cholesterol synthesis takes place in the ER membrane, it is perhaps not surprising that expression of CALR and cholesterol synthesis genes would be co-regulated. PRDX1 encodes peroxiredoxin, an antioxidant enzyme that reduces hydrogen peroxide and alkyl hydroperoxides and plays a key role in maintaining redox balance. In macrophages, PRDX1 has been shown to be critical for cholesterol efflux during autophagy [72]. PMEL encodes a major component of the melanosome. Interestingly, it has been reported that cholesterol strongly stimulates melanogenesis in melanocytes and melanoma cells [73].In addition to these genes, a set of tightly interconnected genes that have no known association with cholesterol metabolism is peripherally connected to this community (outlined by a dotted line in Fig. 6A). A large proportion of these genes encode proteins associated with the cell cycle, especially with functions associated with mitosis, such as mitotic checkpoint events, and cyto- and karyokinesis. Strongly connected to these mitotic genes is LBR, which encodes the Lamin B receptor. LBR is a multifunctional protein: it plays a key role in cholesterol biosynthesis, catalyzing the reduction of the C14-unsaturated bond of lanosterol, but also anchors the nuclear lamina and heterochromatin to the inner nuclear membrane, and in so doing is strongly regulated by phosphorylation during the cell cycle [74]. A functional link between expression of cholesterol biosynthesis genes and mitosis may arise from that fact that cholesterol synthesis normally rises dramatically during G1 phase and if prevented from doing so will result in G1 arrest [75]. It thus may make sense to upregulate the production of cholesterol biosynthetic genes as cells go into mitosis, so they are available to act in the G1 phase that immediately follows.To the right of the graph in Fig. 6A the genes of the Reactome Cholesterol Biosynthesis set are listed alongside their relative expression levels (UMI/cell after normalization) in this data set. Highlighted in yellow are those for which BigSur identified significant positive correlations, nearly all of which were toward the higher end of expression levels. These results illustrate an important limitation of any correlation methodology, which is that statistical power depends on the number of non-zero entries in the data, which will, in turn, reflect expression level, sequencing depth, and the number of cells analyzed. Examination of the raw data indicates that the point at which BigSur begins to fail to identify significant correlations occurred, for these genes, when the total number of non-zero entries (among 1582 cells) fell to about 320; this corresponded to a mean expression level of about 0.3 UMI/cell. For more strongly correlated genes one should of course expect fewer entries to be necessary to achieve significance, but based on the results here we suggest that, when seeking to identify the kinds of weak correlations associated with noise-coupled gene-regulatory networks, a minimum of several hundred non-zero entries per gene may be a reasonable threshold. In practice, one should be able to adjust the number of genes that fulfill this criterion either by varying the number of cells analyzed or varying the depth of sequencing (or both).Figure 6B plots the results of the identical exercise as in Fig. 6A, carried out using uncorrected PCCs obtained directly from normalized gene expression data, rather than using PCC′ and BigSur. A threshold of PCC = 0.29 was selected so that the number of Reactome Cholesterol Biosynthesis genes that exhibited above-threshold correlations was the same as in Fig. 6A. In this case, however, the cholesterol biosynthesis genes did not form a single community, but rather separated into 5 disconnected groups. Most groups contained links to many genes, none of which appeared, on inspection, to bear any relationship to cholesterol metabolism. These data make the important point that most of the links detected using uncorrected PCCs are likely to be false positives.One way to extend this analysis to other functional gene sets is suggested by the observation that, among the correlations in Fig. 6A involving Reactome Cholesterol Biosynthesis genes, 25 occur between member genes (“within-community”) while 157 involve other genes (“out-of-community”). In contrast, in Fig. 6B there are 143 out-of-community links and zero in-community ones. Figure 6C shows how the numbers of within-community and out-of-community links change as the PCC significance threshold is changed, both for BigSur and uncorrected PCCs. These results suggest that the fraction of all correlations that is in-community can be used as a measure of the specificity with which functionally relevant correlations are identified by any given method. In Fig. 6D, we used this approach to contrast the performance of BigSur (filled symbols) with uncorrected PCCs (open symbols) for 8 different gene sets, each of which contained between 49 and 200 members. Using BigSur, the proportion of links that were in-community varied between ~ 1% and 30% depending on the gene set, usually increasing as the equivalent PCC threshold was made more stringent. Using uncorrected PCCs, the proportion of in-community links varied between 0.2% and 1% and did not change appreciably with PCC threshold. For comparison, Fig. 6E performs a similar analysis using 8 “random” gene sets—sets of 150 genes randomly selected from genes with similar overall expression to those in the gene sets used in Fig. 6D. The results suggest that the performance of BigSur on functionally related genes is far above chance level, while uncorrected PCCs perform at or around that level.Induced changes in gene expression are associated with increased gene–gene correlationIf BigSur detects physiologically meaningful gene correlations, one might expect genes induced in response to specific perturbations to be correlated with each other. In other words, among genes that differ between treated and untreated conditions, one might also expect to observe groups of the same genes displaying correlations within each individual condition—presumably because they represent genes regulated by common upstream signals. To test this, we turned to a scRNAseq study in which differentiated human airway epithelial cells were treated with IL-13, a model of cytokine-induced asthma [76]. We focused on a single subset of cells, those labeled “defense secretory”, as similar numbers of them were captured in both treated and untreated conditions (see Methods).Data from control and IL13-treated cells were then analyzed separately using BigSur (Fig. S8/Additional file 10). First, we focused on the 419 genes reported to be upregulated by IL13 [76]. Within the treated group we found 313 of these genes were also correlated with each other in a single, large community (Fig. S8A). Interestingly, 45 of these genes were correlated in control cells as well (Fig. S8B), suggesting that a module of co-regulated gene expression is already active prior to treatment.Next, we asked whether BigSur could identify co-regulated genes that had not been picked up as differentially expressed (perhaps because their expression levels did not change sufficiently after treatment). We focused on genes that correlated with one particular gene, MUC16, which encodes an IL13-induceable mucin thought to play a prominent role in asthma-associated mucus obstruction of the airway [76]. We found 168 genes to be significantly positively correlated with MUC16 in IL13-treated cells (Fig. S9/Additional file 11), only 23 of which were among those differentially expressed in response to IL13 [76]. Among the remaining 145 genes were found MUC4, another mucin gene; SYTL2, a paralog of SYT2, which encodes a limiting factor in mucin secretion [77]; and THSD7B, which encodes a direct interactor of SYT2 and SYTL2.Also among the genes that correlated with MUC16 in IL13-treated cells were 5 of 11 genes previously identified by GWAS as associated with risk of childhood asthma: PDE4D, PTPRD, RBFOX1, ROBO2, and RYR2 [78]; MUC16 also correlated with ROBO1, which encodes an interaction partner of ROBO2. The top-ranked GWAS gene, RYR2, encodes a calcium channel thought to play an important role in asthma pathogenesis [78]. Neither RYR2, nor any of the other 11 GWAS associated genes, were detected as differentially expressed after IL13 treatment, although two genes encoding proteins that interact with RYR2 (CALM1 and CMK2D) were [76]. These data suggest that the analysis of gene correlations has the potential to substantially augment biological insights that come from studies of differential gene expression.Aggregation into “meta-cells” impedes the identification of gene correlationsThe ability of BigSur to detect biologically relevant gene correlations with controllable false discovery reflects its ability to handle the highly skewed distributions of scRNAseq data, distributions that are usually dominated by many zeros. In the scRNAseq literature, several other approaches have been proposed to compensate for data sparsity, such as filling in zeros by imputation (estimating values based on closely related cells) or aggregating groups of cells (merging measurements in similar cells, or across technical replicates of the same cells). Aggregation essentially replaces groups of cells with a smaller number of “metacells” or “pseudocells”, essentially implementing a small-scale version of “pseudobulking” [79,80,81,82,83]. This approach has proved useful in improving the accuracy of identifying differentially expressed genes, and in analyses in which genes are clustered into weighted gene co-expression networks [84, 85].To the extent that aggregation reduces data sparsity, one might expect it also to improve accuracy in discovering gene–gene correlations. In fact, at least one group [86] has proposed this explicitly, developing a method that starts with normalized, log-transformed scRNAseq data, and averages values among groups of cells that are judged sufficiently similar (based on Leiden clustering). Using this method, they claimed that one can enhance the ability to detect meaningful correlations in scRNAseq data.We investigated this claim, however, and came to the opposite conclusion. We discovered that grouping cells into metacells markedly inflates false positive correlations, a phenomenon not explored by [86]. This effect is easily demonstrated by applying their procedure to synthetic data in which gene expression values were sampled randomly and independently from Poisson-lognormal distributions (coefficient of variation of underlying lognormal distribution = 0.5), with cells sequenced to different depths, as in Figs. 1 and 2. Using such data as a starting point, we calculated correlation coefficients six different ways (Fig. 7A–C), using the Fischer formula to define the threshold for statistical significance (with PCC′ we also used a Benjamini–Hochberg adjusted FDR threshold of < 2%, as determined by BigSur). In such a data set, an accurate method should detect no significant correlations, positive or negative.Fig. 7Grouping into “metacells” creates false positive correlations. A–C Analysis of synthetic, uncorrelated data. A Pipelines for data processing. Each box denotes a step at which a Pearson correlation coefficient (PCC or PCC′) was calculated, with the color of the box corresponding to the colors used in the following plots. B Histograms of correlation coefficients obtained from the data sets in panel (A). Arrows show the thresholds above which observed correlations were judged statistically significant. C Numbers of correlations in panel (B) that were judged to be significant (p < 0.02). D–F Analysis of melanoma cell line data. D Pipeline for data processing. The colors of each box correspond to the colors in panels (E, F). E Number of correlations judged significant in data sets in D. Darker shading denotes the negative correlations; lighter are positive correlations. F Enrichment for known protein–protein interactions among the correlations shown in (E). The value of n in each case gives the absolute number of protein–protein interactions. Of the 12,129 interactions detected by BigSur, 11,373 were also detected using grouping of log-normalized data and 10,773 were detected using grouping of modified corrected Pearson residuals (10,324 were shared among all three)As expected, using PCCs calculated directly from raw, unnormalized data, nearly 100 million artifactually significant positive correlations were identified, and these were not eliminated by data normalization (scaling UMI values to sequencing depth). Log-transformation of the data (using a “pseudocount” of 1) reduced the number of false positives to 382,511, probably because it greatly reduced the range of variability between high and low values. In contrast, when log-transformed values were grouped and averaged according to the procedure of [86], clustering the 3570 cells and grouping them into 100 metacells, we observed a massive inflation of large positive and negative values of PCC (Fig. 7B, C), among which over 2 million were judged significant by the Fischer formula (the formula automatically takes into account the reduced number of cells). This was not just a consequence of failure to compensate for variable sequencing depth, because even when we created data with equal sequencing depth across all cells, we still observed that grouping into metacells produced just as many false correlations (Fig. 7B, C).In contrast, PCC′, calculated from unnormalized data, did much better, producing fewer than 78,000 false correlations, as judged using the Fischer formula, and no correlations, positive or negative, when the FDR was appropriately calculated using BigSur.These results indicate that, even if grouping cells increased the likelihood of detecting true correlations, the effect would likely be overwhelmed by the creation of so many false ones. To assess the net effect, we compared the method of [86] with BigSur using the full set of melanoma cells analyzed in Fig. 3. As diagramed in Fig. 7D, we performed grouping both on Log-transformed normalized data, as per [86], and on modified corrected Pearson residuals (Pij’), hoping that the latter might better correct for variable sequencing depth. Consistent with what was seen with synthetic data, grouping produced many more correlations than BigSur, although the use of modified corrected Pearson residuals reduced this to some extent (Fig. 7E).To assess the recovery of true positives, we quantified enrichment for pairs of genes with known protein–protein interactions (Fig. 7F), as such genes are more likely than random genes to be co-expressed. As described earlier, such interactions were enriched more than seven-fold among the pairs of genes found significantly positively correlated by BigSur. In contrast, even though grouping into metacells identified significant correlations among more total gene pairs with known protein–protein interactions, the enrichment over what would have been expected by chance was less than two-fold, suggesting an accuracy barely better than chance levels.The reason grouping generates so many false positive correlations appears to be generic to the process of aggregating cells by similarity, and not specific to details of the method of grouping of [86], as we can see similar effects using synthetic data and other, simpler algorithms for grouping cells. These observations nicely illustrate how, when dealing with correlations, even seemingly innocuous empirical fixes can create unexpected problems, and underscore the value of BigSur in provided a principled, theory-grounded approach to assessing both the magnitude and significance of gene–gene correlation.

Hot Topics

Related Articles