Best practices for differential accessibility analysis in single-cell epigenomics

Lack of consensus in single-cell DA analysis

We first sought to map the landscape of statistical methods that have been used to perform DA analysis. For this purpose, we conducted a comprehensive survey of the single-cell epigenomics literature. We identified a total of 118 primary publications that reported single-cell epigenomic datasets (Supplementary Data 1). This survey confirmed the widespread adoption of scATAC-seq (Fig. 1a). Chronological analysis revealed that the number of cells profiled in any given study has increased exponentially over time (Fig. 1b). Among the 13 statistical methods for DA analysis that were detected in this survey, the Wilcoxon rank-sum test was the most widely used (Fig. 1c). However, no method was used in more than 15 studies, and many DA methods were used in just one or two published analyses. Beyond this unsettling observation, we also observed disagreement on fundamental principles of DA analysis, such as whether or not to binarize measures of genome accessibility (Fig. 1d). This lack of consensus is reflected in the variety of DA methods that are implemented by default within the most widely used analysis packages for scATAC-seq (Fig. 1e).

Fig. 1: Landscape of DA analysis for single-cell epigenomics. a Experimental techniques used in 118 primary publications that reported single-cell epigenomic datasets. Inset pie chart shows the proportion of studies (64%) that reported a DA analysis. b Number of single cells profiled by scATAC-seq in 91 primary studies, shown as a function of publication date to highlight exponential scaling of scATAC-seq over time. Trend line and inset p-value, linear regression; shaded area, 95% confidence interval. c Statistical methods for DA analysis employed in 118 single-cell epigenomics papers. DA methods shown in grey will be considered in our analysis. “Other” includes four additional methods used in just a single study. Inset pie chart shows the total proportion of single-cell epigenomics papers (94%) that employed a DA analysis method considered in this Registered Report. d Proportions of single-cell epigenomics studies that have treated the data as binary or continuous, respectively, during DA analysis. e Default statistical methods for DA analysis implemented in 13 single-cell analysis packages. Top, number of citations per package. Right, total number of analysis packages in which each DA analysis method is implemented as the default. f Cumulative distribution functions showing statistical properties of RNA-seq and ATAC-seq data from matching single cells. ATAC-seq features (peaks) are characterized by a lower average sequencing depth and a higher proportion of zeroes. Data are from a 10x multiome dataset from the mouse spinal cord (Methods). Source data are provided as a Source Data file (Source Data 1).

Our survey of the single-cell epigenomics literature highlighted that the most widely used statistical methods in this field are based on, or identical to, methods that were originally developed for scRNA-seq. The application of similar statistical methodologies contrasts with the different facets of biology measured by each technology. Whereas scRNA-seq measures expressed genes, scATAC-seq measures the accessibility of the entire genome, and accordingly, we identified dramatic differences in the statistical features of scATAC-seq compared to scRNA-seq data (Fig. 1f). In particular, scATAC-seq measures a larger number of features compared to scRNA-seq, and each of these features is quantified by fewer reads and in fewer cells47. These biological and technological differences raise the possibility that the statistical methods used for DE analysis of scRNA-seq data may be ill-suited to DA analysis of scATAC-seq data.
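To make the contrast shown in Fig. 1f concrete, the sketch below computes the two summary statistics described above (per-feature mean counts and the proportion of zero entries) from paired RNA and ATAC count matrices. The variable names and the use of SciPy sparse matrices are illustrative assumptions, not a description of the analysis code used in this study.

```python
import numpy as np
from scipy import sparse

def feature_summaries(counts: sparse.csr_matrix):
    """Per-feature summaries for a cells-by-features count matrix.

    Returns the mean count per feature and the fraction of cells in which
    each feature has zero counts, the two quantities contrasted between the
    RNA and ATAC modalities in Fig. 1f.
    """
    n_cells = counts.shape[0]
    mean_counts = np.asarray(counts.mean(axis=0)).ravel()
    n_nonzero = np.asarray((counts > 0).sum(axis=0)).ravel()
    frac_zero = 1.0 - n_nonzero / n_cells
    return mean_counts, frac_zero

# Usage with hypothetical matrices from a 10x multiome experiment:
# rna_mean, rna_zero = feature_summaries(rna_counts)
# atac_mean, atac_zero = feature_summaries(atac_counts)
```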
Epistemological framework for biologically accurate DA analysis

If the most prevalent statistical methods are not optimised for DA analysis, then they may overlook biological differences or, conversely, lead to spurious discoveries. These possibilities compelled us to conduct a comprehensive comparison of DA methods for scATAC-seq data. As a prerequisite for such a comparison, we first recognized the necessity of an epistemological framework that would capture the biological accuracy of these methods48. Whereas the majority of benchmarks in computational biology rely on simulated datasets, we previously showed that these simulations fail to appreciate essential aspects of biological data generation, and therefore lead to unreliable conclusions49. Instead, we showed that comparisons of statistical methods based on real datasets with experimental ground truth do capture biological differences in the performance of these methods. In the context of scRNA-seq data, we showed that a close approximation to the ground truth can be obtained from matched bulk and scRNA-seq performed on the same population of purified cells, exposed to the same perturbation, and sequenced in the same laboratory.

We hypothesized that a similar epistemological framework would allow us to establish the biological accuracy of statistical methods for DA analysis. To enable this framework, we identified a series of published datasets in which matched bulk and single-cell ATAC-seq were used to profile the same populations of purified cells within the same laboratories. These matched bulk datasets provide a mechanism to evaluate the biological accuracy of single-cell DA methods.

We also postulated that the development of single-cell multi-omic assays50, in which the epigenome and transcriptome are profiled in the same individual cells, would provide the opportunity to extend this epistemological framework. Concretely, epigenomic measurements from multi-omic assays can be aggregated to the level of genes, and therefore, we reasoned that these assays would allow us to compare DA and DE within the same individual cells. The biological hypothesis underlying this experiment is that genes that are differentially expressed across biological conditions are likely to have promoters that are differentially accessible within the same individual cells. Previous work has established that this assumption holds across the genome as a whole when DE and DA are measured systematically51,52,53.

The third and final component of our epistemological framework is the recognition that the biological accuracy of any statistical method is contingent on its ability to avoid producing false discoveries49. We therefore compared DA methods based on their ability to avoid false discoveries in the absence of any true biological differences.

Experiment 1: Evaluating single-cell DA methods with matched bulk data

We first used our assembled compendium of matched bulk and single-cell ATAC-seq datasets to evaluate the biological accuracy of each single-cell DA method, using the bulk data as a reference.
Our survey of the literature identified five studies in which matching single-cell and bulk epigenomics data were collected from the same populations of purified cells and sequenced within the same laboratory (Fig. 2a). These studies collected between two and four scATAC-seq libraries per condition (Supplementary Data 2). We performed DA analysis of both the bulk and single-cell ATAC-seq datasets, using each of the 11 DA methods that had been employed by at least two publications at the time of our literature review. We measured the concordance between single-cell and bulk DA analyses using the area under the concordance curve (AUCC)54, as employed by previous benchmarks of DE methods for single-cell transcriptomics data49,55.

Fig. 2: Evaluating single-cell DA methods with matched bulk data and single-cell multi-omics. a Design of Experiment 1. DA analysis was performed between cell types or conditions for single-cell datasets with matching bulk ATAC-seq as a reference. Peaks were called either in the bulk ATAC-seq data (primary analysis) or in pseudobulk single-cell data (sensitivity analysis). b Area under the concordance curve (AUCC) for single-cell DA methods in Experiment 1, using matching bulk ATAC-seq as a reference (n = 16 comparisons). Inset text shows the median AUCC. Methods that aggregate counts within replicates to form ‘pseudobulks’ are shown in shades of red; one method that aggregates counts across replicates is shown in green; and methods that do not aggregate information across cells are shown in blue. c Design of Experiment 2. DA analysis was performed between cell types using matched scRNA-seq data from the same cell as a reference, comparing DA of genomic intervals around the TSS to gene-level differential expression (n = 306 comparisons). d Area under the concordance curve (AUCC) for single-cell DA methods in Experiment 2, using matching snRNA-seq as a reference. Inset text shows the median AUCC. e As in d but showing concordance at the level of GO terms enriched among DA peaks. Source data are provided as a Source Data file (Source Data 2).

In our primary analysis, we observed that most DA methods achieved comparable performance, with relatively small differences separating the ten top-performing methods (Fig. 2b). Among these, methods that aggregated cells within biological replicates to form so-called ‘pseudobulks’ consistently ranked near the top. In contrast, negative binomial regression and a previously described permutation test19 were outliers that achieved substantially lower concordance with the bulk data than other DA methods.

We conducted a series of sensitivity analyses to test the robustness of these observations. First, we found that the performance of single-cell DA methods was largely unchanged when applying different DA approaches to establish the experimental ground truth within the bulk data (Supplementary Fig. 1a, b). Moreover, we obtained broadly consistent results when varying the number of top-ranked DA peaks used to calculate the AUCC, or when filtering peaks that were not accessible in at least 5% of cells (Supplementary Fig. 1c).
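As a point of reference for these comparisons, the sketch below shows one common way to compute the AUCC between two ranked lists of DA peaks. It is a minimal illustration assuming peaks ranked by p-value, not the exact implementation used in this study.

```python
import numpy as np

def aucc(ranking_a, ranking_b, k=500):
    """Area under the concordance curve between two ranked feature lists.

    ranking_a, ranking_b: sequences of feature identifiers ordered from most
    to least significant (e.g., peaks ranked by DA p-value in the single-cell
    and bulk analyses, respectively).
    k: number of top-ranked features to consider.

    Returns a value between 0 (no overlap) and 1 (identical top-k lists).
    """
    top_a, top_b = list(ranking_a[:k]), list(ranking_b[:k])
    # Size of the intersection of the top-i features, for i = 1..k
    overlaps = [len(set(top_a[:i]) & set(top_b[:i])) for i in range(1, k + 1)]
    # Sum of overlaps, normalized by its maximum possible value k*(k+1)/2
    return 2.0 * np.sum(overlaps) / (k * (k + 1))
```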
Because calculating the concordance between DA analyses of single-cell and bulk ATAC-seq requires a matching set of peaks to be defined in both datasets, we also explored the impact of varying this peak set. In all of the above analyses, we had called peaks in the matched bulk ATAC-seq data, but widely used software packages such as Signac35, ArchR33, and SnapATAC34 instead call peaks in ‘pseudobulk’ samples created by pooling the single-cell data. We found that this procedure improved the concordance between single-cell and bulk ATAC-seq overall, but that the relative performance of single-cell DA methods remained similar (Supplementary Fig. 1d). To address the possibility that artefacts in peak calling might confound these results, we also devised a procedure to introduce spurious peaks into the peak sets, and found that the relative performance of the single-cell DA methods was largely robust to the presence of noise (Supplementary Fig. 1e). Finally, we evaluated concordance separately for peaks in promoter versus enhancer regions, and found that differences between DA methods were apparent primarily in the latter (Supplementary Fig. 1f).

Experiment 2: Evaluating single-cell DA methods with single-cell multi-omics

We next identified four studies that employed multi-omic assays to quantify both gene expression and chromatin accessibility across tens of thousands of nuclei, each profiling between two and 13 replicates (Supplementary Data 2), and leveraged these datasets to repeat the comparisons of single-cell DA methods described in Experiment 1, now using gene expression as the reference (Fig. 2c). To enable comparison between the ATAC and RNA modalities, we aggregated chromatin accessibility around promoters into gene-level activity scores, as implemented in a number of widely used software packages33,35,56. We then tested for differences in chromatin accessibility at the gene level, using the DE results from the matching RNA modality to define the reference. Moreover, we performed GO enrichment analyses of the differentially accessible genes, and evaluated the concordance between GO enrichment analyses of the ATAC and RNA modalities.

In our primary gene-level analysis, we identified a more pronounced difference between DA methods that aggregated cell-level chromatin accessibility profiles into ‘pseudobulks’ and those that did not, as compared to Experiment 1 (Fig. 2d). The former outperformed the latter, although all pseudobulk methods achieved comparable performance to one another.

We conducted a series of sensitivity analyses to assess the robustness of this result. First, we verified that these results were not confounded by genes with overlapping promoter regions: removing these genes led to slightly improved concordance but had little impact on the relative performance of single-cell DA methods (Supplementary Fig. 2a). We also confirmed that these results were largely unchanged when (i) applying different statistical approaches to establish the experimental ground truth within the RNA modality, (ii) varying the number of top-ranked DA genes used to calculate the AUCC, or (iii) filtering genes whose promoters were not accessible in at least 1% of cells (Supplementary Fig. 2b–d). Finally, previous work has shown that inferences about DE are generally more accurate for highly expressed genes57,58, whereas identifying instances of true DE among lowly-expressed genes can be challenging49. These observations raised the possibility that inaccurate inferences about DE for lowly-expressed genes might confound our analysis. Therefore, we re-evaluated the concordance after excluding the bottom tercile of lowly-expressed genes, and found that the relative performance of DA methods was essentially unchanged (Supplementary Fig. 2e).
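The gene-level comparison described above relies on collapsing peak-level accessibility into per-gene activity scores. A minimal sketch of this aggregation step is shown below; it assumes a cells-by-peaks count matrix and a precomputed mapping from peaks to the genes whose promoter regions they overlap (both hypothetical inputs), and is not the exact procedure of any particular package.

```python
import numpy as np
from scipy import sparse

def gene_activity_scores(peak_counts, peak_to_gene, genes):
    """Aggregate a cells-by-peaks count matrix into cells-by-genes activity scores.

    peak_counts: scipy.sparse matrix of shape (n_cells, n_peaks).
    peak_to_gene: list giving, for each peak, the gene whose promoter region
        it overlaps (or None if it overlaps no promoter).
    genes: ordered list of gene names defining the output columns.
    """
    gene_index = {g: j for j, g in enumerate(genes)}
    rows, cols, vals = [], [], []
    for peak_idx, gene in enumerate(peak_to_gene):
        if gene in gene_index:
            rows.append(peak_idx)
            cols.append(gene_index[gene])
            vals.append(1.0)
    # Peaks-by-genes indicator matrix; the matrix product sums peak counts per gene
    assignment = sparse.csr_matrix(
        (vals, (rows, cols)), shape=(peak_counts.shape[1], len(genes))
    )
    return peak_counts @ assignment
```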
In our primary Gene Ontology-level analysis, we again observed that the four top-performing methods all aggregated cells to form pseudobulks (Fig. 2e). We then repeated each of the above sensitivity analyses at the level of Gene Ontology terms. In general, the relative performance of single-cell DA methods was robust to the procedures used to identify the experimental ground truth within the RNA modality, to varying the number of top-ranked GO terms used to calculate the AUCC, and to filtering genes whose promoters were not accessible in at least 1% of cells (Supplementary Fig. 3b–d). In contrast, the relative performance of single-cell DA methods was somewhat more sensitive to the removal of genes with overlapping promoters or of lowly-expressed genes in the RNA modality, both of which improved the performance of the t-test (Supplementary Fig. 3a, e).

Notably, in all twelve primary and sensitivity analyses at the gene and GO term levels, a pseudobulk DA method achieved the best performance. Moreover, no pseudobulk DA method ranked in the bottom half of DA methods in any of these analyses.

Experiment 3: False discoveries in single-cell DA

The possibility of profiling hundreds of thousands of cells with single-cell epigenomics presents both opportunities and challenges. On one hand, the statistical power afforded by scATAC-seq data should enable the detection of subtle changes in chromatin accessibility between conditions or cell types. On the other hand, these subtle changes may represent technical artefacts or even false discoveries rather than true biological differences. We exposed similar opportunities and challenges in our recent analysis of DE methods for scRNA-seq, wherein we found that many of the most widely used methods can produce thousands of false discoveries in routine experiments49. We investigated whether similar phenomena arise in scATAC-seq data. For this purpose, we conducted a series of analyses to assess the emergence of false discoveries in DA analyses of both real and simulated datasets.

We first investigated the emergence of false discoveries in random comparisons of cells from the same experimental condition. For this purpose, we repurposed a large scATAC-seq dataset59 to create artificial comparisons between cells from identical experimental conditions, with the expectation that any regions called as DA reflect statistical false discoveries rather than instances of true DA (Fig. 3a). In our primary analysis, we found that several widely-used DA methods identified thousands of differentially accessible peaks in the absence of any biological differences (Fig. 3b). Strikingly, the three most frequently used DA methods in our survey of the literature (Wilcoxon rank-sum test, LRclusters, and LRpeaks; Fig. 1c) identified a median of 4664, 2505, and 8721 DA peaks, respectively, within any given cell type. Conversely, methods that aggregated single-cell chromatin accessibility profiles to form pseudobulks never identified more than a median of 9 DA peaks in these random comparisons.
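Because the contrast between cell-level and pseudobulk methods recurs throughout these experiments, a minimal sketch of the pseudobulk aggregation step is shown below. It assumes a cells-by-peaks sparse count matrix and per-cell replicate labels (illustrative inputs); the aggregated replicate-level counts would then typically be passed to a count-based framework such as edgeR or DESeq2, rather than testing individual cells.

```python
import numpy as np
from scipy import sparse

def pseudobulk_counts(counts, replicate_labels):
    """Sum a cells-by-peaks count matrix within replicates.

    counts: scipy.sparse matrix of shape (n_cells, n_peaks).
    replicate_labels: per-cell replicate identifiers (length n_cells).

    Returns (replicates, pseudobulk), where pseudobulk has shape
    (n_replicates, n_peaks), with one row of summed counts per replicate.
    """
    replicates = sorted(set(replicate_labels))
    labels = np.asarray(replicate_labels)
    pseudobulk = np.vstack([
        np.asarray(counts[labels == rep].sum(axis=0)).ravel()
        for rep in replicates
    ])
    return replicates, pseudobulk
```

Differential accessibility is then tested between replicate-level rows of the two conditions, so that biological variation between replicates enters the test directly.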
Fig. 3: False discoveries in single-cell DA analysis. a Schematic overview of Experiment 3.1. Bone marrow mononuclear cells from healthy donors were profiled by Luecken et al.59 in 13 independent replicates. For each cell type, half of these replicates were assigned to an artificial ‘control’ group, and the other half to an artificial ‘treatment’ group. DA analysis was then performed between cells from randomly assigned replicates. b Number of DA peaks detected between randomly assigned replicates at 5% FDR within each cell type in random comparisons of published scATAC-seq data (n = 21 comparisons). Inset text shows the median number of DA peaks per method. c As in b but showing the number of DA peaks detected at 5% FDR in DA analysis of downsampled bulk ATAC-seq libraries without biological differences between experimental conditions. d As in b but showing the number of DA peaks detected at 5% FDR in model-based simulations of scATAC-seq data without biological differences between experimental conditions. e As in b but showing the number of DA peaks detected at 5% FDR in model-based simulations of scATAC-seq data without biological differences between experimental conditions, with variation in sequencing depth between libraries. Source data are provided as a Source Data file (Source Data 3).

The premise of the above experiment is that any regions identified as DA between pairs of randomly assigned replicates are unlikely to represent biological differences. However, the nature of real-world data does not allow us to formally exclude the possibility that heterogeneity between replicates introduces true biological differences. Therefore, we complemented our re-analysis of published data with two simulation studies. In the first of these, we downsampled bulk ATAC-seq libraries to simulate scATAC-seq data with no biological cell-to-cell variation at all34. In our primary analysis, we again observed that several widely-used DA methods produced thousands of false discoveries, but that pseudobulk methods were much less prone to false discoveries (Fig. 3c).
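As an illustration of this first simulation strategy, the sketch below generates synthetic ‘cells’ by multinomially downsampling a bulk ATAC-seq peak count vector, so that all cell-to-cell variation is purely technical. The depth parameters and the Poisson-generated bulk library are arbitrary assumptions; this is a conceptual sketch rather than the simulation used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cells_from_bulk(bulk_counts, n_cells, depth_per_cell):
    """Simulate scATAC-seq-like cells from a single bulk library.

    Each synthetic cell draws `depth_per_cell` fragments from a multinomial
    distribution whose probabilities are the bulk peak count proportions, so
    the simulated cells contain no biological cell-to-cell variation.
    """
    probs = bulk_counts / bulk_counts.sum()
    return rng.multinomial(depth_per_cell, probs, size=n_cells)

# Example: two identical "conditions" drawn from the same bulk library;
# any peaks called as DA between them are false discoveries.
bulk = rng.poisson(lam=50, size=10_000).astype(float)  # hypothetical bulk peak counts
group_a = simulate_cells_from_bulk(bulk, n_cells=500, depth_per_cell=3_000)
group_b = simulate_cells_from_bulk(bulk, n_cells=500, depth_per_cell=3_000)
```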
To achieve more precise control over the composition and properties of simulated scATAC-seq datasets, we leveraged a simulation framework originally developed for scRNA-seq data60. We fit the parameters of this simulation to scATAC-seq data, and used the resulting model to simulate datasets that varied in the number of cells and libraries sequenced and in the degree of technical variation between libraries. In our primary analysis, we again identified thousands of false discoveries from widely-used DA methods (Fig. 3d). However, we observed differences in the specific DA methods that were most prone to producing false discoveries in this simulation setup, as compared to either null comparisons of real scATAC-seq data or the simulation described above based on downsampling bulk ATAC-seq data. We cannot exclude that these differences reflect an artefact of this simulation framework.

In secondary analyses, we established that the number of false discoveries was exacerbated by sequencing a greater number of cells or increasing technical variation between libraries, whereas it was reduced by profiling a larger number of replicates (Supplementary Fig. 4a–c). These observations are consistent with the notion that accounting for biological and technical variation between replicates is critical to controlling the false discovery rate49.

In our primary analysis, each replicate was simulated with the same sequencing depth, in contrast to the variable sequencing depths of replicates in real-world datasets. Therefore, we also carried out a secondary analysis in which we deliberately simulated replicates with different sequencing depths (Supplementary Fig. 4d). This simulation markedly increased the number of false discoveries returned by most single-cell DA methods, although pseudobulk methods remained robust to false discoveries (Fig. 3e).

Together, these experiments emphasized that many widely used single-cell DA methods can produce thousands of false discoveries. Notably, these false discoveries were exacerbated when simulating more technically or biologically heterogeneous libraries. Conversely, DA methods that aggregated cells within replicates to form pseudobulks showed a markedly improved capacity to avoid false discoveries. We therefore also studied other strategies to account for variation between replicates in DA analysis. First, we explored the impact of including replicate as a fixed effect in generalized linear models. Second, we considered replicate as a random effect by fitting negative binomial generalized linear mixed models (GLMMs)61. These experiments revealed that incorporating replicate as a fixed effect did not decrease, and sometimes increased, the number of false discoveries. However, incorporating replicate as a random effect enabled DA analysis without false discoveries (Supplementary Fig. 4e).

Experiment 4: Evaluating biases of single-cell DA methods

It is now well-established that many statistical methods for DE analysis of both bulk and single-cell RNA-seq data exhibit biases in the types of genes that are preferentially identified as DE49,55,57,58,62. However, it remains unclear whether similar biases may affect DA analyses of scATAC-seq data. We hypothesized that the same biases that affect scRNA-seq data could manifest in scATAC-seq data as a tendency of DA methods to preferentially identify peaks that are open in a larger proportion of cells, peaks that are supported by a greater number of sequencing reads, or peaks that are wider. To address these possibilities, we conducted a series of experiments using both real and simulated data to determine whether these biases affected DA analysis of scATAC-seq data.

We first characterized the properties of DA peaks in the same published scATAC-seq datasets examined in Experiment 1. To control for differences in the total number of peaks called as DA by each statistical method, we ranked peaks by their p-values and limited our analysis to the top-1000 DA peaks called by each method. In our primary analysis, we identified a number of differences in the characteristics of the peaks preferentially called as DA by each method. Methods that treated chromatin accessibility as a quantitative measurement (t-test, Wilcoxon rank-sum test, negative binomial regression, LRclusters) preferentially called peaks supported by a greater number of reads, and open in a greater number of cells, as DA (Fig. 4a, b). In contrast, methods that treated accessibility as a binary phenotype (Fisher’s exact test, LRpeaks, binomial test, permutation test), as well as pseudobulk DA methods and SnapATAC’s findDAR test, exhibited less bias towards highly accessible peaks. We observed less variability in the widths of peaks preferentially called as DA by each method, with the notable exception of negative binomial regression, which consistently called wider peaks as DA (Fig. 4c).
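A minimal sketch of how the properties of top-ranked DA peaks might be summarized is shown below. The inputs (a per-peak results table, a cells-by-peaks count matrix, and per-peak widths) and the column names are illustrative assumptions rather than the exact analysis code.

```python
import numpy as np
import pandas as pd

def top_peak_properties(results, counts, peak_widths, n_top=1000):
    """Summarize properties of the top-ranked DA peaks for one method.

    results: DataFrame with columns 'peak' (column index into counts) and 'p_value'.
    counts: array-like of shape (n_cells, n_peaks) of fragment counts.
    peak_widths: per-peak widths in base pairs.
    """
    counts = np.asarray(counts)
    top = results.nsmallest(n_top, "p_value")["peak"].to_numpy()
    return pd.Series({
        # Total reads supporting each top peak, averaged across peaks
        "mean_reads_per_peak": counts[:, top].sum(axis=0).mean(),
        # Fraction of cells in which each top peak is open, averaged across peaks
        "mean_fraction_cells_open": (counts[:, top] > 0).mean(axis=0).mean(),
        # Average width of the top peaks
        "mean_peak_width": np.asarray(peak_widths)[top].mean(),
    })
```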
Fig. 4: Biases in single-cell DA analysis. a Mean read depth of the top-1000 DA peaks identified by each single-cell DA method in published scATAC-seq datasets (n = 16 comparisons). Inset text shows the median across comparisons. b As in a but showing the proportion of cells in which these peaks are open. c As in a but showing the width of each peak.

In secondary analyses, we varied the number of top-ranked DA peaks considered by characterizing the top-500 or top-5000 DA peaks called by each method. The trends observed for the top-1000 DA peaks were broadly conserved in this analysis (Supplementary Fig. 5).

Next, we sought to specifically characterize the peaks that were spuriously called as DA in Experiment 3. We binned peaks into deciles according to each of the properties studied above, and computed both the absolute number and the proportion of false discoveries arising from each decile. Across all three false discovery experiments, most DA methods exhibited biases towards peaks supported by a large number of reads, peaks open in a greater proportion of cells, and broader peaks (Supplementary Figs. 6–8). In contrast, methods that aggregated cells to form pseudobulks exhibited a lesser degree of bias, when they identified any false discoveries at all. Unexpectedly, we found that SnapATAC’s findDAR test exhibited an unusual pattern of bias, whereby false discoveries preferentially arose from peaks with intermediate accessibility.

Experiment 5: Impact of log-fold change filtering

Differential analyses of ATAC-seq data may discard differentially accessible regions with a log-fold change below an arbitrary threshold, on the grounds that regions with small fold changes are unlikely to be biologically relevant44. Similar practices are ubiquitous in the analysis of other sequencing-based technologies, such as RNA-seq, where the choice of log-fold change threshold may alter the biological interpretation of a given experiment63. These observations motivated an empirical assessment of the impact of log-fold change filtering on single-cell DA analysis.

We recognized that the quantitative measures of performance evaluated in these experiments (AUCC or number of false discoveries) are sensitive to the total number of peaks being tested. This introduces a potential confounding factor, in that simply filtering peaks at random would also be expected to increase the AUCC and decrease the number of false discoveries. To account for this effect, for each log-fold change threshold we tested the effect of removing an equal number of peaks from the dataset at random. We then used these data to determine whether filtering by log-fold change increased the AUCC, or decreased the number of false discoveries, to a degree greater than expected at random.

We first examined the concordance of DA between scATAC-seq and matching bulk ATAC-seq. As expected, increasingly stringent log-fold change filters increased concordance for all DA methods. However, the same effect was also seen when removing equivalent numbers of random peaks (Supplementary Fig. 9). When controlling for random peak filtering, we observed that the concordance initially decreased when filtering up to 70% of peaks based on log-fold change, and then increased when filtering 80% or more of peaks. To summarize these trends, we visualized the effect size of log-fold change filtering across all DA methods, compared to random peak filtering (Fig. 5a).
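The sketch below illustrates the matched-random-filtering control described above: for a given log-fold change cutoff, it retains the same number of peaks chosen at random, and reports Cohen’s d between two sets of resulting performance scores. The inputs and the equal-variance pooling are illustrative assumptions, not the exact procedure used here.

```python
import numpy as np

rng = np.random.default_rng(0)

def lfc_filter_mask(log_fold_changes, threshold):
    """Keep peaks whose absolute log-fold change meets the threshold."""
    return np.abs(np.asarray(log_fold_changes)) >= threshold

def matched_random_mask(n_peaks, n_keep):
    """Keep the same number of peaks, chosen uniformly at random."""
    mask = np.zeros(n_peaks, dtype=bool)
    mask[rng.choice(n_peaks, size=n_keep, replace=False)] = True
    return mask

def cohens_d(x, y):
    """Cohen's d between two sets of performance scores (e.g., AUCCs),
    using a pooled standard deviation that assumes equal group sizes."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    return (x.mean() - y.mean()) / pooled_sd
```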
Fig. 5: Impact of log-fold change filtering on single-cell DA analysis. a Effect size (Cohen’s d) of increasingly stringent log-fold change filtering on the AUCC between single-cell and bulk ATAC-seq DA, relative to the removal of an equivalent number of peaks selected at random (n = 16 comparisons). Inset text shows the median Cohen’s d. b As in a but for the AUCC between the ATAC and RNA modalities of single-cell multi-omics data (n = 306 comparisons). c As in a but for the number of false discoveries in null comparisons of published scATAC-seq data (n = 21 comparisons).

We next examined the concordance of DA between the ATAC and RNA modalities of single-cell multi-omics data. This analysis recapitulated the risks of log-fold change filtering: we found that log-fold change filtering consistently decreased concordance, relative to random filtering, when removing up to 90% of genes (Fig. 5b and Supplementary Fig. 10).

Last, we tested the impact of log-fold change filtering on the appearance of false discoveries. Relative to random filtering, we consistently observed more false discoveries across all DA methods when filtering by log-fold change (Fig. 5c and Supplementary Fig. 11).

Experiment 6: Best practices for scATAC-seq analysis

Our survey of the literature identified substantial discordance not just in the methods used for DA analysis, but even in the representations of scATAC-seq data that are provided as input to these methods (Fig. 1d). Indeed, fundamental concepts in the representation of scATAC-seq data remain subjects of ongoing debate. Perhaps the most notable of these is whether to treat scATAC-seq data as a qualitative measurement by binarizing genome accessibility31,32. We asked whether the setting of DA could provide an opportunity to address this question. Specifically, we hypothesized that the accuracy of DA analysis could be used as a litmus test, in that approaches to scATAC-seq data preprocessing that enable more accurate and robust DA analysis are likely to also provide more accurate results at other steps in the analytical workflow. Accordingly, we repeated a subset of the analyses described above using binarized representations of the scATAC-seq datasets, with the aim of characterizing the effect of binarization on the biological accuracy, false discovery control, and biases of DA methods for scATAC-seq data.

We first tested whether binarizing scATAC-seq data would improve the biological accuracy of DA analysis, as quantified by the concordance with matching bulk ATAC-seq. Surprisingly, we found that binarizing scATAC-seq data generally increased the concordance between scATAC-seq and bulk ATAC-seq data, and the magnitude of this improvement was largest for the worst-performing DA methods (Fig. 6a and Supplementary Fig. 12a). The relative performance of DA methods was largely unchanged by binarization, with the exception of SnapATAC’s findDAR test, which improved to rank among the top-performing methods when applied to binarized data.
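As an illustration of what binarized DA testing entails, the sketch below binarizes the counts for a single peak and tests for a difference in the proportion of accessible cells between two groups using Fisher’s exact test. It is a conceptual sketch of one of the binary approaches considered here, not the exact implementation in any package.

```python
import numpy as np
from scipy.stats import fisher_exact

def binarized_fisher_test(peak_counts_a, peak_counts_b):
    """Test one peak for differential accessibility on binarized data.

    peak_counts_a, peak_counts_b: per-cell fragment counts for the peak in
    groups A and B. Counts are binarized (open vs. closed) and the two
    groups are compared with Fisher's exact test on the 2x2 table.
    """
    open_a = int(np.count_nonzero(peak_counts_a))
    open_b = int(np.count_nonzero(peak_counts_b))
    closed_a = len(peak_counts_a) - open_a
    closed_b = len(peak_counts_b) - open_b
    table = [[open_a, closed_a], [open_b, closed_b]]
    _, p_value = fisher_exact(table)
    return p_value
```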
Fig. 6: Best practices for scATAC-seq analysis. a Area under the concordance curve (AUCC) for single-cell DA methods using matching bulk ATAC-seq as a reference, before and after binarization (n = 16 comparisons). Inset text shows the median AUCC. b Number of DA peaks detected between randomly assigned replicates at 5% FDR in random comparisons of published scATAC-seq data, before and after binarization (n = 21 comparisons). Inset text shows the median number of DA peaks. c As in b but in downsampled bulk ATAC-seq libraries. d As in b but in model-based simulations of scATAC-seq data. e Mean read depth of the top-1000 DA peaks identified by each single-cell DA method in published scATAC-seq datasets, before and after binarization. Inset text shows the median. f As in e but showing the proportion of cells in which these peaks are open. g As in e but showing the width of each peak. h Effect size (Cohen’s d) of alternative approaches to normalization of scATAC-seq data on the AUCC between single-cell and bulk ATAC-seq DA, relative to log-TP10K normalization (n = 16 comparisons). i As in h but showing the number of DA peaks detected between randomly assigned replicates at 5% FDR in randomized comparisons of published scATAC-seq data (n = 21 comparisons). j As in i but in downsampled bulk ATAC-seq libraries. k As in i but in model-based simulations of scATAC-seq data. l As in h but showing the mean read depth of the top-1000 DA peaks identified by each single-cell DA method in published scATAC-seq datasets. m As in l but showing the proportion of cells in which these peaks are open. n As in l but showing the width of each peak. o Number of cells considered per comparison, before and after controlling for technical covariates using the ArchR background-matching procedure (n = 322 comparisons). Inset text shows the median. p Area under the concordance curve (AUCC) for single-cell DA methods using matching bulk ATAC-seq as a reference, before and after controlling for technical covariates using the ArchR background-matching procedure (n = 16 comparisons). q Effect size (Cohen’s d) of the ArchR background-matching procedure on the AUCC between single-cell and bulk ATAC-seq.

We next asked whether binarizing scATAC-seq data could mitigate the appearance of false discoveries. This analysis yielded ambiguous results (Fig. 6b–d and Supplementary Fig. 12b–d). A subset of DA methods, including SnapATAC’s findDAR test and negative binomial regression, consistently produced fewer false discoveries when applied to binarized data, although the magnitude of this decrease varied markedly across experiments. For the remaining DA methods, binarization did not have a consistent effect on the number of false discoveries.

We then investigated whether binarization modulated the biases of DA methods. Interestingly, in published scATAC-seq data, binarization attenuated the biases of some DA methods towards calling highly accessible peaks as DA (Fig. 6e, f and Supplementary Fig. 13a, b), although this attenuation was less apparent for methods that aggregated cells to form pseudobulks, since these biases were not apparent to begin with. Moreover, we observed that, for virtually every DA method, binarizing scATAC-seq data led to narrower peaks being called as DA (Fig. 6g and Supplementary Fig. 13c).

We additionally sought to specifically characterize the peaks that were spuriously called as DA before and after binarization, but found that binarization did not alter the more general tendency for false discoveries to preferentially arise from more accessible and wider peaks (Supplementary Figs. 14–16).

Collectively, these results suggest that binarization can improve the biological accuracy of DA analysis, an effect observed for virtually every DA method included in our analysis. This observation is at odds with the argument that binarization is an unnecessary step that discards quantitative data collected in scATAC-seq experiments31,32.
We suggest that this apparent paradox can be reconciled by the observation that binarization generally reduces the biases of DA methods towards more accessible and wider peaks, even for DA methods that are less affected by these biases in the first place, and that mitigating these biases can outweigh the potential negative effects of binarization.

Another important open question in the analysis of scATAC-seq data is whether and how these data should be normalized prior to DA analysis. The majority of the DA methods considered in this study operate directly on count matrices or binarized data. However, three methods (t-test, Wilcoxon rank-sum test, and LRclusters) are generally applied to normalized data, with the expectation that normalization is required to produce biologically accurate results. We therefore surveyed the literature to identify the approaches to normalization that are most commonly applied to scATAC-seq data, and then evaluated the impact of normalization on the performance of these three DA methods.

We first tested the impact of normalization on the biological accuracy of DA analysis, as quantified by the concordance with matching bulk ATAC-seq. In this analysis, no approach consistently improved over log-TP10K normalization (Fig. 6h and Supplementary Fig. 17a). The lone exception was the combination of TF-IDF normalization and the t-test, which demonstrated improved concordance relative to log-TP10K normalization.

We then asked whether specific approaches to normalization could mitigate the appearance of false discoveries. We found that the number of false discoveries was consistently reduced after applying TP10K or TP(median) normalization (Fig. 6i–k and Supplementary Fig. 17b–d). However, these two approaches did not improve, and in some cases decreased, the concordance between single-cell and bulk ATAC-seq data (Fig. 6h). On the other hand, the combination of TF-IDF normalization with the t-test, which increased the concordance between single-cell and bulk ATAC-seq data, also increased the number of false discoveries.

We further studied the relationship between normalization and the biases of DA methods. Interestingly, this analysis showed that numerous approaches to normalization can reduce the biases of DA methods towards highly accessible or broad peaks, relative to log-TP10K normalization (Fig. 6l–n and Supplementary Fig. 18). We additionally sought to specifically characterize the peaks that were spuriously called as DA under different normalization strategies, but this analysis failed to identify trends that were consistent across both DA methods and normalization strategies, and none altered the global pattern whereby false discoveries preferentially arose from more accessible and wider peaks (Supplementary Figs. 19–26).

Collectively, these results do not provide strong support for any alternative approach to normalization as compared to log-TP10K, which is the default approach implemented in several of the most widely used analysis packages33,35 for DA analysis. This conclusion is in line with a more systematic benchmark of normalization approaches for scRNA-seq data, which also highlighted the strong performance of log-TP10K normalization64.
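For orientation, the sketch below implements two of the normalization schemes discussed above, log-TP10K and one common TF-IDF variant, on a dense cells-by-peaks count matrix. Several TF-IDF formulations exist, so this should be read as an illustrative assumption rather than the specific variant evaluated here.

```python
import numpy as np

def log_tp10k(counts):
    """log1p of counts scaled to 10,000 per cell (log-TP10K)."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / depth * 1e4)

def tf_idf(counts):
    """One common TF-IDF variant for scATAC-seq: term frequency (per-cell
    peak proportions) times inverse document frequency, where 'documents'
    are cells and 'terms' are peaks."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / counts.sum(axis=1, keepdims=True)
    n_cells = counts.shape[0]
    idf = np.log1p(n_cells / (1 + (counts > 0).sum(axis=0)))
    return tf * idf
```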
Last, we evaluated a procedure proposed by the authors of ArchR to control for technical artefacts in DA analysis. For any given set of ‘foreground’ cells (e.g., cells of a particular type), ArchR constructs a set of ‘background’ cells of equal size that are matched according to a series of technical properties; by default, TSS enrichment and log10(number of fragments) are used to select the matching ‘background’ cells. This approach inherently entails a trade-off whereby correcting for potentially confounding artefacts comes at the cost of using only a subset of the available data for analysis. Indeed, across all comparisons, we found that the ArchR background-matching procedure substantially reduced the number of cells considered in any given DA analysis, from a mean of 4495 to 1264 cells per comparison (Fig. 6o). This decrease in the number of cells considered appeared to outweigh the positive impacts of controlling for technical variation, since we identified a uniform decrease in the AUCC across all DA methods considered (Fig. 6p, q).

Experiment 7: Data requirements for single-cell DA analysis

The scale of scATAC-seq datasets is increasing exponentially (Fig. 1b). On one hand, the availability of hundreds of thousands of cells per dataset could increase the statistical power of DA analysis. On the other hand, past work in single-cell transcriptomics65 reported that the number of cell types identified per study was closely linked to the total number of individual cells sequenced in that study. This observation suggests that the increased statistical power afforded by a greater number of cells could be offset by an increased granularity of DA comparisons, each leveraging proportionally fewer cells. Moreover, many of the largest datasets are very sparse: for instance, the single-cell map of chromatin accessibility across 30 tissues reported by Zhang et al.26 contained a median of just ~2800 fragments per cell. These trends raise the question of what minimum sequencing depth or number of cells is required for accurate DA analysis.

To address these questions, we performed downsampling analyses of the sequencing depth or the number of cells per dataset. We first studied the effect of sequencing depth on the biological accuracy of DA analysis by downsampling the plate-based datasets used in Experiment 1 to a mean of 500, 1000, 2000, 5000, or 10,000 counts per cell. We quantified the effect of downsampling on the AUCC by calculating Cohen’s d relative to the DA analysis at 10,000 counts per cell. We identified a decrease in the concordance between single-cell and bulk DA in each case, as expected, with no evidence of saturation, suggesting that DA analysis continues to benefit from deeper sequencing up to at least 10,000 fragments per cell (Fig. 7a and Supplementary Fig. 27a).

Fig. 7: Data requirements for single-cell DA analysis. a Effect size (Cohen’s d) of downsampling plate-based scATAC-seq data to a mean of 500, 1000, 2000, or 5000 counts per cell on the AUCC for single-cell DA methods using matching bulk ATAC-seq as a reference, relative to DA analysis of the same datasets with a mean of 10,000 counts per cell (n = 16 comparisons). Inset text shows the median Cohen’s d. b Effect size (Cohen’s d) of downsampling single-cell multi-omics data to 20, 50, 100, 200, or 500 cells per condition on the AUCC between the ATAC and RNA modalities, relative to DA analysis of the same datasets with 1000 cells per condition (n = 306 comparisons). Inset text shows the median Cohen’s d.
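One simple way to perform the per-cell depth downsampling described above is binomial thinning of the count matrix, sketched below. The target-depth parameter and the use of a single global thinning probability are illustrative assumptions, not the exact procedure used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample_counts(counts, target_mean_depth):
    """Binomially thin a cells-by-peaks count matrix to a target mean depth.

    A single thinning probability is applied to every entry, chosen so that
    the expected mean depth across cells equals `target_mean_depth` (capped
    at 1 if the data are already shallower than the target).
    """
    counts = np.asarray(counts, dtype=np.int64)
    keep_prob = min(1.0, target_mean_depth / counts.sum(axis=1).mean())
    return rng.binomial(counts, keep_prob)
```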
We next investigated the minimum number of cells required for accurate DA analysis. To this end, we downsampled the number of cells within the droplet-based datasets used in Experiment 2. We limited our analysis to comparisons in which at least 1000 cells were sequenced in each experimental condition, and quantified the effect of downsampling on the AUCC by calculating Cohen’s d relative to the DA analysis on 1000 cells per condition. In this analysis, the AUCC began to saturate when exceeding 300 cells per condition, although the degree and rate of saturation varied across individual DA methods (Fig. 7b and Supplementary Fig. 27b). These findings suggest that accurate single-cell DA analysis can be achieved with as few as 50–100 cells, and that sequencing more than 300 cells per condition yields diminishing returns.

Experiment 8: Scalability of single-cell DA methods

The scale of single-cell epigenomics experiments is continuing to increase. This observation underscores the need to evaluate the computational scalability of methods for DA analysis. Accordingly, we measured both the wall time and the peak memory usage of each single-cell DA method in Experiments 1 and 2. We observed that the single-cell DA methods compared here varied by several orders of magnitude in both runtime and memory usage (Fig. 8a–d). Negative binomial regression, both variants of logistic regression, and the t-test typically required more than an hour to perform a single comparison within multiome datasets. Conversely, pseudobulk DA methods and SnapATAC’s findDAR test were consistently among the most time- and memory-efficient methods, suggesting that these methods could most readily scale to datasets comprising millions of cells. Moreover, whereas the computational requirements of all DA methods scaled with the number of cells in the dataset, this relationship was attenuated for these same five methods (Supplementary Fig. 28).

Fig. 8: Scalability of single-cell DA methods. a Wall clock time required by each DA method to execute each comparison in Experiment 1 (n = 16 comparisons). Inset text shows the median runtime in minutes. b Peak memory usage of each DA method while executing each comparison in Experiment 1 (n = 16 comparisons). Inset text shows the median peak memory usage in GB. c As in a but for each comparison in Experiment 2 (n = 306 comparisons). d As in b but for each comparison in Experiment 2 (n = 306 comparisons). e Wall clock time required by alternative implementations of three DA methods (t-test, Wilcoxon rank-sum test, and negative binomial regression). Inset text shows the median runtime in minutes. ***p < 10⁻¹⁵, two-sided paired t-test. f As in e but showing peak memory usage by each alternative implementation. Inset text shows the median peak memory usage in GB. ***p < 10⁻¹⁵, two-sided paired t-test.

For some of the DA methods considered in our analysis, different implementations are provided in widely-used software packages, some of which have been reported to offer speed-ups of several orders of magnitude66. These methods include the Wilcoxon rank-sum test (Signac vs. ArchR), the t-test (Signac vs. ArchR), and negative binomial regression (Signac vs. glmGamPoi67). For these three DA methods, we tested both available implementations. We observed that optimized implementations of these methods can markedly reduce the computational resources required to execute any given DA analysis (Fig. 8e, f).
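A minimal sketch of how wall time and peak memory might be recorded for a single DA call in Python is shown below, using the standard-library time and tracemalloc modules. This is a generic harness under stated assumptions (the benchmarks reported here were run on R-based implementations), with `run_da_method` standing in for any DA function.

```python
import time
import tracemalloc

def profile_da_call(run_da_method, *args, **kwargs):
    """Measure wall time (seconds) and peak memory (MB) of one DA call.

    `run_da_method` is a placeholder for any function performing a DA
    comparison. Note that tracemalloc tracks Python allocations only, so
    the reported peak is a rough proxy for process-level peak memory.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = run_da_method(*args, **kwargs)
    wall_seconds = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, wall_seconds, peak_bytes / 1e6
```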
Performance of DA methods across experiments

To summarize our findings, we divided DA methods into top-, middle-, and bottom-performing terciles on each task that involved a quantitative comparison of DA methods (Fig. 9). This visualisation reinforced the excellent performance of pseudobulk DA methods across all of the experiments in this study. We suggest that these methods should be considered a first-choice approach for the DA analysis of scATAC-seq data. We did not include mixed models in our primary analyses because they had not been used in a published study at the time of our literature review; however, we hypothesize that these models may also offer improved performance relative to the methods that are most widely used at present (Supplementary Fig. 4e), albeit at the cost of substantially increased runtime and memory requirements49.

Fig. 9: Summary of DA method performance across major evaluation criteria. Methods were grouped into terciles of high, intermediate, or low performance on the basis of their quantitative performance on each task, as described in the Methods, and ranked by their average performance across all criteria.

Exploratory analyses

During the peer review of this manuscript, the reviewers suggested additional analyses that had not been preregistered. In this section, we present the results of these post hoc, unregistered analyses.

First, we found that the concordance between single-cell and bulk ATAC-seq data was substantially higher for peaks located in promoter regions than for those located in enhancers (Supplementary Fig. 1f). Although our data do not allow us to establish a definitive causal explanation for this phenomenon, we found that peaks in promoter regions tended to be open in a greater number of cells, and supported by a greater number of reads, than peaks in enhancer regions (Supplementary Fig. 29). These observations are congruent with observations made in both bulk and single-cell ATAC-seq data2,32 and likely explain, at least in part, the greater biological accuracy of DA analysis for peaks in promoter regions. When considered in combination with the observation that the accuracy of DA analysis continues to benefit from increased sequencing depth up to at least 10,000 fragments per cell (Fig. 7a), these observations raise the possibility that increased sequencing depth would specifically benefit DA analysis of enhancers, although this possibility was not directly tested in the present study.

Second, we observed that filtering peaks by log-fold change between conditions did not consistently improve, and frequently decreased, the accuracy of DA analysis (Fig. 5). Our preregistered analysis involved removing equal proportions of peaks within each dataset on the basis of a percentile threshold. The question arose as to whether imposing a fixed fold-change threshold instead would affect our conclusions. Therefore, we repeated these analyses while filtering peaks with less than a two- or three-fold change between conditions. In these experiments, we again observed that log-fold change filtering did not improve the concordance of DA analyses of matched single-cell and bulk ATAC-seq (Supplementary Fig. 30), or between scATAC-seq and scRNA-seq from the same cells (Supplementary Fig. 31), and often substantially increased the number of false discoveries (Supplementary Fig. 32), all of which are consistent with the results of our initial, prespecified analyses.
Third, the design of Experiments 1, 2, and 3 was preregistered before the results of Experiment 6 became available. In view of the observation that the performance of some DA methods benefited from binarization and/or alternative approaches to normalization, the question arose as to how these preprocessing strategies would affect the summary of DA methods shown in Fig. 9. A similar question arose with respect to the observation that the concordance between single-cell and matched bulk ATAC-seq was generally higher when calling peaks in ‘pseudobulks’ created from the single-cell data. To explore these points, we produced additional summary figures that compared DA methods across all primary and secondary analyses in Experiments 1 and 2 (Supplementary Fig. 33) and across all binarization and normalization scenarios in Experiment 6 (Supplementary Fig. 34). These analyses corroborated the observation that some DA methods consistently achieved very good or very poor performance: for instance, in Experiments 1 and 2, DESeq2-LRT ranked among the top tercile of DA methods in all but two analyses, whereas negative binomial regression, the binomial test, and Fisher’s exact test never ranked among the top tercile in any analysis. Considering all binarization and normalization approaches yielded rankings of DA methods that were broadly consistent with those shown in Fig. 9, but with some important differences: notably, SnapATAC::findDAR emerged as a top-performing DA method with respect to concordance with matched bulk ATAC-seq data, albeit not with respect to control of the false discovery rate, when applied to binarized data.
