MOCHA’s advanced statistical modeling of scATAC-seq data enables functional genomic inference in large human cohorts

Inclusion and ethical considerationsAll study activities were approved by institutional review boards at the participating institutions where required. Informed consent was obtained from all participants at the Seattle Vaccine Trials Unit to participate in the study and to publish their corresponding research data. Two participants declined to publish their raw sequencing data. All human data are anonymized, and the human cohort data used in this manuscript had either prior IRB approval or were publicly available.Longitudinal COVID-19 cohortWe recruited in the greater Seattle area n = 18 participants (10 females and 8 males, aged 22–79 years) who tested positive (COVID+) for SARS-CoV-2 virus (Wuhan strain) and n = 23 uninfected (COVID-) participants (10 females and 13 males, aged 29–77 years) for our longitudinal COVID-19 study38, “Seattle COVID-19 Cohort Study to Evaluate Immune Responses in Persons at Risk and with SARS-CoV-2 Infection”. Due to recruitment during the pandemic, there is a bias towards first responders. All COVID+ participants had mild to moderate symptoms. Peripheral blood mononuclear cell (PBMC) and serum samples were collected from the COVID- participants at a single time point and from the COVID+ participants at 3–5 time points over a period of 1–121 days post-symptom-onset (PSO, total samples n = 70). Study data were collected and managed using REDCap electronic data capture tools hosted at Fred Hutchinson Cancer Research Center (FHCRC). The FHCRC Institutional Review Board (IRB) approved the studies and procedures. Informed consent was obtained from all participants at the Seattle Vaccine Trials Unit to participate in the study and to publish their corresponding research data. Two participants declined to publish their raw sequencing data. Sex of participants was determined based on self-reporting. Participants were not compensated for being in this study.COVID19 single-cell ATAC-seqPBMC isolationBlood collected in acid citrate dextrose tubes was transferred to Leucosep tubes (Greiner Bio One). The tube was centrifuged at 800–1000x g for 15 min and the PBMC layer recovered above the frit. PBMCs were washed twice with Hanks Balanced Solution without Ca+ or Mg+ (Gibco) at 200–400 × g for 10 min, counted, and aliquoted in heat-inactivated fetal bovine serum with 10% dimethylsulfoxide (DMSO, Sigma) for cryopreservation. PBMCs were cryopreserved at −80 °C in Stratacooler (Nalgene) and transferred to liquid nitrogen for long-term storage.FACS neutrophil depletionTo remove dead cells, debris, and neutrophils prior to scATAC-seq, PBMC samples were sorted by fluorescence-activated cell sorting (FACS) prior to cell permeabilization as described previously97. Cells were incubated with Fixable Viability Stain 510 (BD, 564406) for 15 min at room temperature and washed with AIM V medium (Gibco, 12055091) plus 25 mM HEPES before incubating with TruStain FcX (BioLegend, 422302) for 5 min on ice, followed by staining with mouse anti-human CD45 FITC (BioLegend, 304038, 1.0 uL/10e6 cells) and mouse anti-human CD15 PE (BD, 562371, 1.0 uL/10e6 cells) antibodies for 20 min on ice. Cells were washed with AIM V medium plus 25 mM HEPES and sorted on a BD FACSAria Fusion. A standard viable CD45+ cell gating scheme was employed: FSC-A x SSC-A (to exclude sub-cellular debris), two FSC-A doublet exclusion gates (FSC-W followed by FSC-H), dead cell exclusion gate (BV510 LIVE/DEAD negative), followed by CD45+ inclusion gate. Neutrophils (defined as SSC high, CD15+) were then excluded in the final sort gate. An aliquot of each post-sort population was used to collect 50,000 events to assess post-sort purity.Sample preparationPermeabilized-cell scATAC-seq was performed as described previously97. A 5% w/v digitonin stock was prepared by diluting powdered digitonin (MP Biomedicals, 0215948082) in DMSO (Fisher Scientific, D12345), which was stored in 20 μL aliquots at −20 °C until use. To permeabilize, 1 × 106 cells were added to a 1.5 mL low binding tube (Eppendorf, 022431021) and centrifuged (400 × g for 5 min at 4 °C) using a swinging bucket rotor (Beckman Coulter Avanti J-15RIVD with JS4.750 swinging bucket, B99516). Cells were resuspended in 100 μL cold isotonic Permeabilization Buffer (20 mM Tris-HCl pH 7.4, 150 mM NaCl, 3 mM MgCl2, 0.01% digitonin) by pipette-mixing 10 times, then incubated on ice for 5 min, after which they were diluted with 1 mL of isotonic Wash Buffer (20 mM Tris-HCl pH 7.4, 150 mM NaCl, 3 mM MgCl2) by pipette-mixing five times. Cells were centrifuged (400 × g for 5 min at 4 °C) using a swinging bucket rotor, and the supernatant was slowly removed using a vacuum aspirator pipette. Cells were resuspended in chilled TD1 buffer (Illumina, 15027866) by pipette-mixing to a target concentration of 2300–10,000 cells per μL. Cells were filtered through 35 μm Falcon Cell Strainers (Corning, 352235) before counting on a Cellometer Spectrum Cell Counter (Nexcelom) using ViaStain acridine orange/propidium iodide solution (Nexcelom, C52-0106-5).Tagmentation and fragment capturescATAC-seq libraries were prepared according to the Chromium Single Cell ATAC v1.1 Reagent Kits User Guide (CG000209 Rev B) with several modifications. 15,000 cells were loaded into each tagmentation reaction. Permeabilized cells were brought to a volume of 9 μl in TD1 buffer (Illumina, 15027866) and mixed with 6 μl of Illumina TDE1 Tn5 transposase (Illumina, 15027916). Transposition was performed by incubating the prepared reactions on a C1000 Touch thermal cycler with 96– Deep Well Reaction Module (Bio-Rad, 1851197) at 37 °C for 60 min, followed by a brief hold at 4 °C. A Chromium NextGEM Chip H (10x Genomics, 2000180) was placed in a Chromium Next GEM Secondary Holder (10x Genomics, 3000332) and 50% Glycerol (Teknova, G1798) was dispensed into all unused wells. A master mix composed of Barcoding Reagent B (10x Genomics, 2000194), Reducing Agent B (10x Genomics, 2000087), and Barcoding Enzyme (10x Genomics, 2000125) was then added to each sample well, pipette-mixed, and loaded into row 1 of the chip. Chromium Single Cell ATAC Gel Beads v1.1 (10x Genomics, 2000210) were vortexed for 30 s and loaded into row 2 of the chip, along with Partitioning Oil (10x Genomics, 2000190) in row 3. A 10x Gasket (10x Genomics, 370017) was placed over the chip and attached to the Secondary Holder. The chip was loaded into a Chromium Single Cell Controller instrument (10x Genomics, 120270) for GEM generation. At the completion of the run, GEMs were collected and linear amplification was performed on a C1000 Touch thermal cycler with 96–Deep Well Reaction Module: 72 °C for 5 min, 98 °C for 30 s, 12 cycles of 98 °C for 10 s, 59 °C for 30 s and 72 °C for 1 min.Sequencing library preparationGEMs were separated into a biphasic mixture through addition of Recovery Agent (10x Genomics, 220016); the aqueous phase was retained and removed of barcoding reagents using Dynabead MyOne SILANE (10x Genomics, 2000048) and SPRIselect reagent (Beckman Coulter, B23318) bead clean-ups. Sequencing libraries were constructed by amplifying the barcoded ATAC fragments in a sample indexing PCR consisting of SI-PCR Primer B (10x Genomics, 2000128), Amp Mix (10x Genomics, 2000047), and Chromium i7 Sample Index Plate N, Set A (10x Genomics, 3000262) as described in the 10x scATAC User Guide. Amplification was performed in a C1000 Touch thermal cycler with 96–Deep Well Reaction Module: 98 °C for 45 s, for 9 to 11 cycles of: 98 °C for 20 s, 67 °C for 30 s, 72 °C for 20 s, with a final extension of 72 °C for 1 min. Final libraries were prepared using a dual-sided SPRIselect size-selection cleanup. SPRIselect beads were mixed with completed PCR reactions at a ratio of 0.4x bead:sample and incubated at room temperature to bind large DNA fragments. Reactions were incubated on a magnet, and the supernatant was then transferred and mixed with additional SPRIselect reagent to a final ratio of 1.2x bead:sample (ratio includes first SPRI addition) and incubated at room temperature to bind ATAC fragments. Reactions were incubated on a magnet, the supernatant containing unbound PCR primers and reagents was discarded, and DNA-bound SPRI beads were washed twice with 80% v/v ethanol. SPRI beads were resuspended in Buffer EB (Qiagen, 1014609), incubated on a magnet, and the supernatant was transferred resulting in final, sequencing-ready libraries.Quantification and sequencingFinal libraries were quantified using a Quant-iT PicoGreen dsDNA Assay Kit (Thermo Fisher Scientific, P7589) on a SpectraMax iD3 (Molecular Devices). Library quality and average fragment size were assessed using a Bioanalyzer (Agilent, G2939A) High Sensitivity DNA chip (Agilent, 5067-4626). Libraries were sequenced on the Illumina NovaSeq platform with the following read lengths: 51nt read 1, 8nt i7 index, 16nt i5 index, 51nt read 2.Data preprocessingscATAC-seq libraries were processed as described previously97. In brief, cellranger-atac mkfastq (10x Genomics v1.1.0) was used to demultiplex BCL files to FASTQ. FASTQ files were aligned to the human genome (10x Genomics refdata-cellranger-atac-GRCh38-1.1.0) using cellranger-atac count (10x Genomics v1.1.0) with default settings. Fragment positions were used to quantify reads overlapping a reference peak set (GSE123577_pbmc_peaks.bed.gz from GEO accession GSE12357798), which was converted from hg19 to hg38 using the liftOver package for R99, ENCODE reference accessible regions (ENCODE file ID ENCFF503GCK100), and TSS regions (TSS ± 2 kb from Ensembl v93101 for each cell barcode using a bedtools (v2.29.1102) analytical pipeline.Quality controlCustom R scripts were used to remove cells with less than 1,000 uniquely aligned fragments, less than 20% of fragments overlapping reference peak regions, less than 20% of fragments overlapping ENCODE TSS regions, and less than 50% of peaks overlapping ENCODE reference regions. The ArchR package11 was used to assess doublets in scATAC data. Doublets were identified using the ScoreDoublets function using a filter ratio of 8, and cells with a Doublet Enrichment score exceeding 1.3 as determined by ArchR’s doublet detection algorithm11 were not considered for downstream analysis.Dimensionality reduction and cell type labelingWe used the ArchR package to generate a count matrix for a PBMC reference peak set98. Dimensionality reduction was performed using the ArchR addIterativeLSI function (parameters varFeatures = 10,000, iterations = 2), and the addClusters function was used to identify clusters in latent semantic indexing (LSI) dimensions using the Louvain community detection algorithm. For visualization, Uniform Manifold Approximation and Projection66 (UMAP) was performed using ArchR’s addUMAP function at the default settings. The ArchR addGeneIntegrationMatrix function (parameters transferParams = list(dims = 1:10, k.weight = 20)) was used to label scATAC cells using the Seurat level 1 cell types from the Seurat v4.0 PBMC reference dataset103. To generate clusters that more closely matched label transfer results, we performed K-means clustering on the UMAP coordinates using 3 to 50 cluster centers and identified a set of clusters that each had > 80% of cells sharing a single cell type identity. Almost all such clusters contained ≥98% cells from a single major cell type (T cells, B cells, NK cells, or monocytes/DCs/other), with the exception of a single cluster with 88% purity. We used clusters of the same major cell type to subset the data into T cells, B cells, NK cells, or monocytes/DCs/other for downstream analyses. For each major cell type, we repeated the same dimensionality reduction (LSI/UMAP) process on the scATAC-seq data with the same settings. We then performed a second round of label transfer, using the ArchR addGeneIntegrationMatrix function (same parameters as described above for level 1), to reach level 2 and 3 cell labelings of the Seurat PBMC reference dataset. These labels were consolidated into 25 cell types for most analysis, except for the co-accessibility analysis where 17 cell types were used to match the published promoter-capture HiC resource54. The median cell labeling score across all cells that passed quality control was 0.74.Three Human scATAC-seq datasets for MOCHA development and benchmarkingCOVID19 datasetTwo samples (1 COVID- sample, male; 1 COVID+ sample, female, collected on day 12 PSO) from our longitudinal COVID-19 cohort were lost due to low sample volume. The scATAC-seq data of the remaining samples was denoted as the COVID19 dataset (n = 91) in this study. After removing doublets and cells of poor quality, high quality data of 1,311,638 cells were obtained. The data was split into two overlapping subsets for some analyses: 1) A cross-sectional dataset (denoted as COVID19X, n = 39) included data of COVID- samples (n = 22, 10 females and 12 males) and the first samples of COVID+ participants (n = 17, 9 females and 8 males) during early infection (<16 days PSO). 2) A longitudinal dataset (denoted as COVID19L, n = 69) included all data for the 18 COVID+ participants (10 females and 8 males). The overlap between COVID19X and COVID19L was 17, which were the first samples of COVID+ participants. The full dataset (COVID19) can be accessed at GEO under accession number GSE173590.HealthyDonor datasetThis longitudinal scATAC-seq dataset44 was collected on 18 PBMC samples of 4 healthy donors (aged 29–39 years) over 6 weeks (1 female and 1 male, weeks 2–7; 2 males, weeks 2, 4, and 7). The donors had no diagnosis of active or chronic disease during the study. The data is publicly available at GEO under accession number GSE190992. We used the dataset as is, except we removed cells with doublet enrichment score exceeding 1.3, based on ArchR’s doublet detection algorithm11. High quality data of 145,711 cells were obtained. From this dataset, we consolidated existing annotations into 25 cell types with a published median cell labeling score of 0.78.Hematopoiesis datasetThis dataset was downloaded from (https://www.dropbox.com/s/sijf2votfej629t/Save-Large-Heme-ArchRProject.tar.gz). It consists of ~220,000 hematopoietic cells from the hematopoiesis dataset in ArchR11. As described in their Supplementary Table 1, the dataset was assembled from 49 samples in four data sources, of different sample types (mixed, sorted, and unsorted cells; PBMCs; and bone marrow mononuclear cells), and generated using different sample processing protocols and on different technical platforms. We used ArchR to generate doublet scores and removed clusters with both high doublet scores and a mixture of disparate cell types. This doublet removal only applied to sequencing wells that were not sorted, purified cells. In the end, data of 95,599 cells were obtained. Because many of the cell types were sorted populations run on an individual well, many cell types were not available across samples. As a result, we treated all cell types as coming from a single sample for benchmarking purposes. We used previously published cell annotations, with a median labeling score of 0.70.Mouse Pan-Organ scATAC AtlasThis dataset (GEO: GSE111586) was downloaded from https://atlas.gs.washington.edu/mouse-atac/. We performed doublet removal and then used the data for benchmarking purposes on non-human scATAC-seq datasets. The original cell type labels were used in the analysis.Assessing dataset noise using the Altius PeaksetTo assess ‘noise’ within a dataset (i.e., fragments from closed regions known as heterochromatin), we used the Altius consensus peakset100 of over 3.6 million DNase I hypersensitive sites within the human genome as an approximation of all potentially accessible sites. We calculated the overlap rate between fragments and the Altius consensus peakset for each cell type and dataset, in order to assess the quality of the data. The COVID19 dataset had an median Altius peakset overlap rate of 88.9%, while the corresponding rates for the HealthyDonor dataset and the Hematopoiesis dataset were 75.5% and 82%, respectively.MOCHA overviewMOCHA is implemented as an open-source R package under the GPLv3 license in CRAN. All code and development versions of MOCHA are available in MOCHA’s github repository.MOCHA is designed to run in-memory and interoperate with common Bioconductor methods and classes (e.g., RaggedExperiment, MultiAssayExperiment, and Summarized Experiment). It takes as input four objects that are commonly generated from scATAC-seq after cell labeling and the removal of doublets and cells of low quality data. These four objects are: 1) a list of GRanges or GRangesList containing per-sample ATAC fragments, 2) cell metadata with cell labels, 3) a BSGenome annotation object for the organism, and 4) a GRanges containing blacklisted regions. These inputs can be passed to MOCHA directly from an ArchR object. Alternatively, results can be extracted from Signac, SnapATAC, or ArchR, and converted to common Bioconductor data objects, which can then be imported into MOCHA. By operating on well-supported Bioconductor objects, MOCHA’s inputs and outputs are compatible with the broader R ecosystem for sequencing analyses, and are easily exportable to genomic file formats such as BED and BAM.MOCHA’s core functionality runs as a pipeline from these inputs to perform sample-specific open tile prediction and consensus analysis, resulting in a TSAM represented as a Bioconductor RangedSummarizedExperiment. On systems with sufficient memory, MOCHA’s functions can be parallelized over samples with the ‘numCores’ parameter to decrease runtime. From the TSAM, MOCHA provides functions for ZI co-accessibility and ZI differential accessibility analysis. The format of the TSAM output enables additional downstream analyses with other R packages. For example, the TSAM RangedSummarizedExperiment can be used directly as the counts matrix input for motif deviations analysis with the chromVAR R package. Additional details on the workflow and functions on the MOCHA package are provided in Supplementary Fig. 1.MOCHA objectsMOCHA’s workflow generates two major objects: an OpenTiles Object, and a Tile-Sample Accessibility Matrix (TSAM) object. An OpenTiles Object is created by callOpenTiles and stores tile accessibility information at the sample and cell type level, along with sample-specific metadata in a MultiAssay Experiment, which contains:From there, user-defined cutoffs are used to identify study-level open chromatin regions. These regions are used to generate tile-by-sample accessibility matrices (i.e., TSAMs) for downstream analysis. These TSAMs are saved in a SummarizedExperiment format, containing:These TSAM can then be integrated with ArchR, ChromVar, and other Bioconductor-compatible R packages for downstream analyses.Tiling the genomeMOCHA splits the genome into pre-defined, non-overlapping 500 base-pair tiles that remain invariant across samples and cell types. MOCHA annotates each tile using a user-provided transcript database (e.g., HG38 Transcript Database) as follows: Promoter regions are 2000 bp upstream and 200 bp downstream from transcriptional start sites104. Intragenic regions are tiles that fall within a gene body, but not within the promoter regions. All other regions are classified as distal.From there, we only consider tiles that overlap with ATAC fragments. MOCHA counts the number of fragments in a tile as follows$${f}_{i,j,k,t}= {\rm{the}}\; {\rm{number}}\; {\rm{of}}\; {\rm{fragments}}\; {\rm{on}}\; {\rm{sample}}\; {\rm{i}},\, {\rm{of}}\; {\rm{cell}}\; {\rm{type}}\; {\rm{j}},\\ {\rm{in}}\; {\rm{cell}}\; {\rm{k}},\, {\rm{overlapping}}\; {\rm{with}}\; {\rm{tile}}\; {\rm{t}}$$
(1)
$$\begin{array}{c}{F}_{i,j}={\rm{the}}\; {\rm{total}}\; {\rm{number}}\; {\rm{of}}\; {\rm{fragments}}\; {\rm{on}}\; {\rm{sample}}\; {\rm{i}},\, {\rm{of}}\; {\rm{cell}}\; {\rm{type}}\; {\rm{j}}\\ \end{array}$$
(2)
If a fragment falls between two tiles, it is counted on both tiles.NormalizationNormalization techniques using invariant CTCF sitesWe examined three normalizing approaches: dividing the number of fragments by 1) the total number of fragments for sample i, cell type j (i.e., ${F}_{i,j}$); 2) the total number of fragments for sample i (i.e., ${F}_{i}={\varSigma }_{j}{F}_{i,j}$), and 3) the total number of cells in sample i, cell type j. We evaluated the above normalization methods along with the raw data based on a list of 2230 cell-type invariant CCCTC-binding factor (CTCF) sites from the ChIP Atlas database105. These loci were identified in at least 201/204 (99%) of blood cell types present in the ChIP-seq Atlas database106. Using these CTCF sites, each approach was assessed based on the corresponding distribution of coefficient of variation (CV) in peak accessibility. MOCHA normalizes data using ${F}_{i,j}$.Sample- and cell type-specific normalizationFor each sample i, cell type j, and tile t, MOCHA calculates the following normalized features:$${{{\rm{\lambda }}}^{(1)}}_{{\rm{i}},{\rm{j}},{\rm{t}}} = \left({\varSigma }_{k}{f}_{i,j,k,t}/{F}_{i,j}\right)\times 1{0}^{9} \\ ={\rm{the}}\; {\rm{total}}\; {\rm{normalized}}\; {\rm{fragments}}\; {\rm{for}}\; {\rm{sample}}\; {\rm{i}},\, {\rm{cell}}\; {\rm{type}}\; {\rm{j}},\, {\rm{at}}\; {\rm{tile}}\; {\rm{t}}$$
(3)
$${{{\rm{\lambda }}}^{(2)}}_{{\rm{i}},{\rm{j}},{\rm{t}}}= (\max \{{f}_{i,j,k,t}\}_{k}/{F}_{i,j})\times 1{0}^{9} \\= {\rm{the}}\; {\rm{maximum}}\; {\rm{number}}\; {\rm{of}}\; {\rm{normalized}}\; {\rm{fragments}}\; {\rm{across}}\; {\rm{single}}\; {\rm{cells}},\\ {\rm{for}}\; {\rm{sample}}\; {\rm{i}},\, {\rm{cell}}\; {\rm{type}}\; {\rm{j}},\, {\rm{tile}}\; {\rm{t}}$$
(4)
Since the NK population used for model training contained 750 million fragments, a scaling factor of ${10}^{9}$ is applied to make the raw and normalized counts on the same scale across cellular abundances, and keep normalized values greater than 1 to minimize convergence errors in downstream model training. Biologically, λ(1)i,j,t is designed to capture the total number of fragments across all cells (e.g., pseudo-bulk), normalized by the sequencing depth for that cell type and sample. Given the sparsity of scATAC-seq data and the assumption of limited number of genomic copies (2x-4x) in a typical cell, λ(2)i,j,t is designed to capture the presence of multiple fragments in a tile from any cell, which can only be evaluated on single-cell data. This approach combines single cell and pseudo-bulk information for downstream prediction. Normalizing by ${F}_{i,j}$ is used to normalize both sequencing depth and cell population variability. This approach provides both a sample- and cell-type-specific normalization scheme.Evaluation of open chromatin accessibilityTraining of logistic regression models for predicting tile accessibilityMOCHA assumes a typical ploidy per cell (two to four copies of the genome). Its pseudocode and further details are provided in Supplementary Fig. 2 in order to allow for modifications when the above assumption does not hold.We used scATAC-seq data of 179k NK cells in the COVID19 (n = 91) dataset as the training dataset. First, we normalized the scATAC-seq data and collapsed it into pseudobulk data. Second, we applied MACS240 (‘-g hs -f BED –nolambda –shift -75 –extsize 150 –broad’, ‘–model -n’, using the parameters set in ArchR) to identify accessible peaks in the pseudobulk data. We chose these settings to match the behavior of MACS2 in a commonly-used scATAC data analysis software, modified to call broad rather than narrow peaks. The resulting peaks were then overlaid onto our pre-defined 500 bp tiles. We trim the broad peaks by 75 base pairs at each end and remove the tail ends of peaks that extend onto tiles with no signal. MACS2 identified 1.15 million tiles as ‘accessible regions’ and the remaining 4.39 million tiles with fragments were labeled as ‘inaccessible’. We labeled all other fragment-containing regions as inaccessible, and used these ‘accessible’ and ‘inaccessible’ regions for training. Third, we randomly selected NK cells at cell counts ranging from 170k to 5 at discrete intervals, generating 10 replicates for subsets <50k cells, and 5 replicates for larger subsets. In each of the subsets, we calculated λ(1)i,j,t and λ(2)i,j,t at individual tiles. Fourth, we trained a LRM for each selected subset of NK cells based on the tile labeling just described. The LRM was fitted using a logit ‘link’ function and applying the default loss function from the GLM logistic regression function in R. For each sample of ${n}_{a}$ cells, the LRM calculates a probability score to assess the likelihood of a tile being accessible, using the formula$${\Pr }^{({n}_{a})}({{\lambda }^{(1)}}_{i,j,t},{{\lambda }^{(2)}}_{i,j,t})=\frac{1}{1+\exp (-{b}_{0}^{({n}_{a})}-{b}_{1}^{({n}_{a})} * {{\lambda }^{(1)}}_{i,j,t}-{b}_{2}^{({n}_{a})} * {{\lambda }^{(2)}}_{i,j,t})}$$
(5)
Here ${b}_{0}^{({n}_{a})}$ is the intercept, ${b}_{1}^{({n}_{a})}$ and ${b}_{2}^{({n}_{a})}$ are coefficients for λ(1)i,j,t and λ(2)i,j,t, respectively. A tile is predicted as accessible if ${\Pr }^{({n}_{a})}\ge {\theta }^{({n}_{a})}$ or inaccessible if ${\Pr }^{({n}_{a})} \, < \, {\theta }^{({n}_{a})}$ where ${\theta }^{({n}_{a})}$ is the threshold value separating accessible and inaccessible tiles. Since it is not possible to balance open vs closed tiles in real-data, this leads to prevalence imbalances that need to be addressed. To account for this, we used the Youden index107 to calculate optimal ${\theta }^{({n}_{a})}$ in this study, using the cutpointR R package.Fifth, we collected $\{{b}_{0}^{({n}_{a})}\}$, $\{{b}_{1}^{({n}_{a})}\}$, $\{{b}_{2}^{({n}_{a})}\},\{{\theta }^{({n}_{a})}\}$ from the 10 or 5 replicated runs on n cells and then took the corresponding median coefficients, i.e., ${b}_{0}^{(n)}$, ${b}_{1}^{(n)}$, ${b}_{2}^{(n)}$, and ${\theta }^{(n)}$, to construct the predictive model for a sample of n cells. Finally, we used the learned coefficients and the learned thresholds to smoothen the model to interpolate the model across cellular abundances that the model was not trained on. The final model is composed of a set of smoothened coefficients $\{{b}_{0}^{(n)}\}$, $\{{b}_{1}^{(n)}\}$, $\{{b}_{2}^{(n)}\}$, and smoothened thresholds $\{{\theta }^{(n)}\}$ from all examined $n$.Model evaluation metricsTo quantify a model’s performance, we counted their true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs), and used these to calculate false positive rates, recall, and F1-score.$$ {\rm{False}}\; {\rm{Positive}}\; {\rm{Rate}}=1-{\rm{Precision}}=1-{\rm{TP}}/({\rm{TP}}+{\rm{FP}}),\\ {\rm{Recall}}={\rm{Sensitivity}}=1-{\rm{False} }\; {\rm{Negative}}\; {\rm{Rate}}={\rm{TP}}/({\rm{TP}}+{\rm{FN}}),\\ {\rm{F}}1 {\rm{score}}=2 * {\rm{TP}}/ (2 * {\rm{TP}}+{\rm{FP}}+{\rm{FN}}). \\ {\rm{Specificity}}=1-{\rm{FPR}} \\ {\rm{Youden}}{{\mbox{‘}}} {\rm{s}}\; {\rm{J}}={\rm{Sensitivity}}+{\rm{Specificity}}-1$$Prediction of tile accessibility on new dataTo predict accessibility in a new dataset, MOCHA first accounts for differences in sequencing depth and cell count across datasets and calculates the ratio, S, of the median (across samples) number of total fragments in the training data, and the corresponding median in the new dataset. MOCHA then scales both λ(1)i,j,t and λ(2)i,j,t in the new dataset by S and calculates the likelihood of a tile being accessible as$${\Pr }^{(n)}({{\lambda }^{(1)}}_{i,j,t},{\lambda }^{(2)}_{i,j,t},S)=\frac{1}{1+\exp [-{b}_{0}^{(n)}-{b}_{1}^{(n)} * (S\lambda ^{(1)}_{i,j,t})-{b}_{2}^{(n)} * (S{{\lambda }^{(2)}}_{i,j,t})]},$$
(6)
where $n$ is the number of cells of the targeted cell type in the targeted sample. The normalization and scaling factors enable the same smoothened coefficients to be used on new data, regardless of cell type or tile position. As before, a tile is predicted as accessible if ${\Pr }^{(n)}\ge {\theta }^{(n)}$ or inaccessible if ${\Pr }^{(n)} \, < \, {\theta }^{(n)}$ where ${\theta }^{(n)}$ is the threshold value separating accessible and inaccessible tiles.Benchmarking open regionsTo benchmark MACS2, HOMER, and MOCHA, we ran each tool per sample and cell type to generate comparable accessibility measurements across three cell types in three different datasets. For MACS2, we used the following parameters to call broad peaks (‘-g hs -f BED –nolambda –shift -75 –extsize 150 –broad –nomodel -n’), in accordance with previous published scATAC-seq settings11. For HOMER, we used the findPeaks function with default parameters, and added (‘-style histone’) to call broad peaks. While HOMER and MACS2 are primarily designed around the properties of ChIP-seq and DNase-seq, they are also recommended for use with bulk ATAC-seq41.To ensure head-to-head comparisons, we overlaid HOMER and MACS2’s peaks into MOCHA’s predefined 500 base-pair tiles to translate peak calls into open tile calls. Similar to training, we trimmed 75 bp off each end, as MACS2’s shift/extsize parameters extend fragments to improve peak calling under the -nomodel flag. By trimming, we avoid counting the tails of peaks that extend into tiles with no actual fragments. This trimming approach allows for a direct head-to-head comparison of open regions detected across methods. Additionally, all three methods are provided the same normalized pseudo-bulk intensity information to ensure comparable peak calling and prevent confounding peak calling and normalization. After translating MACS2 and HOMER peaks into open tiles, we then compared the number of open tiles per sample across all methods, cell types, and datasets.Next, we generated a TSAM for each cell type across all three methods. The TSAM is a matrix with an array-type structure, where each cell contains the normalized λ(1)i,j intensities for a given sample i, at tile j. We kept open tiles that were called in at least 20% of samples (or all tiles in Hematopoiesis). By generating a TSAM for each method, we compared reproducible, population-level open tiles across all three methods. The 20% threshold was applied to filter out noisy data.CTCF and TSS Sites for benchmarkingCTCF sites were drawn from the ChipSet Atlas105. In brief, we download a bed file containing CTCF peaks for all blood cell types, and then used Plyranges’s reduce_ranges108 function to collapse duplicate peak calls into one non-redundant and smaller file for detecting overlaps. This process was done for both Hg19 (n = 197,882) and Hg38 (n = 184,588). TSS sites were taken from Bioconductor database TxDb.Hsapiens.UCSC.hg19.knownGene for the Hematopoiesis dataset (which was aligned to Hg19), and TxDb.Hsapiens.UCSC.hg38.refGene for the other datasets by first extracting the transcripts for all genes. The TSS were then extracted from the transcripts using the promoters() command (Hg19, n = 62,265, Hg38, n = 88,819). We then calculated the number of tiles that overlapped with a CTCF and TSS site using the subsetByOverlaps function from the GenomicRanges104 R package.Runtime comparison on open chromatin analysisUsing the CD14 monocyte population in our COVID19 dataset (n = 91), we produced 13 subsamples ranging from 100,000 to 10 cells and measured 10 replicates of the time it takes to conduct open chromatin analysis. Our runtime comparisons were conducted on N2 machines on the Google Cloud Platform, with 64 vCPUs and 512GB RAM. MOCHA version 0.2.0 was used. The R package tictoc was used to record elapsed time.Downsampling comparison on open chromatinSince pooling cells across samples before calling open tiles is a common approach, we benchmarked all three methods on the same randomly selected cell subsets ranging from 5 cells to the full set in the COVID19 dataset (n = 91). For this comparison, we utilized the same three cell types across the same three datasets. For each cell type and dataset, we used the following procedures. For MOCHA:

Generate a coverage object using predefined 500 base-pair tiles on all pooled cells (e.g., CD16 monocytes).

Predict open tiles on the pooled cells.

Count the total number of open tiles, the number of open tiles overlapping with CTCF sites, and the number of open tiles overlapping with TSSs.

Repeat (1–3) for all pre-specified downsampled cell counts.

For MACS2 and HOMER: 1. Generate a coverage file on all pooled cells (e.g., CD16 monocytes), 2. Call peaks on the pooled cells. 3. Convert the peak regions onto the pre-defined MOCHA tiles. 4. Remove tiles that extend onto empty regions. 5. Count the total number of open tiles, the number of open tiles overlapping with CTCF sites, and the number of open tiles overlapping with TSSs. 6. Repeated (1–5) for the pre-specified downsampled cell counts.Simulating fragments and open chromatin from scATAC-seqTo benchmark HOMER, MACS2, and MOCHA, we designed a simulation study to determine the sensitivity, specificity, and accuracy of each method. The simulated scATAC-seq data is constructed in 6 steps as follows:Step 1: Defining open and closed tiles as the ground truth. We first splitted each chromosome into 500 bp tiles. We then assigned roughly 200,000 open tiles across the genome, with the number of open tiles per chromosome being proportional to the chromosome’s length. Afterwards, we distributed the open tiles along each chromosome with a sinusoidal probability (negative values set to 0) with a periodicity of 10 tiles, to mimic periodic appearances of open regions followed by large closed regions. All other tiles were defined to be closed. We treated these open and closed tiles as the ground truth in this simulation analysis. In this step, each sampling draws a different peakset and thus mimics a cell type with its own peakset.Step 2: Simulating the number of fragments in individual cells. Each simulated cell was randomly allocated a mean of 4k fragments, following a Poisson distribution ($\lambda$ = 4000). To simulate our quality control (QC) thresholds for sequencing depth, we excluded cells with fewer than 1000 fragments. The process was repeated until a target number of simulated cells was reached.Step 3: Assigning fragments to open and closed tiles within each cell. Simulating a FRIP (fraction of reads in peaks) score of 95%, a random number of fragments were assigned to the open tiles using a Poisson distribution ($\lambda$ = 0.95*fragment number) while the rest fragments were randomly assigned to the closed tiles with equal weight. For the open tiles, we first simulated their relative weight using a Beta distribution (1,3) and then allocated the corresponding fragments to individual open tiles according to their relative weights. These weights were set to simulate real data of many low accessible tiles and few highly accessible ones.Step 4: Positioning fragments in individual tiles. To determine the precise location of fragments assigned to open tiles, we first assigned the midpoint of each fragment to the center of the corresponding tile and then used a Poisson distribution ($\lambda$=5) to add a random offset from the center. The direction of the offset (left or right) was randomly generated with a Bernoulli distribution (p = 0.5). For fragments assigned to closed tiles, their position was randomly distributed across ‘closed’ tiles.Step 5: Determining the length of individual fragments. We simulated the length of individual fragments following a Poisson mixture distribution so that approximately 90% of fragments have a mean length of 75 bp, and 10% with a mean length of 200 bp.Step 6: Assembling simulated scATAC-seq data. All simulated fragments from all simulated cells were treated as a scATAC-seq sample, saved in a single fragment file, and converted into a normalized bedgraph file for peak calling.Using the above process, we ran two sets of simulations: 1) varying the number of cells to model cell types of different abundances, and 2) varying the total number of open regions to mimic cell types of different open tiles. For the first set, we simulated samples with a range of cell counts: 75, 100, 150, 200, 250, 500, 1000, 1500, 2000, 3000, and 5000 cells. Tiles having no fragments were removed from individual samples, resulting in different ground truth tiles per sample. This process was repeated 10 times for each cell count. The pairwise overlap rate between the 10 peaksets sampled in Step 1 ranged between 8.4% to 8.6%. 110 simulated samples were generated.For the second set of simulations, we fixed the number of cells (250) but varied the number of fragments per cell (1k, 1.5k, 2k, 2.5k, 3k, 3.5k, 4k, 4.5k) and the location and number of open regions (150k, 250k, 350k). The three peaksets sampled in Step 1 overlapped at rates between 6.3% to 14.8%. For each peakset, 10 samples were simulated at each value for fragments per cell, generating a total of 240 simulated samples. This enables the simulation of different cell types (total open regions and/or fragment variations due to ploidy variation) and technical influences from sequencing depth (fragments per cells). Based on existing vignettes and documentation, we note different tools apply different QC cutoffs to filter out low quality cells: ArchR=1k, Signac=1k, SnapATAC=3k. To cover a wide range of conditions, we simulated fragments per cell from 4.5k (our internal datasets) down to 1k, where ArchR and Signac would retain approximately 50% of cells.We then ran all three methods (MOCHA, MACS2, & HOMER) to identify open tiles, in the following manner. For MACS2 and HOMER, we called peaks and converted the open peakset into 500 bp tiles for direct comparison across methods, and removed tiles that minimally overlap the peakset (<75 bp overlap) to avoid tiling artifacts. Finally, we compared the open tiles from each method with the ground truth using previously described model evaluation metrics.Assessment on zero-inflation in pseudobulk scATAC-seq DataTo assess ZI in pseudobulk scATAC-seq data, we fitted a negative binomial (NB) distribution on pseudobulked accessibility λj(1) at each open tile. We then tested whether λj(1) was ZI using the DHARMa R package, which utilizes the model fits from the glmmTMB R package. More specifically, the test compared the observed number of zeros with that expected from a NB distribution: An estimate of >1 means that there are more zeros than expected by a NB model and a p < 0.05 means that the observed and the expected zeros in the data is significantly different. To be comprehensive, we ran the test using two standard parameterizations of the NB family, as implemented in the glmmTMB package: The variance grows either linearly (NB1) or quadratically (NB2) with the mean. The λj(1) was ZI if p < 0.05 and statistic > 1, or underinflated if p < 0.05 and statistic <1.Differential accessibility analysisMOCHA’s zero-inflated method for DAAMOCHA identifies differential accessibility tiles (DATs) in a targeted cell type between sample groups A and B in three steps:First, similar to others10,11, MOCHA prioritizes tiles for testing using heuristic functions to calculate two data-driven thresholds. MOCHA transforms the total fragment count {λ(1)} in the corresponding TSAM to log2(λ(1) + 1) and fits a mixture model of two normal distributions on all log2(λ(1) + 1) values in the TSAM (Supplementary Fig. 2g). This bimodal model provides a heuristic threshold to prioritize high-signal tiles. From there, we used the TSAM metadata to identify any differences in sequencing depth by comparing the median number of fragments per sample between groups. This analysis informs the ZI threshold. Given our initial observations of a 25% difference in fragment counts, we set a 50% threshold (2X the observed sequencing depth difference) to control for technical artifacts. Tiles that do not pass either threshold are assigned a DAT P value of NA, and those passing thresholds are then tested for differential accessibility.Second, MOCHA tests for differential accessibility as follows. Denote the percentages of zeroes among samples of the two groups as ρA and ρB and the corresponding medians of non-zero log2(λj(1) + 1) values as μA and μB. MOCHA then tests whether a tile is a DAT based on the following hypothesis testing:Null hypothesis (${H}_{0}$): ρA = ρB and μA = μBAlternative hypothesis (${H}_{1}$): ρA ≠ ρB or μA ≠ μB.MOCHA uses the two-part Wilcoxon (TP-W) test35 to combine results from the binomial test on ρA and ρB with results from the Wilcoxon rank-sum test on μA and μB. Since each test statistic can be transformed to follow a ${\chi }_{1}^{2}$ distribution (i.e., ${\chi }_{1,\rho }^{2}$ and ${\chi }_{1,\mu }^{2}$), MOCHA combines them into a single test statistic, i.e., ${\chi }_{2}^{2}\,$= ${\chi }_{1,\rho }^{2}$ + ${\chi }_{1,\mu }^{2}$, and consequently evaluates from it a single P value34. In the absence of zeros, the TP-W test mathematically reduces to the standard Wilcoxon rank-sum test.Finally, to control for multiple testing, MOCHA calculates a modified q-value to adjust for False Discovery109 for each tile and uses a default threshold of 0.2 to identify DATs. Since the P values are inflated near 1 (see Supplementary Fig. 10a), the background is estimated from P ≤ 0.95 only.In addition, MOCHA uses the Hodges-Lehmann estimator110 to estimate log2(fold change) on chromatin accessibility between the two sample groups. More specifically, MOCHA first calculates the difference between each sample pair (one sample each from group A or B) having non-zero log2(λ(1) + 1) values and then takes the median from all paired differences as an estimate for log2(fold change) between the two sample groups.Benchmarking MOCHA with ArchR, Signac, DESeq2, and DiffChipL on DAAArchR’s, Signac’s, DESeq2’s, and DiffChipL’s DA modules were each run on a single cell count matrix generated from the same tile set (215,649 tiles) as the COVID19X CD16 monocytes TSAM. For ArchR, default settings were used, except we modified maxCells to include all cells (n = 24,744). For Signac, we lowered the minimum percent detection (pct = 0.001), and the log2FC threshold (logfc.threshold = 0.05) in order to test the full tileset, thus enabling a full head-to-head comparison. As a close analog of Signac’s tutorial, we also set latent.vars to ‘nFrags’ to adjust for sequencing depth. For DESeq2 and DiffChipL, the single cell matrix was pseudobulked by sample to generate raw counts, as required for DESeq2’s and DiffChipL’s workflow before running differentials. Default settings for normalization and differentials were used, in line with both method’s tutorials.To assess the false positive rate (FPR, 1 – specificity) of the methods, we conducted 50 permutation tests in which labels for COVID+ and COVID- samples were randomized and significant DATs were identified as in the real data in each test. All DATs thus identified were considered as false positives (FPs) and the FPR was evaluated as the ratio of the number of FPs to the number of the original DATs from the real data.Due to a lack of “ground truth”, it is not possible to accurately evaluate the recall (sensitivity) of the methods. Nevertheless, we estimated a lower bound of the recall by downsampling the samples from n = 39 to n = 38, 37, …, 30. More specifically, we randomly selected the targeted number of samples, identified the DATs, and estimated the recall as the ratio of the number of the new DATs to the number of the original DATs. We repeated the process 15 times for each targeted sample size. To save computing time, we limited the analysis to tiles in chromosome 4 only. Some of the false negatives may be specific to removed samples, due to disease/human heterogeneity. The loss of power due to sample size reduction also increases the number of false negatives. Thus the recall evaluated in the approach is likely a lower bound.Power analysis when downsamplingWe conducted a power analysis to illustrate the theoretical power loss when reducing the sample size from n = 39 to n = 30. We first calculated the statistical power of detecting differences with a moderate effect size (d = 0.5) under an $\alpha$=0.05 using a 2-sample t-test (COVID+, n = 17; COVID-, n = 22). We then downsampled from n = 39 to n = 30 and calculated the power with d = 0.5 and $\alpha$=0.05 while maintaining the COVID+/COVID- sample ratio as close to 17/22 as possible. Finally, we divided the statistical power at each n, with the power observed at n = 39, to calculate relative power loss due to downsampling.Assessing discriminative power per methodWe randomly subsampled 50 DATs from the output of each method, ran K-means clustering (K = 2), and generated the following confusion matrix to summarize the predictions. COVID+COVID− Cluster 1aba + bCluster 2cdc + d a + cb + da + b + c + dWe then calculated Holley’s50 $G=\frac{(a+d)-(b+c)}{a+b+c+d}$ to assess how well the 50 randomly selected DATs in separating COVID+ and COVID- samples. We used |G| for the comparison since it is irrelevant which cluster is enriched for COVID+ samples. We repeated this process 1,000 times to obtain a distribution for each method.DAA runtime comparisonTo evaluate each method’s speed in DAA, we started by testing all 215,649 tiles in the CD16 monocytes TSAM, and gradually decreased the number of tested tiles. At each downsample, we tracked the run time required to identify DATs within those randomly selected tiles. The tictoc R package was used to calculate runtime.Co-accessibility analysisMOCHA’s zero-inflated method for CAAMOCHA applies ZI Spearman correlation36,37 to evaluate the co-accessibility of two tiles (e.g., X and Y) across either cell types or samples based on the corresponding log2(λ(1) + 1) values. More specifically, the method first calculates the standard Spearman correlation on (X,Y) pairs of non-zero data (i.e., X > 0 and Y > 0), denoted as ${\rho }_{S,11}$, and then adjusts it in the presence of zeroes in either tile as follows:$${\rho }_{S}^{*}={p}_{11}{p}_{+1}{p}_{1+}{\rho}_{S,11}+3({p}_{00}{p}_{11}-{p}_{10}{p}_{01}),$$
(7)
where$${p}_{00}={\rm{P}}({\bf{X}}=0,{\bf{Y}}=0),$$$${p}_{10}={\rm{P}}({\bf{X}} \, > \, 0,{\bf{Y}}=0),$$$${p}_{01}={\rm{P}}\left({\bf{X}}=0,{\bf{Y}} \, > \, 0\right),$$$${p}_{11}={\rm{P}}({\bf{X}} \, > \, 0,{\bf{Y}} \, > \, 0),$$$${p}_{+1}={p}_{01}+{p}_{11},$$$${p}_{1+}={p}_{10}+{p}_{11},$$which quantify how zeros are distributed among the two tiles across all data points with ${p}_{00}+{p}_{10}+{p}_{01}+{p}_{11}=1$. In the absence of zeros, the ZI-Spearman correlation reduces to the standard Spearman correlation, i.e., ${\rho }_{S}^{*}={\rho }_{S,11}$. MOCHA makes two modifications to an R implementation of the method28: 1) The Spearman correlation (${\rho }_{S,11}$) is calculated in C language for optimal computing time and 2) undefined ZI Spearman correlations (when ${\rho }_{S,11}$ cannot be calculated) are assigned to NA rather than replacing them with the standard Spearman correlations with zeros treated as normal data.Benchmarking inter-cell-type co-accessibilityWe used a previously published promoter-capture HiC (pcHiC) resource54 which identified promoter-enhancer regulatory links. From there, we used the liftOver R package, version 1.22.0, and the Hg19 to Hg38 conversion file (hg19ToHg38.over.chain, https://hgdownload.soe.ucsc.edu/gbdb/hg19/liftOver/) to convert promoter/enhancer loci from HG19 to HG38. The large HiC windows for Promoters and enhancers were then tiled into 500 bp windows to generate all potential promoter-enhancer tile (PET) pairs. We only kept PET pairs when both tiles were identified as accessible by MOCHA in naive CD4+ and CD8+ T Cells and had pcHiC evidence supporting their interaction specifically in naive CD4+ and CD8+ T cells. We used this final list of 1.2 million PET pairs as a foreground set of regions that are enriched for biologically relevant interactions within these cell types. We then generated a background set of randomly selected pairs of tiles to generate an empirical null distribution. We calculated the Spearman and ZI-spearman correlations across 17 cell types to identify pairs of regions that are co-accessible in T cells vs other cell types (17 cell types include B intermediate, B memory, B naïve, CD14 Mono, CD16 Mono, CD4 Effector, CD4 Naïve, CD8 Effector, CD8 Naïve, DC, HSPC, MAIT, NK, NK Proliferating, NK_CD56bright, OtherT, and Treg). We then used the Kolmogorov–Smirnov (KS) distance to quantify how well the two correlation methods separated the PET pairs from the random pairs. We then calculated an empirical P value for each PET pair based on the foreground and the empirical null distributions, and then accounted for multiple comparisons (Benjamini-Hochberg method). A PET pair was considered as significant if FDR < 0.1.Pathway enrichment analysisPathway enrichment analysis was mostly restricted to the Reactome pathway database. All genes within the database reference of TxDb.Hsapiens.UCSC.hg38.refGene from Bioconductor111 were selected as the background. Over-representation analysis was performed using WebGestaltR112. We annotated enriched pathways at the highest level within the Reactome’s database hierarchy. Lower level annotations on immune system pathways were provided to discern adaptive, innate, and general signaling pathways. Using WebGestaltR, pathway enrichment analysis was performed once on Wikipathways, Gene Ontology (Biological Processes, Non-redundant), and KEGG for illustrative purposes.Linkage disequilibrium score regression analysisLinkage Disequilibrium Analysis was performed using the codebase from https://github.com/bulik/ldsc and reference files from https://zenodo.org/records/8292725. GWAS summary statistics were pulled from EMBL’s GWAS catalog for studies GCST90029015113 (Autoimmune disease), GCST90038603114 (Immune Aging), and GCST90043674115 (Abnormal Immune Response). Using the LDSC scripts, munge_sumstats.py was used to generate summary statistics files for heritability analysis. Peak calls on 9 cell types (across 3 datasets) by all 3 methods (total of 27 peaksets) from Fig. 2 were split up by chromosome for computing efficiency and the make_annot.py was used to generate an annotation file for each before modeling (Source Data for Fig. 2). LDSC was then run (–l2 flag) on each bed file of peak calls using reference files from above, generating a file that could be used along with the reference baseline LD scores for heritability analysis on each GWAS summary file. Overlap between methods was then counted, when the same annotation was enriched in different method’s peaksets from the same cell type, in the same dataset, and using the same GWAS study’s results.Identification of alternatively regulated transcription start sitesWe extracted all TSSs from the Transcript database TxDb.Hsapiens.UCSC.hg38.refGene found on BioConductor111, and then expanded them upstream by 125 bp to account for TSSs falling very close to a tile boundary. We filtered out genes with only one TSS. If alternative TSSs of the same gene occurred within a user-defined neighborhood (default: 150 bp) of each other, we collapsed them into a single TSS. We then found the intersection between alternative TSSs and the 6211 DATs between COVID+ and COVID- samples in CD16 monocytes. TSSs that landed on a DAT were assigned with the FDR (q-value) of the corresponding DATs. We categorized alternatively regulated genes (ARGs) as

Type I: A gene had a subset of TSSs showing differential accessibility (FDR < 0.2) in the same direction and another subset being open but not differential.

Type II: A gene had at least two TSSs showing differential accessibility (FDR < 0.2) but in opposite directions.

Motif enrichment analysis on alternatively regulated TSSs and associated regulatory regionsMotif matching was done using the motifmatchr package and the CISBP motif database, as provided by the chromVARmotif package (https://github.com/GreenleafLab/chromVARmotifs). MOCHA uses a standard hypergeometric test to identify enriched motifs, with a user-provided foreground and background tile sets. For multi-testing corrections, the resulting p-values were converted into FDRs (Benjamini–Hochberg method). To understand the upstream signaling mechanisms regulating ARGs, we first applied MOCHA to identify tiles that were within ±1 M bp of and also co-accessible (inter-sample, ZI-Spearman correlation > 0.5) with the corresponding DATs. These DATs and their co-accessible tiles were selected as the foreground tile set. For the background tile set, we chose all tiles with TSSs and their co-accessible tiles that did not overlap with the foreground set. These foreground and background tile sets were used to calculate CISBP motif enrichment regulating ARGs.Ligand-motif set enrichment analysisThe NicheNet56 database has identified links between upstream ligands and downstream transcription factors (TFs) that regulate gene expression56. Using the same principle as pathway enrichment analysis, we designed a Ligand-Motif Set Enrichment Analysis (LMSEA) framework to capture potential drivers of our observed motifs (i.e., ligands regulate the TFs in our dataset). Specifically, LMSEA tests whether motifs linked to a ligand of interest are significantly (using hypergeometric test) over-represented in our observed motifs relative to the ligand’s motif set within NicheNet. The Benjamini and Hochberg (BH) procedure was used to adjust P values for multiple comparisons. An FDR value < 0.05 was considered significant.Construction of ligand-transcription factor-gene networkWe constructed ligand-TF-gene networks and visualized them using Cytoscape116. The nodes were ARGs, enriched motifs (TFs), and enriched ligands. Edges were drawn as follows: a motif-gene link was created if an enriched TF was found within the TSS-containing DATs of an ARG or their co-accessible tiles, a ligand-motif link was drawn if a ligand was known to interact with a TF in NicheNet’s ligand-transcription matrix.Longitudinal analysis of COVID-19 response at single-cell levelGrouping COVID+ samples by infection stageCOVID+ samples (n = 69) in the COVID19 dataset were grouped by the corresponding infection stage, including early infection (1–15 days PSO, n = 21), late infection (16–30 days PSO, n = 13), and recovery ( > 30 days PSO, n = 35).Generation of density UMAPWe extracted sample-specific open tiles on CD16 monocytes for all samples in the full COVID19 dataset (n = 91). From there, we generated a TSAM by aggregating all tiles that were called in at least 20% of samples at any infection stage or uninfected. We extracted the tiles from the resulting TSAM and added them to the original ArchR project via addPeakSet. We then generated a single-cell peak matrix from this tile set, using addPeakMatrix, and used it as input for ArchR’s iterative LSI and UMAP functions. The LSI was run with default parameters, except for the number of iterations (5 instead of 2). The UMAP was run on standard ArchR settings11 on the resulting iterative LSI object. Based on the resulting single-cell UMAPs, we generated a density plot for each infection stage or uninfected.Pseudotime trajectory analysisWe used ArchR’s standard Monocle3 pipeline to conduct a trajectory analysis. We instructed Monocle to construct a trajectory from cells belonging to samples in the order of early infection, late infection, recovery, and uninfected. The resulting trajectory was overlaid on the single-cell UMAP. Following the above trajectory, three distinct pseudotime heatmaps were generated using ArchR’s standard protocol and the following input single-cell matrices: log2-normalized GeneScores, peak (tile) accessibility, and ChromVAR z-scores. Using ArchR’s functions with default settings, we further extracted pseudotime-changing elements for each of the three matrices.Longitudinal analysis of COVID-19 response at pseudo-bulk levelLongitudinal analysis of motif usageWe modeled longitudinal motif usage using pseudobulk ChromVar motif z-scores. We converted the TSAM of CD16 monocytes from the COVID19L dataset (n = 69) into a ChromVAR-compatible object, and then ran ChromVAR on the TSAM-derived object to generate sample-level motif z-scores. We then modeled motif usage with the following generalized linear mixed effect model (GLMM):$${lmer}(z \sim {Age}+{Sex}+{x}_{i}+{{x}_{i}}^{2}+(1{\rm{|}}{Subject}),\, {data}={mydata})$$
(8)
where ${x}_{i}$ was the centered days PSO (i.e., days PSO of individual samples minus the mean days PSO of all samples)117 and $(1{|Subject})$ indicated that random intercepts were used for individual participants. The P values associated with the linear ${x}_{i}$ terms were extracted using the lmerTest package. We converted the P values to FDRs (Benjamini-Hochberg method) to control multiple testing. Motifs with a FDR < 0.1 were considered as significantly changing in time.Transcription factor networkThe activator protein-1 (AP-1) family network was obtained by subsetting the APID protein-protein interaction database118 down to just the significant AP1-family TFs. Edges between nodes were included if they were supported by at least four experiments. The nodes were color-coded using the signs of the corresponding coefficient of ${x}_{i}$. The network was drawn using Cytoscape116.Longitudinal analysis of gene promoter accessibilityWe collected promoter tiles from the TSAM of CD16 monocytes in the COVID19L dataset and modeled their accessibility using either GLMMs or ZI-GLMMs with the glmmTMB package33. More specifically, for promoters with zeroes, we applied the ZI-GLMM modeling as follows$$ {glmmTMB}\left({Log}\_{Acc} \sim {Age}+{Sex}+{Time}+(1|Subject),\right.\\ {zi}=\sim 0+{CellCounts},\\ \left. {data}={mydata},\,{family}={gaussian}()\right),$$
(9)
where ${Log\_Acc}$ was short for log2(λj(1) + 1), ${Time}$ was days PSO, $(1{|Subject})$ indicated that random intercepts were used for individual participants, and ZI was modeled as a function of the total cell counts in individual samples with no intercept. For promoters without zeroes, we applied the GLMM modeling as follows$$ {glmmTMB}\left({Log}\_{Acc} \sim {Age}+{Sex}+{Time}+(1|Subject),\right.\\ \left. {zi}=\sim 0,\,{data}={mydata},\,{family}={gaussian}()\right),$$
(10)
where the ZI component was omitted. The P values associated with ${Time}$ were extracted and converted to FDRs (Benjamini-Hochberg method) to control multiple testing. Promoters with a FDR < 0.1 were considered as significantly changing in time. For promoters attributed to multiple genes, all genes were included for pathway enrichment and downstream analyses.Transcription factor and gene promoter associationsLinking transcription factor to gene promotersWe evaluated whether motif z-scores were statistically associated with gene promoter accessibility via ZI-GLMMs as follows$$ {glmmTMB} \left( {Log}\_{Acc} \sim z,{zi}=\sim z,\,{data}={mydata},\right. \\ \left. {family}={gaussian}()\right),$$
(11)
where ${Log\_Acc}$ was short for log2(λj(1)+1). All pairs of significantly changing TFs and significantly changing gene promoters were evaluated. We considered a TF and a gene promoter to be associated if the continuous coefficient of $z$ was statistically significant (P < 0.05) without adjusting for multiple testing.Linking transcription factor to innate immune pathwaysUsing the TF-gene promoter associations, we calculated the percentage of significant genes in an innate immune pathway being associated with a TF. For visualization purposes, the network only displayed an edge between a TF and a pathway if more than 33% of significant genes in that pathway were associated with the TF.Variance decomposition on longitudinal COVID19 datasetTo identify significant covariates for modeling accessibility, we ran variance decomposition on highly accessible tiles (defined as having at least 80% non-zero values within any infection stage). We modeled Age, Sex, time (i.e., days since symptom onset), Batch, and Donor as random effects using the following formula:$${lmer} \left(\right.{Log} \_{Acc} \sim (1{\rm{|}}{Age})+(1{\rm{|}}{Sex})+(1{\rm{|}}{time})+(1{\rm{|}}{Batch})+(1{\rm{|}}{Donor}),\\ {\rm{data}}={\rm{my}}\; {\rm{data}},\\ {family}={gaussian}(),\\ \left. {\rm{REML}}={\rm{T}}\right)$$and then we used the relative variances to identify the most influential factor for each tile.Modeling zero-inflated associations across covariatesTo identify factors associated with high presence of zeroes in scATAC pseudo bulk data, we ran the following zero-inflated models$$ {glmmTMB}\left({Log}\_{Acc} \sim {Age}+{Sex}+{time}+(1{{|}}{PTID}),\right.\\ {zi}=\sim 0+{CellCounts}+{Age}+{Sex},\,{data}={mydata},\\ {family}={gaussian}(),\\ \left.{\rm{REML}}={\rm{T}}\right)$$on all promoter tiles to test whether zeroes were associated with Cell Counts, Age, and Sex. We extracted the p-values from the glmmTMB and summarized the associations using the histograms of the p-values.Statistics and reproducibilityThe datasets used in this manuscript were selected for benchmarking purposes, and therefore there is no need for sample-size calculation, randomization, or blinding. No samples were excluded from the analyses. Any computational data cleaning and processing are described accordingly.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

MOCHA’s advanced statistical modeling of scATAC-seq data enables functional genomic inference in large human cohorts

Learning long sequences in spiking neural networks

Friday links: evidence vs. scientific reforms, p-values vs. speed limits, Delft Daphnia, and more

Scale Biosciences and partner CZI to propel RNA sequencing innovation in ‘100 Million Cell Challenge’

Delineating cell types with transcriptional kinetics

Cornell researchers develop machine-learning diagnostic models that use cell-free molecular RNA

Hot Topics

Learning long sequences in spiking neural networks

Friday links: evidence vs. scientific reforms, p-values vs. speed limits, Delft Daphnia, and more

Scale Biosciences and partner CZI to propel RNA sequencing innovation in ‘100 Million Cell Challenge’

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Learning long sequences in spiking neural networks

Friday links: evidence vs. scientific reforms, p-values vs. speed limits, Delft Daphnia, and more

Scale Biosciences and partner CZI to propel RNA sequencing innovation in ‘100 Million Cell Challenge’

Delineating cell types with transcriptional kinetics

Popular Articles

Learning long sequences in spiking neural networks

Friday links: evidence vs. scientific reforms, p-values vs. speed limits, Delft Daphnia, and more

Scale Biosciences and partner CZI to propel RNA sequencing innovation in ‘100 Million Cell Challenge’