ChromaFold predicts the 3D contact map from single-cell chromatin accessibility

Ethics statementGeneration of human ESC scATAC-seq data: Experiments were conducted per National Institute of Health (NIH) guidelines and approved by the Tri-SCI Embryonic Stem Cell Research Oversight Committee. Generation of mouse regulatory T cell Hi-C data: Animals were housed at the Memorial Sloan Kettering Cancer Center (MSKCC) animal facility under specific pathogen-free (SPF) conditions on a 12-h light/dark cycle. All studies were performed under protocol 08–10-023, approved by the MSKCC Institutional Animal Care and Use Committee. Generation of mouse germinal center B cell scATAC-seq data: The experimental procedures involving animals were executed in stringent accordance with the institutional guidelines delineated by Weill Cornell Medicine, as per the Guide for the Care and Use of Laboratory Animals, and standards established by the Association for Assessment and Accreditation of Laboratory Animal Care International. The Research Animal Resource Center, the Institutional Animal Care and Use Committee of Weill Cornell Medicine and Cornell Institutional Animal Care and Use Committee, having vetted all procedures, duly approved the entirety of the study involving mice under protocols #2011-0031 and #2017-0035. Generation of mouse hematopoietic stem cell scATAC-seq: All animal studies were performed on animal protocol #11-10-025 approved by the Institutional Animal Care and Use Committee (IACUC) at Memorial Sloan Kettering Cancer Center.Preprocessing of Hi-C and Micro-C dataWe used nine human and three mouse datasets (Supp. Table 1). For datasets provided in this study and those where a processed.hic file is not available online, Hi-C FASTQ files were aligned to hg38, hg19, or mm10 genomes, and reads that are duplicates or invalid ligation products were filtered out using the HiC-Pro43 pipeline (v3.1.0) with default settings. Hi-C contact matrices were binned at 10 kb resolution and normalized using the following approaches. ICE-normalized contact maps were calculated using the HiCExplorer44 package. The counts were log2 normalized using a pseudocount of 1. Z-score normalization was calculated by the HiC-DC+26 package. Specifically, HiC-DC+ models observed raw counts for interaction bins using negative binomial regression to estimate the expected count based on genomic distance, GC content, mappability, and effective bin size based on RE sites in the corresponding pair of genomic intervals.Preprocessing of scATAC-seq dataFor datasets provided in this study and those where the processed scATAC-seq fragment file was not available online, scATAC-seq FASTQ files were aligned to hg38, hg19, or mm10 and counted by Cell Ranger ATAC v1.2.045 with default parameters. Arrow files were created from the scATAC-seq fragments using ArchR v1.0.146. Specifically, we binarized sparse accessibility matrices binned into 500 bp tiles across the genome. Cells with fewer than 1000 fragments and TSS <4 were filtered out. Latent Semantic Indexing (LSI) was performed on the 25,000 top variable tiles identified after two iterations of “IterativeLSI” by ArchR. Tiles from non-standard chromosomes, chrM and chrY, were not included. Cells were clustered (method=Seurat, k.param = 30, resolution = 1) and visualized with UMAP47 (nNeighbors = 30) using 30 LSI components. For datasets with multiple cell types, we annotated and extracted the cell type of interest by computing the mean gene score of marker genes per cluster. This was cross-checked with cell type annotations provided by the original sources, if available.Peak callingFor peak calling of the scATAC-seq data, filtered fragments for cells in each dataset/cell population were aggregated and used as input to the MACS248 peak caller (parameters -f BED, -g 2.7e9, -no-model, -shift −75, -extsize 150, -q 0.05). Peaks were filtered using an IDR49 cutoff of 0.05. Peaks within 500 bp of each other were merged. A peak-by-cell count matrix was then created by ArchR.Bulk ATAC-seq data processingBulk ATAC-seq data were obtained from ENCODE50 in the form of bam files. Bam files from replicates were merged using samtools51, binned at 1 bp resolution for C.Origami, and RPKM normalized using the bamCoverage function in deepTools52 to generate bigwig files.CTCF ChIP-seq and motif score data processingWe obtained the CTCF motif scores from the CTCF R package27, an AnnotationHub resource that represents genomic coordinates of FIMO-predicted CTCF binding sites for human and mouse genomes. Specifically, CTCF motif scores were generated by scanning for all three JASPAR28 CTCF PWMs in genomic DNA sequence using FIMO25. CTCF ChIP-seq data were obtained from ENCODE in the form of bam files. Bam files from replicates were merged using samtools, binned at 50 bp resolution for ChromaFold and 1 bp resolution for C.Origami, and RPKM normalized using the bamCoverage function in deepTools to generate bigwig files. The log2 fold change from the control ChIP-seq in the corresponding cell types were computed using the bigwigCompare function in deepTools.ChromaFold input data processingChromaFold takes three inputs: pseudobulk chromatin accessibility, co-accessibility profiles across cells, and predicted CTCF motif score/CTCF ChIP-seq. The pseudobulk chromatin accessibility is obtained by aggregating the accessibility profile across single cells in a population binned at 50 bp, library-size normalizing, and log transforming with a pseudocount of 1. The co-accessibility is derived by first generating metacells to combat sparsity in scATAC-seq datasets, then calculating the Jaccard similarity between binarized accessibility profiles across metacells, binned at 500 bp. Metacells are generated using the same algorithm used by Cicero18. Specifically, to generate the co-accessibility input corresponding to the V-stripe region, we directly compute the co-accessiblity between the 500 bp genomic bins in the center 20 kb region of the input window with all 500 bp genomic bins flanking the center 10 kb region. The CTCF motif score for each 50 bp bin in the genome is defined as the maximum score assigned to any genomic region that overlaps at least 10 bp with the 50 bp bin.ChromaFold model architectureThe ChromaFold model consists of two feature extractors and a linear predictor module. The first feature extractor takes the pseudobulk accessibility and the CTCF motif score or ChIP-seq signal as two channels. This feature extractor consists of fifteen 1D convolutional layers followed by batch normalization and ReLU activation. Next, we perform outer-concatenation where the model transforms the resulting L × C matrix, where L is the length of the output vector and C is the number of channels, into a L × L × 2 C by performing point-wise concatenation of the output features. This operation allows the information from pairs of genomic bins to be joined together. We implement a skip connection with the input layer by average-pooling the input and transforming it into a 3D tensor via outer concatenation. After concatenation, the data is passed through three 2D convolutional layers followed by a linear layer to consolidate the extracted features, producing a latent representation of the two input tracks.The second feature extractor takes the co-accessibility data as input. For memory efficiency, we only compute the co-accessibility between the bins in the center 10 kb region with the rest of the bins in the 4.01 Mb region as input. We use three 1D convolutional layers followed by two residual blocks and three additional 1D convolutional layers. Finally, a linear layer consolidates the extracted features and produces a latent representation of the co-accessibility input. These latent representations of the genomic region are concatenated and passed through a final linear layer to predict the contact between genomic bin t and its neighboring bins within a 2 Mb distance, which corresponds to a V-shaped stripe (V-stripe) in the contact map centered at t.ChromaFold model trainingWe trained ChromaFold using data pooled from three cell types, IMR-90, GM12878, and HUVEC. Chromosomes 3 and 15 were used for validation, chromosomes 5, 18, 20, 21 were held out for testing and evaluating model performance, and the rest were used for training. For each V-stripe prediction centered at genomic bin t, the input is the 4.01 Mb region centered at t. During training, we randomly subsampled 500–5000 single cells and 400–1000 metacells from the population per iteration for pseudobulk accessibility and co-accessibility computation, respectively. This data-augmentation step was critical for improving model generalizability to datasets of varying quality and size. We injected additional variation into the input by randomly shifting by −50 or 50 bp. Since neither our input nor output data contain directionality information, we further reduce redundancies in our model by predicting only one side of the V-stripe, and we simply reversed the input to predict the other side (shared model weights). To improve model stability, we used a two-step approach and first train ChromaFold’s feature extractor 1 to predict the target contact map by appending a dummy linear predictor at the end. After convergence, we froze the weights for this part of the network while training feature extractor 2 and the final linear module. Genomic regions with low mappability were masked from training based on the total signal for each bin in the contact map. We took the training window to start and end 4 and 5 Mb after the chromosome starting location and before the ending location, respectively, to create buffer regions since ChromaFold requires 4.01 Mb windows as inputs. The prediction target is the HiC-DC+ normalized Z-score, with outlier target values clipped to lie between −16 and 16 to avoid training bias. We optimized the MSE loss using stochastic gradient descent. We trained the model for 30 epochs and implemented early stopping with a patience of 10 epochs, the learning rate of 1e-6 and weight decay 1e-3. The model was trained on a single NVIDIA Tesla V40 GPU for ~5 h when using one training cell type and ~14 h when using three training cell types.De novo contact map prediction in a new cell typeThe ChromaFold model trained on IMR-90, GM12878, and HUVEC can be directly applied to other cell types and species without retraining. To perform de novo contact map prediction, we supplied scATAC-seq data of the new cell type and predicted CTCF motif scores in the corresponding genome to ChromaFold. If CTCF ChIP-seq data was available for the test cell type, we could alternatively use the ChromaFold + CTCF ChIP-seq model.ChromaFold Hi-C contact map predictionTo generate the complete predicted contact map for each chromosome, we first performed inference and predicted the interaction between each genomic bin t and all its neighboring bins within a 2 Mb distance, producing a V-stripe. Since the input region is 4.01 Mb centered at the bin t, we zero-padded the input vectors if they extended beyond the chromosome edges. We combined the predicted V-stripes and averaged the two predictions for each genomic bin. Contact map prediction for one full chromosome took on average ~1.5 min on a standard GPU like NVIDIA Tesla V40.Distance-stratified correlationTo evaluate the overall performance of genome-wide chromatin contact map prediction, we computed the distance-stratified correlation between the experimental and predicted contact maps. The rationale for distance-stratification is to remove any remaining genomic distance effect and avoid inflating the correlation. Specifically, we computed the Pearson correlation for all interactions with genomic distance d for d from 0 to 2 Mb, for each chromosome. We then used a paired t-test53 to compare the performance between models. In the boxplot visualizations, each point represents the Pearson correlation averaged across genomic distance, per chromosome.Topologically associated domain (TAD) annotationsWe called TADs at 10 kb resolution using TopDom54 (v0.0.2) using w = 30 on normalized Hi-C contact maps and predicted contact maps and used the insulation scores to evaluate ChromaFold’s ability to predict TAD structures.Significant interactionsWe defined significant interactions at the genomic bin level as interactions with the top 10% HiC-DC + Z-scores per chromosome. For each chromosome and at each genomic distance (incrementing by 10 kb), we used AUROC and AUPRC to evaluate how well significant interactions are captured by ChromaFold’s predicted contact map. We used a paired t-test to compare the performance between models. In the boxplot visualizations, each point represents the corresponding metric averaged across genomic distance, per chromosome. To define significant peak-level interactions, we first mapped each peak to the overlapping genomic bin(s) at 10 kb resolution. If a peak overlapped two bins, it was assigned to both. Next, we labeled pairs of peaks as significantly interacting if the corresponding HiC-DC + FDR-corrected p value is less than 0.25. The distance-stratified AUROC and AUPRC were computed in a similar fashion as described above.Benchmarking against CiceroWe used Cicero to calculate co-accessibility for pairs of peaks. The same metacell groupings used for ChromaFold training/inference were used for running Cicero. We then used Cicero to calculate co-accessibility using a window size of 1 Mb and a distance constraint of 500 kb. We evaluated the performance of peak-level significant interaction prediction using Cicero co-accessibility at various cutoffs and compared that using ChromaFold-predicted contact maps. All evaluations of peak-level significant interactions were distance-constrained to 500 kb for comparison with Cicero.Benchmarking against C.OrigamiTo ensure a fair comparison, we re-trained ChromaFold (with CTCF motif score or with CTCF ChIP-seq) and C.Origami on the same cell type, IMR-90, towards HiC-DC+ normalized Hi-C contact maps and used the same chromosomes for training, validation (Chr10) and testing (Chr15) as specified in C.Origami16. The training procedure for ChromaFold was the same as described above, and that for C.Origami was the same as described in the original paper. C.Origami converged after training for 45 epochs. After training, we evaluated the performance of both models on the test chromosome in IMR-90, as well as in three held-out cell types GM12878, K562, and hES. For held-out cell types, we used the IMR-90-trained models but used GM12878/K562/hESC inputs to make de novo contact map predictions. For both models, we merged predictions into a chromosome-wide Hi-C contact map and evaluated the following metrics: (1) distance-stratified Pearson correlation, (2) distance-stratified bin-level significant interaction prediction, and (3) peak-level significant interaction prediction.Deconvolution of chromatin interactions in alpha and beta cells in the pancreatic isletChromaFold can be used for deconvoluting chromatin interactions in complex tissues. Using the scATAC-seq and bulk 3D contact map for pancreatic islet cells, we fine-tuned the pretrained ChromaFold model for 1 epoch on the training chromosomes to better adapt the model predictions to the dataset. We then applied the fine-tuned model to alpha and beta cell populations to achieve deconvolution. Specifically, we extracted the alpha and beta cell clusters from the scATAC-seq to use as input to ChromaFold to generate deconvolved contact map predictions. Next, we used the deconvolved contact maps to generate peak-level interaction predictions as described in the section above. We evaluated the deconvolved chromatin interaction predictions using an independent dataset with Hi-C of sorted human alpha and beta cell populations. For peak-level interaction visualization, we restricted to only interactions involving peaks that lie within 10 Kb of the TSS of the highlighted genes. The overall contact map prediction quality was evaluated using distance-stratified Pearson correlation. Significant bin- and peak-level interaction predictions were evaluated using distance-stratified AUROC and AUPRC.Single-cell ATAC sequencing data collectionHuman embryonic stem cells were harvested for single-cell multiome analysis with a targeted collection of ~7000 cells. Nuclei were isolated with Demonstrated Protocol Nuclei Isolation for Single-Cell Multiome ATAC+Gene Expression Sequencing_RevA. 500 K cells underwent lysis in 500 μl lysis buffer in ice for 3 min, then were subjected to the standard protocol for wash and counting. Single-cell Multiome libraries were generated with the 10x Genomics Chromium Next GEM Single-Cell Multiome ATAC + Gene Expression Kit following the manufacturer’s guidelines. The libraries were sequenced on the NovaSeq 6000 platform following the manufacturer’s guidelines.To collect scATAC-seq data in mouse hematopoietic stem cells (Lin-Kit+ cells), bone marrow cells were harvested from a total of n = 3 C57BL6 wildtype mice and subjected to red blood cell lysis. Bone marrow cells were then incubated with MACS beads (CD117, Miltenyi Biotec, 130-091-224). Then enriched c-Kit+ cells were collected by running AutoMACS (Miltenyi Biotec) according to the manufacturer’s instructions. The cells were then stained with a cocktail: Lineage marker (CD3, CD8, Gr1, B220, CD19, and Ter119)- PE-Cy5 (dilution 1:100), cKit-APC-Cy7 (1:100), and DAPI (1:5000). Live Lin-cKit+ cells were sorted on BD Aria machine. Freshly sorted cells were then resuspended in PBS + 0.04% BSA at around 300 k/250 ul, followed by scATAC-seq protocol.Hi-C data collectionIsolation of murine regulatory T cells was conducted as previously described55. The cell suspension was made from pooled secondary lymphoid organs (spleen; peripheral and mesenteric lymph nodes) of Foxp3-GFP mice56, and CD4 T cells were enriched using the Dynabeads Flowcomp Mouse CD4 Kit (Thermo Fisher, 11461D) according to manufacturer’s instructions. The resulting cells were stained with antibodies, washed extensively, and resuspended in isolation buffer (PBS w/ 2% FBS, 10 mM HEPES buffer, 1% l-glutamine, and 2 mM EDTA) containing 0.01% SYTOX Blue dead cell stain (Thermo Fisher, S34857) to facilitate dead cell exclusion, and sorted on a FACSAria (BD) instrument. Treg cells (TCRβ + CD4 + Foxp3-GFP+) and naïve conventional CD4 T cells (TCRβ + CD4 + Foxp3-GFP−CD44loCD62Lhi) were sorted by gating on the respective populations. Hi-C was performed as previously described57. Briefly, sorted T cell populations (~1 × 105) were cross-linked in 1% formaldehyde for 10 min and quenched in 125 mM glycine. Cross-linked cells were lysed, and chromatin was restriction enzyme digested using restriction enzymes that digest chromatin at ^GATC and G^ANTC, where N can be any of the four genomic bases (Arima Genomics, San Diego, CA). Digested chromatin was reverse cross-linked using NaCl and eluted in 20 uL 2X Shearing buffer (Covaris, Woburn, MA) and fragmented to 350 base pair fragments using a Covaris LE220Rsc sonicator (Covaris, Woburn, MA). Sheared genomic material was biotinylated and enriched using streptavidin beads. Genomic libraries were prepared to streptavidin-bound DNA using Arima protocol modifications for Accel-NGS 2S DNA plus library kit (IDT, Coralville, IA). After end repair and ligation, libraries were quantified using the KAPA library quantification kit (Roche, Indianapolis, IN) and PCR amplified for the number of cycles required to generate >4 nM per library. Hi-C libraries were sequenced on an Illumina NovoSeq at 500 M read depth, and raw sequencing data in the Fastq format were obtained.Germinal center B cell centrocytes and centroblast cells were sorted from the spleens of mice immunized with SRBCs for 8 days. Briefly, single-cell suspensions were stained with antibodies against B220 (BV786, BD 563894), CD95/Fas (BUV805, BD 741968), GL7 (AF647, BD 561529), CXCR4 (PE, BD 561734), and CD86 (PECy7, BioLegend 105014). Centrocytes (Live B220 + CD95/Fas+GL7 + CXCR4–CD86+) and centroblasts (Live B220 + CD95/Fas+GL7 + CXCR4 + CD86–) were FACS sorted. All antibodies were used at 1/500 dilution, except CXCR4 and CD86, which were used at 1/250 dilution in PBS + 2% FBS + 0.5 mM EDTA. DAPI was used at 1 µg/mL for the exclusion of dead cells. Cell sorting was performed in a BD Influx cell sorter in the Weill Cornell Medicine Flow Cytometry Core Facility. Flow-sorted CB and CC were fixed in 2% formaldehyde for 10 min. Fixation was quenched by the addition of 0.125 M glycine for 10 min. In situ Hi-C was performed as described (Rao et al. Cell 2014). Nuclei were permeabilized, and DNA was digested overnight with 100 U DpnII (New England BioLabs). The ends of the restriction fragments were labeled using biotin-14-dATP and ligated in a 1-ml final volume. After reversal of cross-links, ligated DNA was purified and sheared to a length of ~400 bp, at which point ligation junctions were pulled down with streptavidin beads, DNA fragments were repaired, and dA-tailed and Illumina adapters were ligated. The library was produced by 6–10 cycles of PCR amplification. Sequencing (paired-end, 50 bp) was performed in a HiSeq 2500 Illumina sequencer in the Weill Cornell Medicine Epigenomics Core.Statistics and reproducibilityNo statistical method was used to predetermine the sample size. In all cases, we held out chromosomes during training of the ChromaFold model and reported the model’s performance on the previously held-out test chromosomes. Cell-type-specific ChromaFold predictions were performed on pre-clustered cells using scATAC-seq data. Additionally, we conducted a down-sampling analysis and observed robust performance of ChromaFold with as few as 3000 randomly selected test cells.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

ChromaFold predicts the 3D contact map from single-cell chromatin accessibility

As U.S. election looms, this ‘rumor researcher’ tracks—and combats—falsehoods in real time

General-purpose foundation models for increased autonomy in robot-assisted surgery

Secure ancient animal remains from permafrost | Science

Age effects challenge psychiatric research | Science

Britain’s postwar sugar craze confirms harms of sweet diets | Science

Hot Topics

As U.S. election looms, this ‘rumor researcher’ tracks—and combats—falsehoods in real time

General-purpose foundation models for increased autonomy in robot-assisted surgery

Secure ancient animal remains from permafrost | Science

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

As U.S. election looms, this ‘rumor researcher’ tracks—and combats—falsehoods in real time

General-purpose foundation models for increased autonomy in robot-assisted surgery

Secure ancient animal remains from permafrost | Science

Age effects challenge psychiatric research | Science

Popular Articles

As U.S. election looms, this ‘rumor researcher’ tracks—and combats—falsehoods in real time

General-purpose foundation models for increased autonomy in robot-assisted surgery

Secure ancient animal remains from permafrost | Science