Integration of scHi-C and scRNA-seq data defines distinct 3D-regulated and biological-context dependent cell subpopulations

Developing a computational method to integrate scHi-C and scRNA-seq dataTo comprehensively integrate scHi-C and scRNA-seq data, we developed a novel computational method, a multiomic data integration (MUDI) algorithm, to precisely define 3D-regulated cell subpopulations or TISPs (Fig. 1a). We first identified distinct scHi-C clusters from scHi-C data, and scRNA-seq clusters from scRNA-seq data, respectively. We then integrated these two types of clusters by the MUDI algorithm (see Methods: Integration of scHi-C and scRNA-seq data) to precisely define the distinct TISPs (Fig. 1a). Briefly, we first defined topologically conserved associating domains (CADs) representing the conserved 3D chromatin structure of any individual scHi-C cluster. We then integrated CADs with differentially expressed genes (DEGs) of each of scRNA-seq clusters to derive TISPs by implementing an empirical quantitative formula to calculate an integration score of the interaction frequency and the gene expression values. We tested our MUDI on two cell types: pluripotent stem cells WTC11 from 4D Nucleome Project of Bing Ren Lab and breast cancer cells MCF7 generated from this study. From scHi-C data, nine scHi-C clusters (CC1–CC9) were identified with variable relative contact probability (Fig. 1b, c and Supplementary Fig. 1a–c), where CC1/3/5/7 and CC2/4/6/8/9 are majorly composed of WTC11 cells and MCF7 cells, respectively. From scRNA-seq data, ten scRNA-seq clusters (DD1–DD10) were classified with variable fold changes of differentially expressed genes (DEGs) (Fig. 1d, e). DD1/2/4/5/7/8/9 and DD3/6/10 are majorly composed of WTC11 cells and MCF7 cells, respectively. Our MUDI was initially able to identify four TISPs (WMG1-WMG4) with the distinct subpopulation features based on the number (M) of data types (here M = 2) and the number of (N) of cell types (here N = 2), such that WMG1 is the subpopulation with integration of CC1/3/5/7 and DD1/2/4/5/7/8/9, WMG2 is the subpopulation with the integration of CC1/3/5/7 and DD3/6/10, WMG3 is the subpopulation with integration of CC2/4/6/8/9 and DD1/2/4/5/7/8/9, WMG4 is the subpopulation with integration of CC2/4/6/8/9 and DD3/6/10 (Supplementary Fig. 1d, e). More importantly, the MUDI is further designed to be tailored to a biological-context dependent integration, such that the number of TISPs can be optimized according to a particular biologically meaningful factor on individual studies. Since Yamanaka Factors, MYC, POU5F1, SOX2, KLF4, were used to characterize the stem cell differentiation, we were able to obtain 12 distinct TISPs (Fig. 1f and Supplementary Fig. 2a), where one of subpopulations YFG1 was enriched with REACTOME developmental biology signaling pathway (Supplementary Fig. 2b, c), suggesting this subpopulation has high stemness and strong chromatin activities.Fig. 1: Development of a computational method for integrating scHi-C and scRNA-seq data.a Flowchart of Multiomic Data Integration (MUDI) algorithm. DEGs differentially expressed genes, TADs topologically associating domains. b Nine scHi-C clusters (CC1–CC9) identified from scHi-C data of WTC11/MCF7. c Relative contact probability of scHi-C clusters. d Ten scRNA-seq clusters (DD1–DD10) identified from scRNA-seq data of WTC11/MCF7. e Fold changes of DEGs of scRNA-seq clusters. f Integration scores of 12 topologically integrated subpopulations (TISPs), YFG1-12. Values in box plot of (c–f) from big to small are maxima, the 75th percentile, median, the 25th percentile and minima. Source data are provided as a Source Data file.To further demonstrate the sensitivity and robustness of the MUDI, we have first performed a sub-sampling analysis on WTC11 cells and MCF7 cells (Supplementary Fig. 3a). We found that compared to the whole set of 277 cells, it showed no significant difference of the overlapped CADs in each cluster for the subset of 75% (208) cells and the subset of 50% (138) cells, respectively, but significant difference for the subset of less than 25% (69) cells. Therefore, our MUDI algorithm is sensitive to at least half of cells. We then tested the MUDI on sn-m3c-seq data47 and scRNA-seq data48 generated from human brain tissues. We first identified scHi-C clusters from human cortex sn-m3c-seq data (Supplementary Fig. 3b) and scRNA-seq clusters from human cortex scRNA-seq data (Supplementary Fig. 3c), respectively. Upon the integration, we identified 24 TISPs for the excitatory neurons (Supplementary Fig. 4a). We not only captured the ground truth TISPs but also identified new transition TISPs (Supplementary Fig. 4b, c). Similarly, we identified 16 TISPs for the inhibitory neurons (Supplementary Fig. 4d–f) including both the ground truth TISPs as well as new transition TISPs. Furthermore, our MUDI was successfully applied in three datasets with significantly different sequencing depths, including (1) sn-m3C-seq data of human prefrontal cortex tissue with an average of 1.2 M contact pairs per cell, (2) scHi-C data of WTC11 cells with an average of 10.5 M contact pairs per cell, and (3) our newly generated scHi-C data of three breast cancer cells with an average of 36.4 M contact pairs per cell (see next four sections). Our MUDI has been able to identify computationally significant and biologically meaningful TISPs, suggesting that our algorithm was much less dependent on the sequencing depth. In summary, we have developed a novel and powerful method, MUDI, to precisely define 3D-regulated and biological-context dependent cell subpopulations.Generating high quality scHi-C and scRNA-seq data in a breast cancer cell model systemIn order to further test and demonstrate the biological-context dependent utility of MUDI, we have generated high quality scHi-C and scRNA-seq data in a breast cancer cell model system, MCF7, MCF7M1 and MCF7TR cells (Fig. 2a), a model system routinely used in the lab46. A total of 293 cells (89 MCF7 cells, 91 MCF7M1 cells, 113 MCF7TR cells) were used for scHi-C profiling (Supplementary Fig. 5a) and 22,425 cells (6172 MCF7 cells, 10,156 MCF7M1 cells, 6097 MCF7TR cells) were used for scRNA-seq profiling (Supplementary Fig. 5b). Single-cell chromatin contacts with very high quality were obtained (Supplementary Fig. 5c) upon preprocessing scHi-C data (Supplementary Fig. 5d, e and Supplementary Data 1), The combined scHi-C data showed a significant correlation with population Hi-C data, i.e., correlation coefficient r = 0.43 for combined single cells MCF7 to population MCF7, r = 0.61 for combined single cells MCF7M1 to population MCF7M1, and r = 0.58 for combined single cells MCF7TR to population MCF7TR, respectively. The correlations were weak among combined single cells, i.e., correlation coefficient r = 0.05 for combined single cells MCF7 to combined single cells MCF7M1, r = 0.28 for combined single cells MCF7M1 to combined single cells MCF7TR, r = 0.07 for combined scHi-C MCF7 to combined scHi-C MCF7TR, respectively (Fig. 2b). Genomic distance dependent contact probability showed markedly characteristic shapes of combined single cells (Fig. 2c, upper left) and individual single cells (Fig. 2c, upper right, lower left, and lower right panels). We also observed that the single cells had highly variable TADs but with more superimposing of cells, the enriched TADs have more similar features of population TADs (Fig. 2d–f). These results demonstrated a high quality of scHi-C data had been successfully produced in cancer cells. Since single-cell omics-seq data are generally sparse, an optimal resolution is needed for the downstream analysis. Our scHi-C data have a low slope of ratio of read pairs to square of bin numbers until the resolution reaches to 1 Mb (Supplementary Fig. 5f), thus the 1 Mb resolution was used for clustering of scHi-C data.Fig. 2: Generation of high quality scHi-C and scRNA-seq data in a breast cancer cell model system.a Workflow for the identification of 3D chromatin structures of breast cancer cell lines at single-cell resolution. b Pearson correlation coefficients of combined scHi-C data with population data. scMCF7: combined MCF7 scHi-C data. scMCF7M1: combined MCF7M1 scHi-C data. scMCF7TR: combined MCF7TR scHi-C data. c Genomic distance dependent contact probability. The thick lines are combined single cells and the thin lines are individual single cells. Superimposing single-cell TADs with 5 or 20 cells compared to population Hi-C TADs d for MCF7, e for MCF7M1, and f for MCF7TR, respectively. All TADs were generated at the resolution of 100 Kb contact map. Source data are provided as a Source Data file.To exclude the effect of structure variations (SVs), we performed single-cell DNA-seq on three breast cancer cell lines each with a biological replicate: 33 MCF7 cells, 33 MCF7M1 cells and 39 MCFTR cells with a total of 105 cells. We found that (1) there was no clear difference on copy number variations (CNVs) among single cells (Supplementary Fig. 5g), (2) scHi-C contacts in the genomic regions where 10% cells had CNVs had a very low ratio (almost zero) and (3) there was not any significant difference between MCF7 cells and MCF7TR cells (Supplementary Fig. 5h). These results illustrated that single-cell level SVs didn’t significantly influence the chromatin contacts.Defining the characteristic single-cell 3D chromatin structureBefore performing scHi-C clustering, we first examined our scHi-C data quality by comparing it with publicly available human scHi-C data. The breast cancer cells from our study were clearly separated from other types of human cells, leukemia cells K56227 and two pluripotent stem cell types, WTC11C6 and WTC11C28 (4D Nucleome Project, Bing Ren Lab) (Fig. 3a and Supplementary Fig. 6a, b). Furthermore, three stages of breast cancer cells, MCF7, MCF7M1 and MCF7TR were also distinctly located in different spaces defined by first three eigenvectors (Fig. 3b, c and Supplementary Fig. 6c). This analysis further validated the high quality of our scHi-C data. We then applied scHiCluster36 to identify an optimal nine scHi-C clusters, C1 to C9 (Fig. 3d) since the peak of the Silhouette coefficient is at 9 (Supplementary Fig. 6d). We removed the cells with the contacts lower than 6 in 1 Mb bins to minimize the false positive rate (Supplementary Fig. 7a–d) and thus obtained a good quality of 231 cells (87 MCF7 cells, 54 MCF7M1 cells and 90 MCFTR cells). Of nine clusters, a majority of cells in C2 and C7 were MCF7, a majority of cells in C1, C3, C4, C8, C9 were MCF7TR, and the cells in C5 and C6 were miscellaneous of three stages of cells (Fig. 3e). Interestingly, C1 and C5 had the smallest size of TADs and the most numbers of TADs (Fig. 3f and Supplementary Fig. 8a), while MCF7M1 cells had smaller sizes of TADs than MCF7 and MCF7TR cells did (Supplementary Fig. 8b, c).Fig. 3: Definition of the characteristic single-cell chromatin structure.a Comparing our scHi-C data with public human scHi-C data. PC1, PC2 and PC3 are first three eigenvectors. b 2D view of scHi-C data of breast cancer cells. c 3D view of scHi-C data of breast cancer cells. d Nine clusters (C1–C9) identified from scHi-C data of breast cancer cells. Each cluster is labeled with oval and assorted colors. e Number and the composition of single cells in individual scHi-C clusters. f The size of TADs of clusters. *: Two-sided Wilcoxon rank-sum test. g The shifted boundaries of TADs of CADs and NADs when TAD bin size is 50 K, 100 K, 200 K, 300 K, 400 K or 500 K. *: Two-sided Wilcoxon rank-sum test. Values in box plot of (f) and (g) from big to small are maxima, the 75th percentile, median, the 25th percentile and minima. Source data are provided as a Source Data file.Although Higashi37 was able to increase our scHi-C data to 20 Kb resolution, there was no significant correlation between cell-type specific TADs and cell-type specific gene expression for each of three breast cancer cell types (Supplementary Fig. 9a–d). Therefore, to better characterize chromatin domains in single-cell resolution, we proposed a novel framework for analyzing 3D chromatin domain behavior among single cells and defined a CAD which is the common 1 Mb genomic region shared by all individual cells within any particular scHi-C cluster that has very high chromatin contact probabilities. Indeed, CADs showed lower shifted boundaries of TADs and greater standard deviations than non-conserved associating domains (NADs) (Fig. 3g and Supplementary Fig. 10a). CADs had different characteristics from NADs in each of nine clusters. For example, CADs in C1 showed the highest shifted boundaries in compared to NADs at 100Kb TAD size (Supplementary Figs. 10b–d and 11a–f), and there were the most CADs either in all cells or per cell for C1, C3, C5, and C9 (Supplementary Fig. 12a, b). Our results thus elucidated that the newly defined CAD is the characteristic single-cell 3D chromatin structure useful for functional analysis of scHi-C clusters.Precisely identifying distinct 3D-regulated cancer cell subpopulationsTo precisely identify the 3D-regulated cancer cell subpopulations, we further conducted scRNA-seq data (Supplementary Fig. 13a, b) with the replicates showing a highly identical pattern in MCF7, MCF7M1 and MCF7TR cells (Supplementary Fig. 13c). We then identified 13 scRNA-seq clusters, D1–D13 (Fig. 4a), in which a majority of cells in D2, D6, and D11 are MCF7, a majority of cells in D1, D4, D5, D8, D9 and D10 are MCF7M1, a majority of cells in D3, D7, D12, D13 are MCF7TR (Fig. 4b). We also identified a gene signature of differentially expressed genes (DEGs) for each of 13 clusters (Fig. 4c and Supplementary Data 2). Interestingly, we found that the cell cycle signaling was among the top enriched pathways from the top 2000 variably expressed genes (Supplementary Figs. 13d and 14a) and the standardized variance of cycling genes is much higher than that of housekeeping genes (Fig. 4d and Supplementary Data 3). More specifically, there were much more cycling genes within DEGs in D3, D5, D7, D8, D10 as well as within CADs in C1, C3, C5, C9 than other scHi-C or scRNA-seq clusters (Fig. 4e). Remarkably, cycling signaling has been used to characterize cancer persister cells, a rare subpopulation of DTCCs with a reversible property45. We thus grouped scHi-C clusters into five categories based on the breast cancer cell stage and the number (high: >9; low: =<9) of cycling genes within CADs: (1) C1, C5—miscellaneous cells with high cycling genes; (2) C6—miscellaneous cells with low cycling genes; (3) C3, C9—resistant cells with high cycling genes; (4) C4, C8—resistant cells with low cycling genes; (5) C2, C7—sensitive cells with low cycling genes. Miscellaneous cells either with high cycling genes (C1, C5) or with low cycling genes (C6) showed higher contact probabilities than sensitive cells (C2, C7) (Supplementary Fig. 14b, c). On the contrary, resistant cells regardless of with high (C3, C9) or low (C4, C8) cycling genes had lower contact probabilities than sensitive cells (C2, C7) (Supplementary Fig. 14d, e). Although both Categories (1) and (3) have high cycling genes, miscellaneous cells (C1, C5) have more contact probabilities than resistant cells (C3, C9) (Supplementary Fig. 13f). We then computed an integration score within MUDI program to integrate five scHi-C categories with four scRNA-seq categories, and thus precisely defined 20 TISPs, G1-20, each representing a 3D-regulated breast cancer cellular state by an integration score (Fig. 4f).Fig. 4: Precise identification of 3D-regulated and biological-context dependent cancer cell subpopulations.a Thirteen scRNA-seq clusters (D1–D13) identified from scRNA-seq data of breast cancer cells. b Number and the composition of single cells in individual scRNA-seq clusters. c Gene expression heatmap of DEGs of scRNA-seq clusters. d The standardized variance of cycling genes and housekeeping genes in top 2000 variable genes. *: Two-sided Wilcoxon rank-sum test. Values in box plot from big to small are maxima, the 75th percentile, median, the 25th percentile and minima. e The distribution of CADs in scHi-C clusters and DEGs in scRNA-seq clusters according to the number of cycling genes and the number of housekeeping genes in each cluster. Green line is the cutoff for high cycling genes and low cycling genes. f Twenty topologically integrating subpopulations (TISPs) (G1–G20) dependent on the number of cycling genes and cell compositions of the scHi-C clusters and scRNA-seq clusters.Characterizing specific topologically integrated subpopulationsWe further examined a few of the TISPs related to cycling genes. Despite both G1 and G9 had high cycling genes in both CADs of scHi-C clusters and DEGs of scRNA-seq clusters, G1 had a higher integration score than G9 (Fig. 5a and Supplementary Fig. 15a). In addition, some of G1 and G9 genes were marked with super-enhancers (Supplementary Fig. 15b, c). Interestingly, G1 genes were enriched with a REACTOME chromatin modifying enzyme signaling pathway and these enriched enzymes had higher integration scores in G1 than those in G9 (Fig. 5b, c). Of 15 enriched genes, ATXN7, ENY2, PRMT6, KDM5B, KMT5A, MBIP, SMARCB1, TADA3 occurred in G1 and G9, BRWD1, CCND1, ELP2, HMG20B, JADE1, KMT2E, MORF4L1 in G9 (Supplementary Fig. 15d). Higher expression of chromatin modifying enzymes in breast cancer patient cohorts showed a lower recurrence-free survival (Fig. 5d and Supplementary Fig. 15e–k). Of these genes, CCND1, ENY2 and KMT5A had epithelial cell-specific cis-regulatory elements at their distal regions in luminal breast cancer patient tissue49. Together, these results suggest G1 and G9 might resemble to cycling breast cancer persister cells and their 3D chromatin structures might be regulated by chromatin modifying enzymes.Fig. 5: Characteristics of TISPs in breast cancer cells.a The integration score of G1 and G9. *: Two-sided Wilcoxon rank-sum test. Values in box plot from big to small are maxima, the 75th percentile, median, the 25th percentile and minima. b Enrichment of REACTOME chromatin modifying enzymes signaling pathway of G1 genes. NES normalized enrichment score. p value was determined by permutation-based calculation with number of permutations at 1000. c Comparison of the integration score between G1 and G9. *: Two-sided Wilcoxon rank-sum test. Values in box plot from big to small are maxima, the 75th percentile, median, the 25th percentile and minima. d The expression of chromatin modifying enzymes in relapse-free and relapse breast cancer patient cohort GSE2990. e Enrichment of REACTOME RNA polymerase II transcription signaling pathway of the combination of G2, G3, G10 and G11. NES normalized enrichment score. p value was determined by permutation-based calculation with number of permutations at 1000. f The expression of transcription regulators in relapse-free and relapse breast cancer patient cohort GSE2990. g Real-time live cell growth curve of PRMT6 inhibitor MS023. Cells treated with DMSO as reference. *p < 0.05, two-sided paired Student’s t test, p value is 0.0291. h Real-time live cell growth curve of DYRK2 inhibitor LDN-192960. Cells treated with DMSO as reference. **p < 0.01, two-sided paired Student’s t test, p value is 0.0097. i Cell proliferation assay of MS023 and LDN-192960 in MCF7 cells. j Cell proliferation assay of MS023 and LDN-192960 in MCF7M1 cells. *p < 0.05, two-sided paired Student’s t test, p value is 0.0262. k Cell proliferation assay of MS023 and LDN-192960 in MCF7TR cells. *p < 0.05, **p < 0.01, two-sided paired Student’s t test. p value of MS023 vs. DMSO in day 2, 4, 6 are 0.0187, 0.0396, 0.0035 individually. p value of LDN-192960 vs. DMSO in day 4, 6 are 0.0302, 0.0307 individually. Three biological replicates were performed, and data were presented with mean values ± standard deviation in (g–k). Source data are provided as a Source Data file in (g–k).On the other hand, cell subpopulations, G2, G3, G10 and G11, had high cycling genes in CADs of scHi-C clusters but low cycling genes in DEGs of scRNA-seq clusters. REACTOME RNA polymerase II transcription signaling pathway was the top enriched pathway from these four subpopulations (Fig. 5e). Of 21 enriched genes, CEBPB and YEATS4 existed in G2, THOC7 and TXNRD1 in G2 and G10, and COX7A2L, RPS27A, UBE2I, ZNF221 and ZNF223 in G10, while RPRD1A existed in G3, NELFA, PPM1D and SRAF1 in G3 and G10, and BNIP3L, BTG2, CNOT6, DYRK2, EAF1, MED1, PABPN1 and TIGAR in G10 (Supplementary Fig. 16a). Higher expression of transcription regulators in breast cancer patient cohorts was correlated with a lower recurrence-free survival (Fig. 5f and Supplementary Figs. 16b–h, 17a–e). Among them, CEBPB, COX7A2L, NELFA, SRSF1, TXNRD1, UBE2I had epithelial cell-specific cis-regulatory elements at their distal regions in luminal breast cancer patient tissue49. Collectively, these results suggest that these four cell subpopulations might resemble to non-cycling breast cancer persister cells and their 3D chromatin structures might be regulated by transcription regulators.To further substantiate our findings, we performed an experimental validation for the drug treatment on the two selected genes identified by our MUDI, PRMT6 and DYRK2. The section of these two genes was purely due to the commercially available inhibitors to them. We treated MS023, an inhibitor to PRMT6, a key regulator in G1 and G9 subpopulations, and LDN-192960, an inhibitor to DYRK2, a key transcriptional regulator in G10. We found both inhibitors showed stronger growth inhibition in MCF7TR cells than that in MCF7 cells (Fig. 5g, h), as well as impeded MCF7TR cells from cell proliferation but not MCF7 (Fig. 5i–k), demonstrating the capability of the inhibitors of these regulators in restoring the drug-sensitivity.Taken together, we propose a mechanistic model with two distinct 3D-regulated cellular states for the transition of drug-sensitive to tolerant cancer cells: (1) a drug-sensitive cancer cell subpopulation with silenced chromatin modifying enzymes initially shows very lower chromatin interactions (Supplementary Fig. 17a); upon an interim drug treatment, this subpopulation activates the enzymes to trigger higher chromatin interacting activities for the cycling genes, resulting in reversible cancer persister cells (Supplementary Fig. 17b); under a long-term drug treatment, they further reshape the altered 3D chromatin structures render a cycling drug-tolerant cancer cells (Supplementary Fig. 17c); and (2) another drug-sensitive cancer cell subpopulation with silenced transcription regulators initially shows lower chromatin interactions (Supplementary Fig. 17d); upon an interim drug treatment, this subpopulation activates transcription regulators to trigger higher chromatin interacting activities for the non-cycling genes, resulting in reversible cancer persister cells (Supplementary Fig. 17e); under a long-term drug treatment, they further reshape the altered 3D chromatin structures render a non-cycling drug-tolerant cancer cells (Supplementary Fig. 17f).

Hot Topics

Related Articles