MethNet: a robust approach to identify regulatory hubs and their distal targets from cancer data

MethNet constructs regulatory networks using TCGA dataMost of the published epigenome-wide association studies (EWAS) have focused on links between differentially methylated regions and the expression of the closest gene. However, it is important to account for long-range interactions when modeling gene expression since regions that are distal on the chromatin fiber can be brought into close proximity by chromatin folding. Indeed, gene regulation occurs largely within highly self-interacting, megabase sized ‘topologically associated domains’ (TADs)27,28,29, that are separated by ‘insulating boundaries’ enriched for CTCF and cohesin. The boundaries are functionally important as they limit inter-TAD interactions, so that enhancers predominantly contact promoters within the same TAD. TADs form via ‘loop-extrusion’, with cohesin rings extruding DNA until they encounter two convergently oriented CTCF binding sites, which form the base of a loop30.To account for this, MethNet considers as potential regulatory elements all the CpG probes that are within a 1Mbp radius of the TSS. It uses gene expression and methylation data from TCGA samples to construct predictive models that quantify the contribution of each site to a gene’s regulation (Fig. 1). TCGA is the largest resource with both genome-wide gene expression and DNA methylation data from the same patient samples and is therefore well-suited for this type of analysis. Gene expression is modeled separately for each cancer-type to account for context-specificity. Elastic-net regularization is used to restrict spurious associations. A consensus network was constructed by aggregating associations across all cancers, so that robust associations found in multiple cancers were ranked higher relative to cancer specific associations and thus are less likely to correspond to false-positive results. Finally, we computed the regulatory potential of every CRE as a function of all the associations it is involved in, and characterized the epigenetic features that contribute (Fig. 1).Fig. 1: Outline of the MethNet pipeline.Paired RNA-seq and DNA-methylation data were accessed from TCGA. The expression of every protein coding gene across every cancer dataset was modeled as a function of the methylation status in a 1 Mb radius surrounding the gene to generate a set of regulatory networks across all cancers. These were aggregated to produce a network of robust associations that were used to identify regulatory elements and hubs.MethNet identifies putative activating and repressing distal associationsThe challenge for any CRE-discovery method is that the number of potential associations grows exponentially with the radius of their regulatory window (Fig. 2a). A-priori there are approximately 8 million potential regulatory associations between CpG sites in a window size of 1 Mb on either side of every protein coding promoter. MethNet uses an average of 400,000 (5%) associations per individual cancer to model the expression of every gene (more statistics are shown in Figure S1). A pan-cancer analysis indicated that there is strong context-specificity in the regulatory network, with up to one third (2.6 million) of all potential CREs having a putative functional impact in at least one cancer. The strength of these regulatory associations varied across different cancer types, with the majority of associations being specific to a particular cancer, consistent with lineage specific transcription factor-mediated gene regulation (Fig. 2b). The context-specificity of the regulatory networks identified are associated with varying degrees of accuracy and reliability across different cancer types. Notably, larger sample sizes enhance MethNet’s capacity to uncover robust and reliable associations (Fig. 2c). This observation underscores the potential scalability of MethNet and its ability to leverage larger datasets to further elucidate the complex regulatory mechanisms governing gene expression in cancer. The results of the MethNet pipeline are shared as supplementary data on figshare (see Data Availability).Fig. 2: MethNet identifies putative activating and repressing distal associations.a Distribution of HumanMethylation450 probes neighboring a protein-coding gene as a function of the window size. At 1 Mbp the average gene has 400 potential regulators. b Histogram of the robustness of MethNet associations as measured by the number of TCGA cancers it appears in. c Performance of MethNet models, measured by the ratio of explained variance (R2), as a function of dataset size. Trend line is fit with linear regression. Shaded area corresponds to 95% confidence interval of the mean performance given the number of samples in a cancer study (n = 24). d MethNet association effect as a function of distance. Associations are grouped based on their ranked distance to a gene, where −1 includes associations from the first CpG island or non-island upstream of the promoter, and positive distance refers to associations downstream of the TES. For every group of potential associations, we calculated the average distance, mean coefficient, and probability of a MethNet association. e Examples of regulatory effects recovered by MethNet in BRCA. Left: A repressive association between IFNγ and a CTCF binding site 250 kb upstream of the promoter. Right: An activating association between GSTT1 and a non-coding RNA 10 kb downstream of the promoter. The linear regression line that is fit models the mean expression as a function of methylation status of the CRE. Shaded area corresponds to 95% confidence interval (n = 868). f Spearman correlation coefficient for the association shown in panel e across all TCGA cancers. We observe that the signal is robust across all cancers (n = 24).MethNet successfully recovers known regulatory mechanisms such as the well documented effect of gene silencing when a promoter is methylated (Fig. 2d). Specifically, methylation of the first CpG island upstream of the TSS has a strong negative correlation with expression. Furthermore, MethNet identifies CpG islands as being more likely to associate with and have a stronger effect on gene expression than inter-island regions in an unbiased way31 (Fig. 2d). On average, the closer a CRE is to its target gene, the greater the probability of having an impact on transcriptional output. Importantly, however, the probability of CRE-target gene association does not diminish to zero, indicating that CRE methylation beyond the immediate vicinity of the promoter can have an impact on gene expression in a context-specific manner.We define MethNet associations that have a negative or positive coefficient as activating or repressing, respectively. In activating associations, a gain of methylation is predicted to decrease gene expression, while in repressive associations, a gain of methylation is predicted to increase gene expression. For example, MethNet identified a CTCF binding site located 250 kb upstream of the IFNγ promoter whose demethylation is linked to transcriptional repression (Fig. 2e and Supplementary Fig. S2). This suggests that CTCF binding to the unmethylated DNA sequence could be acting as an insulator preventing the IFNγ promoter from coming into contact with elements that activate its expression. Silencing of this inflammatory cytokine could have profound effects on immune responses to cancer, and its regulation is thus important in the context of immunotherapy. In contrast, MethNet identified the promoter of a non-coding gene located downstream of GSTT1 that when demethylated activates Glutathione S-transferase1 expression. This suggests that the promoter of the non-coding gene may be acting as an enhancer for GSTT1 (Fig. 2f and Supplementary Fig. S3). Glutathione S-transferases (GSTs) are phase II metabolizing enzymes that play a key role in protecting against cancer by detoxifying numerous potentially cytotoxic/genotoxic compounds. These two examples highlight the complex interplay between DNA methylation and gene expression, providing valuable insight into the mechanisms underlying the regulation of cancer relevant genes.The regulatory potential of CREs is correlated with chromatin context and contact frequencyThe number of potential CREs controlling transcriptional output varies by gene, and the number of genes controlled by a single CRE varies by element. To study individual CREs and rank them by their impact on transcriptional output we defined a metric to capture the intrinsic regulatory potential. In particular, we aggregated a CRE’s contribution to the regulation of all genes in its vicinity. We quantified the importance of an association as the excess of its relative effect size with respect to a null model, where all elements have the same intrinsic potential and the only factor differentiating them is their distance to the gene promoter. The potential of a CRE was calculated by summing all the distance-adjusted contributions to genes in its vicinity (Fig. 3a).Fig. 3: The regulatory potential of CREs is correlated with chromatin context and contact frequency.a Schematic depiction of regulatory potential. We quantified the importance of an association as the excess of its relative effect size with respect to a null model where all elements contribute proportionally to their distance from the promoter. b Enrichment or depletion of regulatory potential by ChromHMM state of the CREs (n = 245,511). Bar length represents mean effect compared to Low Signal state and error bars corresponds to ± the standard error (for details see Methods). c Enrichment or depletion of chromatin remodelers, the transcription machinery, and transcription factor binding sites. d Enrichment or depletion of H3K27ac chromatin loops from CD4-naïve T cells, GM12878 and K562 cells for CREs that do not overlap with protein-coding promoters (n = 166,552). Bar height represents mean effect on MethNet potential for each group compared to “0” loops which is the intercept. Error bars correspond to 95% confidence interval.We next conducted a series of enrichment analyzes to uncover distinctive characteristics that are linked to a CRE’s regulatory potential. First, we analyzed the chromatin state of each CRE using ChromHMM32 (Fig. 3b). These investigations indicate that promoters acting as enhancers controlling other distal genes have the highest overall regulatory potential. Our observations are in line with the known regulatory role of distal gene promoters9. Indeed, the associations detected by MethNet transcend linear distance, highlighting its capacity to identify distal regulatory networks.Methylation has been shown to directly affect the presence of the transcriptional machinery, chromatin modifiers, and binding of transcription factors (TFs) whose chromatin occupancy in turn, can act as a barrier to methylation3,33,34,35. To identify which factors could contribute to the regulatory potential of a CRE, we performed an enrichment analysis and identified members of the RNA polymerase complex (such as POLR2A and POLR2G) as the most enriched factors (Fig. 3c). Additionally, we identified other highly enriched factors known to be involved in chromatin remodeling and altering the methylation status, including EGR1, HDAC1, and PHF8. Depletion of EZH2 was also linked to increased regulatory potential which make sense as EZH2 is part of the PRC2 complex which characterizes inactive chromatin. Moreover, using Hi-ChIP data from the benchmark study of Bhattacharyya et al.36, we observed a strong positive correlation between the regulatory potential of a region and the number of active chromatin loops (H3K27ac loops) anchored at it, further highlighting the dynamic interplay between chromatin structure and regulatory activity (Fig. 3d).MethNet hubs that control multiple genes have an impact on patient survivalThe distribution of the regulatory potential is long tailed, suggesting the existence of CREs with exponentially high regulatory potential (Fig. 4a) linked to multiple genes (Supplementary Fig. S4a, b). Intriguingly, these elements exhibited low methylation variance across the different TCGA datasets, indicating that they could be under robust and stringent regulation. We used the elbow method to show that above a select threshold, CREs were significantly enriched for regulatory associations. This approach identified 6137 CREs, that we refer to as ‘MethNet hubs’, since, as expected, their profile aligns with a model in which CREs exert control over multiple genes (Fig. 4b). A survival analysis, using the Cox proportional hazard model, across TCGA cancers with clinical data, revealed that methylation of MethNet hubs has a bigger impact on the overall survival of patients in comparison to non-hub elements of similar variance, both in a pan-cancer and cancer-specific context (Fig. 4c and Supplementary Fig. S4c). This finding provides evidence for MethNet hubs having a pivotal role in the context of cancer biology and underscores their potential clinical relevance.Fig. 4: MethNet hubs control multiple genes and have an impact on patient survival.a Distribution of regulatory potential as a function of methylation variance across cancers. b Comparison of the distribution of MethNet association per elements for hubs (n = 6,139) versus non-hubs (n = 239,416). Boxplots were drawn with the following parameters: box bounds correspond to 1st and 3rd quantiles, center mark corresponds to median, whisker length is 1.5 the height of the box (inter-quantile region) or up to the extrema of the distribution if they are closer to the box bound. c Mean effect of hub methylation (excluding cancer-specific effects) on overall survival across TCGA cancers. A two-sided Wilcoxon rank sum test with continuity correction is used to calculate the p-value (p-value = 5 × 10−11, nHub = 574, nCRE = 174,139). Boxplots were drawn with default parameters as in panel b. d–f Enrichment of regulatory potential hubs versus non-hub CREs with positive potential (see Methods for details). d Enrichment of hubs versus non-hubs CREs across ChromHMM (n = 121,517). Bar length correspond to mean effect of state versus Low Signal state and error bars correspond to 95% confidence interval. e Enrichment of hub versus non-hub CREs across chromatin remodelers and transcription factor binding sites f Enrichment of hub versus non-hub non-promoter CREs (n = 73,204) as a function H3K27ac chromatin connectivity from CD4-Naive, GM12878 and K562 cells. Bar height represents log odds (logit) effect size on the probability of a CRE being a hub as a function of its connectivity group, no loops is the intercept of the model. Error bars correspond to 95% confidence interval.To gain insight into the characteristics of MethNet hubs, we repeated the previous enrichment analyzes, this time comparing hubs to other CREs with positive potential (Fig. 4d–f). As anticipated, we observed that hubs share many of the characteristics exhibited by high-ranking non-hub elements, however, key differences emerged upon closer examination. First, hubs are more likely located in open chromatin regions than non-hub CREs (Supplementary Fig. S4d, e). Next, in the ChromHMM and binding site analysis, insulating elements and CTCF were respectively enriched in MethNet hubs compared to non-hubs (Fig. 4d, e). This finding is consistent with the known role that insulating elements play in gene regulation via chromatin looping and insulated topologically associated domain (TAD) boundaries37,38. This hypothesis is further supported by our data showing that hubs are depleted in elements lacking chromatin loops (Fig. 4f). In summary, we identified MethNet hubs that are characterized by high-ranking regulatory potential and whose methylation status has low variance. Hubs are enriched for regulatory associations and insulating elements and show significant clinical relevance with respect to cancer patient survival.MethNet hubs uncover known and potentially novel regulatory elementsExamples of two regulatory hubs are shown in Fig. 5. The regulation of the Protocadherin alpha (PCDHA) cluster of genes has been shown to be controlled by a regulatory hub (HS5−1) that overlaps a CTCF binding site, which stochastically activates different PCDHA genes by cohesin-mediated looping39,40. Using MethNet we were able to unbiasedly identify this regulatory hub (highlighted in blue). We also found another hitherto unknown regulatory hub that is enriched for H3K27ac and H3K4Me3 (highlighted in orange) upstream of PCDHA. This hub has predicted regulatory associations with genes from all three clusters of the Protocadherin family, PCDHA, PCDHB and PCDHG, and the region has been characterized as a schizophrenia risk locus that is linked with the regulation of all three protocadherin families in the context of brain41. In-situ Hi-C data reveals the high contact frequency of this CRE with all three Protocadherin families (as shown in red in the Hi-C heatmap).Fig. 5: MethNet hubs uncover known and potentially novel regulatory elements in the Protocadherin gene cluster.(chr5:140,000,000−141,000,000) MethNet associations are shown in the top track. Red associations are activating and black repressing. All other tracks are from UCSC Genome Browser. Chromatin marks and CTCF binding sites data are provided by ENCODE. In-situ Hi-C data were generated by Rao et al., and processed with Juicebox to compute contact enrichment (colormap = log2(observed/expected normalized counts), range [−4, 4]). The HS5-1 enhancer, (highlighted in blue), is a known regulator of the PCDHA cluster. The enhancer on the left (highlighted in orange) is a MethNet discovery that is identified as a regulator of the PCDHA, PCDH and PCDHA clusters. The region of the previously unreported enhancer has increased contact frequency with all three Protocadherin families (highlighted by red in the Hi-C heatmap).High scoring MethNet associations are mediated by long-range chromatin interactionsTo determine whether distal CRE associations identified by MethNet are brought into contact with their target genes by chromatin looping, we performed a promoter-capture Hi-C experiment that identified chromatin interactions from all promoters in the genome in two distinct well characterized A549 and K562 cell lines. The quality control for the promoter-capture Hi-C is shown in Supplementary Fig. S5. Promoter-capture Hi-C, which enriches for loops anchored at gene promoters, allows us to simulate parallel 4C-seq experiments and recover regions that are in physical contact with each promoter. Given that MethNet consists of common and cell type specific associations, we would not expect all associations from our pan cancer analysis to be validated with the promoter-capture Hi-C data from the A549 and K562 cell lines. Indeed, an example of multi-locus interactions for the TP53 gene promoter in A549 and K562 shown in Fig. 6a, highlights the cell type specific chromatin contacts found in the two cell lines.Fig. 6: High scoring MethNet associations are mediated by long-range chromatin interactions.a Example of MethNet associations regulating TP53 that overlap chromatin loops identified using promoter-capture Hi-C from the A549 and K562 cell lines. The Genome Browser session shows the chromatin context around the promoter of TP53 (chr17:6,886,465-8,364,371) for both cell lines. This includes RNA-seq (ENCODE/Caltech for K562, ENCODE/HAIB for A549 ETOH), methylation status (ENCODE/HAIB – orange and blue correspond to methylated and unmethylated regions, respectively), CTCF binding sites, and ChromHMM chromatin states for K562. A549 ChromHMM states were downloaded from Roadmap Epigenomics (15-state core model). All tracks were loaded with default settings, except RNA-seq which was capped at the top for a better overview. Red and black bars correspond to predicted activating and repressive MethNet associations, respectively. Promoter-capture Hi-C loops (shown as arcs) that overlap with MethNet CRE predictions are shown in virtual 4 C format. Darker arcs correspond to loops called in both cell lines. b Bar graph depicting the probability of a MethNet association overlapping with a chromatin loop in either cell line as a function of its score (n = 1,585,070). Bar heights correspond to the probability of an association of the corresponding group overlaping a loop computed using a logistic regression model. Error bars correspond to the 95% confidence interval. c ROC curves showing the ability of MethNet potential to predict chromatin hubs. We only considered gene promoters because of experimental bias. The AUC increases for stricter criteria of hub calling.To determine whether associations with higher scores are more likely to be facilitated via chromatin interactions, we used the union of loops from the A549 and K562 cell lines. We found a robust correlation between association score and probability of loop formation across all levels (Fig. 6b). This data demonstrates that stronger MethNet associations are more likely to act via chromatin contacts, while weaker connections may represent indirect effects.Next, we investigated whether the regulatory potential of MethNet is predictive of chromatin hubs, i.e. CREs that form multi-locus loops42,43. Remarkably, MethNet’s regulatory potential demonstrated a high predictive power for identifying multi-locus loops, achieving an area under the receiver operating characteristic curve (AUC) of 86%, when applying the strictest criteria (Fig. 6c). Although our analysis was primarily focused on promoter hubs (due to the experimental bias of promoter-capture Hi-C), the predictive power of MethNet extended to non-promoter regions when less stringent criteria were used for calling hubs (maximum AUC 84%, Supplementary Fig. S6). In sum, our data indicate that high-ranking distal MethNet associations are brought into close physical proximity by long-range chromatin interactions.Perturbation of MethNet hubs results in altered target gene expressionTo functionally validate the predictions generated by MethNet, we performed perturb-seq that combines targeted perturbation of genomic regions with single-cell RNA sequencing (scRNA-seq). Compared to a regular CRISPR interference (CRISPRi) assay, perturb-seq enables the simultaneous investigation of multiple genomic regions, using a pool of guide RNAs (sgRNAs). For the CRISPRi, we used the dCas9-KRAB-MeCP2 system that blocks the binding of other factors and methylates the DNA as well as histones of the targeted region44, inducing robust silencing. The transcriptomic readout aligns well with the MethNet pipeline, which predicts changes in the expression of target genes. Furthermore, perturbation of distal regulatory elements is more likely to lead to subtle gene expression changes rather than cell death45, particularly when there is more than one CRE controlling a gene target. Supplementary Fig. S7 shows the percentage of A549 cells transfected with dCas9-KRAB-MeCP2 at day 14 after puromycin selection.In total, we targeted 55 potential regulatory elements with 2 to 5 guides each. To address the inherent limitations of the assay, targets were selected based on the following criteria: (i) CREs were unmethylated in A549 cells to allow for methylation by the dCas9-KRAB-MeCP2, (ii) 2 to 5 high-quality guides could be selected using the CRISPick46,47 scoring system, and (iii) putative target-genes were expressed at levels detectable by scRNA-seq. An outline of the perturb-seq validation experiment is shown in Fig. 7a.Fig. 7: Perturbation of MethNet hubs results in altered target gene expression.a Outline of the perturb-seq validation experiment. b MA plot showing the differential expression induced by CRISPRi targeting of regulatory regions. Points correspond to genes targeted by sgRNAs. Point color indicates whether the interaction was predicted by MethNet. Point shape indicates whether the interaction was considered significant using the False Discovery Rate (FDR) of 0.05. c Bootstrap distribution of the number of targeted regions that show significant changes in gene expression. The number of observed validated targets is marked by a red arrow. d Genome Browser session showing an example of a CRE hub validated by expression changes in two predicted gene targets. A zoomed in version of the browser around the hub regulatory element is shown. The effect of all sgRNA guides targeting the CRE on expression (log of normalized counts) of target genes across all cells is shown in the violin plots and boxplots (nUntargeted = 17,790, nTargeted = 1301). Boxplots were drawn with the following parameters: box bounds correspond to 1st and 3rd quantiles, center mark corresponds to median, whisker length is 1.5 the height of the box (inter-quantile region) or up to the extrema of the distribution if they are closer to the box bound. A Welch two-sided t-test is used to calculate the p-value (pANXA2 = 2.8 × 10−6, CIANXA2 = [0.028, 0.069], pGCNT3 = 1.3 × 10−4, CI GCNT3 = [0.034, 0.106]). Figure 7/panel a Created with BioRender.com released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license (https://creativecommons.org/licenses/by-nc-nd/4.0/deed.en).Although we designed our experiment so that each cell received a single guide, some cells contained multiple guides (Supplementary Fig. S8). Therefore, we used a linear model with a complex design matrix to deconvolve the individual effects of each guide on gene expression, similar to the approach used by Dixit et al.48. The results of this analysis are illustrated in Fig. 7b. Among the 55 targeted CREs, 17 were validated as being involved in 22 functional associations predicted by MethNet. To assess the significance of the identification of 17/55 functional CREs, we performed a bootstrap analysis, by randomly shuffling the sgRNA labels across cells, while maintaining the total number of detected guides. We generated 20,000 bootstrap samples to estimate the null distribution (Fig. 7c) and estimated that the probability of detecting 17 or more regulatory regions was highly unlikely to have arisen by chance (p = 0.0004). This statistical evaluation supports the robustness and significance of the regulatory regions identified through perturb-seq.Out of the 17 functional CREs identified, 4 were found to be associated with more than one target gene. In Fig. 7d we highlight a regulatory hub that corresponds to the promoter of BNIP2 (highlighted in orange). BNIP2 itself is not highly expressed so it was not captured by the perturb-seq, but expression of two predicted target genes, GCNT3 and ANXA2 (red loops) were found to be significantly downregulated. Both of these genes are associated with poor prognosis in multiple cancers. ANXA2 is a member of the calcium-mediated phospholipid-binding protein family of annexins, involved in epithelial mesenchymal transition, cell proliferation and survival49, while GCNT3, is a member of the N-acetylglucosaminyltransferase family that is associated with cell proliferation, migration and invasion in non-small-cell lung cancer50. Our data indicate that the hub corresponds to a promoter region that is predicted to act as an enhancer for GCNT3 and ANXA2, and thus methylation leads to a drop in their expression as we observed. We also identified two genes, MYO1E and LDHAL6B that are predicted by MethNet to be targets of the hub (highlighted by gray loops in the screenshot) that we were unable to validate. Although both gene targets are expressed and unmethylated in A549 cells, unlike ANXA2 and GCNT3 they were not connected to BNIP2 by chromatin loops in our promoter-capture Hi-C. Chromatin looping is likely to be important for long-range hub mediated regulation, and we speculate that in the case of these two target genes, contacts are regulated in a cell type specific manner. Other MethNet associations validated by the perturb-seq experiment are shown in Supplementary Fig. S9. These include the target gene AMIGO2, a cell adhesion protein that is linked with cell survival and metastasis of multiple adenocarcinomas51,52 GFBP4, a tumor suppressor acting as double-negative feedback in AKT and EZH2 signaling53 (Supplementary Fig. S10). Taken together, these data confirm the robustness of the MethNet pipeline in predicting regulatory associations.

Hot Topics

Related Articles