Context transcription factors establish cooperative environments and mediate enhancer communication

Ethical approvalThis research did not require previous ethical approval. Information on interindividual genetic variability was obtained from a previous study45 in the form of a publicly available dataset in summary format.Defining enhancer categoriesInformation on caQTLs, peaks and CMs in LCLs from 100 individuals45 was retrieved from Zenodo (DOI: 10.5281/zenodo.1405945). We merged files with caQTL and peak information and isolated SNPs exactly one nucleotide in length and falling within peak boundaries.To isolate caQTL and no-effect SNPs, we applied the following criteria using information provided in the original publication45 (see Supplementary Methods for detail): caQTL = PcaQTL × PLead > 0.7 and PcaQTL > 0.999 and SNP-to-peak-center-distance ≤ 350 bp; ‘no-effect’—PcaQTL < 0.2 and PLead < 0.5 and no overlap with CMs or caQTL-neighbor peaks. In addition, we only retained enhancers using the annotatePeak function (ChiPseeker67) and gene annotations from the TxDb.Hsapiens.UCSC.hg19.knownGene package.No-effect enhancers were downsampled to a subset with equal GC content and ‘SNP-to-peak center’ distances as that of the caQTL set. Sampling was repeated 50 times to maximize the robustness of results (Supplementary Methods). A random sample of enhancers was generated by sampling 25,000 peaks from the full peak table without replacement. Only enhancers annotated as ‘distal’ or ‘intron’ were retained, resulting in ~21k total putative enhancers.GM12878 ATAC–seq processing and peak scoringDNA accessibility (ATAC–seq) data for GM12878 were retrieved from ENCODE (experiment: ENCSR637XSC). Read 1 and Read 2 fastq files were combined before aligning to the hg19 reference genome (GRCh37, release 75 from Ensembl) using the BWA-MEM tool68 with standard settings. Alignment files were sorted and indexed with SAMtools (version 1.9)69 and subsequently transformed to the ‘bigwig’ format using the bamCoverage (version 3.5.0) command from deepTools70. To obtain an accessibility value for each enhancer, counts falling within ±350 bp around the peak center were summed.Human TF motifs were downloaded from the HOCOMOCO database3 (version.11). To generate enhancer context scores, peak sequences (±350 bp of center) were scored in forward and reverse orientation using the ‘calculate’ function within Biopython71. For each enhancer, the best score was retained (top scoring). For caQTL enhancers, top scoring was done separately for each genotype. Different top scores across genotypes indicate the creation or destruction of an enhancer-wide top site. Sequences from the random enhancer sample were scrambled, and top scores were recomputed to control for enhancer base composition biases72.TFBS score computation and TF class thresholdingTo compute TF accessibility scores, we used a log-linear regression model, relating the cumulative ATAC–seq signal of randomly sampled enhancers to the top score of a TF in those enhancers while controlling for base composition bias (Supplementary Methods). We used the ‘Holm-Bonferroni’ correction to control for multiple hypothesis testing and P < 0.01 to define TFs with significant accessibility scores.For the initiator score, we computed the number of times a SNP created a new best site for a TF when going from the less to the more accessible genotype and compared frequencies between caQTL and no-effect enhancers using Fisher’s exact test. A final log2 odds ratio was assigned by averaging across the 50 samples of no-effect enhancers, and a cutoff of 0.9 was set to define initiator TFs (Supplementary Methods). We repeated the analysis to look for ‘repressors’ by asking how often a caQTL versus no-effect SNP creates a new best site when going from the more to the less accessible genotype.For the context score of a TF, we used a linear regression model relating the top BS score of a TF in the less accessible genotype to the binary enhancer status (caQTL or no-effect), while controlling for overall ATAC–seq coverage for each TF separately (Supplementary Methods). The procedure was repeated across the 50 samples of no-effect enhancers, and the average across all samples was used as the final context score (t-statistic of regression coefficient). Enhancers in which the SNP created a new best site for a given TF were dropped before fitting each model. A threshold of t > 3.2 was used to define context TFs (Supplementary Methods).All TF motifs with a significant initiator score obtained the ‘-initiator’ label. Next, initiators were subdivided into context or independent initiators based on their context score (t-statistic > or <3.2). TF motifs with high context- but low initiator scores were labeled ‘context-only’. Finally, TF motifs passing the accessibility score threshold, but not those of the other two scores, were labeled ‘USTs’, with the name derived from ref. 48. Overlap with USTs was assessed using a Fisher’s exact test.TF cell type specificity indexTo link TFs to enhancer annotations, we leveraged the DNA hypersensitive sites (DHS) index50 and isolated enhancers annotated as either ‘lymphoid-specific’ (specific) or ‘tissue-invariant’ (universal). We derived a ‘lymphoid specificity index’ for TFs in the following manner: we split the random set of enhancers into those with above- or below-average top TFBS scores (mean-centered), computed the ratio of lymphoid-specific versus tissue-invariant enhancer annotations for each of the two splits, and took the ratio of ratios as the final indicator.Analysis of TF BS syntaxSequence logos across caQTL enhancers were generated using the R package ‘ggseqlogo’ (Supplementary Methods). Motif similarities were computed with the TomTom tool part of the MEME-Suite using default settings and visualized by hierarchical clustering. TF class-wide summaries were generated by dividing the number of similar motif pairs (q < 0.1) by the number of possible motif pairs across classes (Supplementary Methods).To compute the number of nonoverlapping top TFBSs, we first extracted the position of top sites with above-average scores. Next, we merged overlapping motifs (less than 10 bp apart) to obtain a final, nonoverlapping count of TFBSs in an enhancer. We computed distances to the peak center or the enhancer-centric SNPs for all TFs separately. For the distribution of top TFBSs with respect to the peak center, all TF-peak center distances within a TF class were considered. For the positioning with respect to the SNP, the average top TFBS-to-SNP distance per TF was computed first, and class-wide differences were assessed across these averages.For the analysis of combinatorial TF binding, caQTL enhancers were split based on whether a new best BS for TF1 was created or not. Between the two splits, we compared (1) the top TFBS scores for TF2 using a two-sided Student’s t-test and (2) the distances between the top TFBS of TF2 and the SNP (TF1 site). For (2), we set the cutoff for a potential fixed BS syntax to distances <25 bp and used Fisher’s exact test to compute odds ratios (Supplementary Methods). To compare associated P values, we took the negative log10 P value first and assigned a negative sign for odds ratios <1 (Supplementary Fig. 4).ASA in GM12878We performed ASA detection across the whole genome using GM12878 ATAC–seq data downloaded from ENCODE73 (experiment: ENCSR637XSC). We downloaded phased genotype information from the Genome in a Bottle database (https://www.nist.gov/programs-projects/genome-bottle). We (1) aligned the .fastq file on the hg19 reference genome (GRCh37, release 75 from Ensembl) using BWA-MEM68 (v0.7.17-r1188), (2) marked duplicated reads using Picard74 (v2.17.8) and (3) counted the reads for each allele of the variants described in the phased genotype VCF file using Freebayes75 (v1.3.4) with the following options (–report-monomorphic, –only-use-input-alleles, –min-alternate-fraction 0, –variant-input (vcf file); Supplementary Methods) and summarized the ASA files by counting the RO (reference) versus AO (alternate) allele for heterozygous calls.SuRE-seqWe selected an equal number of caQTL-probability- and GC-content-matched LOCAL- and CM-lead enhancers, as well as their respective neighbor/dependent elements encompassing a total of 270 bp (for a detailed selection strategy, see Supplementary Methods). A full list is included in Supplementary Data 1. The designed sequences were submitted to Annogen (https://www.annogen.bio/) for SuRE screening services in GM12878. Transfections and data acquisition were handled by Annogen, following the transfection protocol described here52.Count tables were processed as follows: barcodes with <2 coverage in the barcode-to-element mapping library (‘Reads’ in the raw count tables), that mapped to more than one fragment or that had 0 counts in the plasmid library were discarded. Elements with less than four barcodes were discarded. Different orientations of fragments were combined before computing the SuRE activity of each element-barcode combination as the log2 ratio of mRNA over plasmid DNA counts (Supplementary Methods). To compute replicate agreement, each element was summarized by its average SuRE activity across barcodes.sureQTLs were defined as caQTL elements whose SuRE activity differed significantly across genotypes. Significance was assessed using the Student’s t-test and the ‘Bonferroni’ method to adjust P values. The sureQTL–eQTL overlaps were calculated using eQTL information given in the variant table of the original LCL study45, conditioning on either all sureQTLs or sureQTLs with increasingly large effect sizes.We used Pearson’s correlation to compute the relationship between average SuRE activity (less accessible genotype) and the top TFBS within fragments. To compare sureQTL to caQTL (ASA in GM12878) effect sizes, we split fragments by their sureQTL status first. To test for the enhancing effect of TF motif pairs, we compared averaged SuRE activities across fragments (less accessible genotype) containing above-average top TFBSs for either both TF motifs or only one of them using a two-sided Student’s t-test. The differences in means were extracted as a summary statistic.STARR-seqSTARR-seq libraries containing 110 bp of sequence context and fixed primer sites on each end were ordered from Twist Biosciences. For a full list of sequences in either the 5-bp or 10-bp spacer library, see Supplementary Data 2 and 3. A detailed design overview is given in the Supplementary Methods. In brief, we generated three random sequence contexts (no-TFBS) in which different homotypic and heterotypic motif repeats were embedded using eight TF motifs (not enriched but highly expressed, MYC = CCACGTGC; independent initiator, NFKB (REL) = GGGAAATTCCC and CTCF = CCACCAGGGGGCGC; context initiators, SPI (PU.1) = AAAGAGGAAGTGA, IRF = GAAAGCGAAACT and RUNX = TTTGTGGTTT; context-only, MEF2 = GCTAAAAATAGAA and FOX = CTGTTTACTTT).Libraries were amplified with primers carrying a 12-bp random barcode (3′ end), Nextera sequencing adapters, as well as overhangs for assembly into the STARR-seq vector (Addgene, 99296; Supplementary Table 1) according to the manufacturer’s instruction (ten cycles). Amplified pools were cloned into the STARR-seq vector by Gibson Assembly (New England Biolabs, NEB) and transformed into Dh5α high-efficiency competent cells (NEB). Transformed cells were spread on Ampicillin plates to achieve ~50,000 individual clones for both 5-bp and 10-bp spacer libraries. Colonies were scraped and transferred to a liquid growth medium (Luria Broth) and grown at 37 °C for 2 h before plasmid DNA isolation (Maxiprep Kit, Invitrogen).MEC1 cells were cultured in Iscove’s Modified Dulbecco’s Medium (IMDM, Gibco) supplemented with 10% fetal bovine serum and 1% penicillin–streptomycin to obtain ~50 million cells. Transfections of 5-bp and 10-bp libraries were performed in duplicate using ~6 × 106 cells and 30 μg STARR-seq plasmid per replicate. Transfections were carried out with the Neon Transfection System (Thermo Fisher Scientific)—three pulses of 20 ms at 1,200 V in R buffer using 100 μl tips. Cells were collected 24 h post-transfection and lysed in Trizol (Thermo Fisher Scientific). RNA was extracted by chloroform extraction, followed by isopropanol precipitation with a 1:1 ratio. RNA pellets were resuspended in RNAse-free water and digested with DNAse I for 20 min at room temperature (Zymo). DNAse I-treated RNA was purified with an RNA Clean and Concentrator-25 Kit (Zymo). A total of 10–12 μg RNA was reverse transcribed using the STARR-seq-specific reverse transcription primer (Supplementary Table 1). For each microliter of Maxima H minus reverse transcriptase (Thermo Fisher Scientific), 3 μg input was used. Reverse transcription reactions were diluted 1:4 with RNAse and DNAse-free water before running the splice-junction PCR using the STARR-seq thio-splice-junction primer and a custom reverse primer (Supplementary Table 1). About one-third of the RT reaction was amplified for 19–22 cycles. Illumina barcodes were added by a final PCR with eight cycles following Illumina Nextera guidelines. After each amplification step, a cleanup using AMPure XP magnetic beads (Beckman Coulter) was performed. Plasmid input libraries were amplified from the plasmid pool directly with Nextera Read 1 and Read 2 indexing primers (600 ng plasmid, six separate reactions, eight cycles). Plasmid and RNA libraries were sequenced on a NEXTSeq or MiSeq (Illumina) desktop sequencer at EPFL’s Gene Expression Core Facility with a 75- or 150-cycle kit.Barcodes were assigned to fragments using custom Python (version 3.9.5) and R (version 4.1.0) scripts. Barcodes mapping to more than one fragment or having less than five counts in the plasmid library were discarded. STARR-seq activity was defined as the log2 RNA to DNA ratio for each barcode. The fragment-level activity was derived by averaging across barcodes linked to each fragment. Normalized activities were derived by subtracting the average log2 ratio of the respective ‘no-TFBS’ sequence contexts alone (Supplementary Methods).To compute cooperativity for heterotypic multimotif combinations, the following two approaches were considered: expressing cooperativity as the difference between normalized, heterotypic fragment activities and (1) the sum of the underlying monomeric motif activities, or (2) the weighted average of the homotypic motif activities with equal length. For an example and more detail, see Supplementary Methods.Genome-integrated STARR-seqAn endogenous version of the STARR-seq assay was created by adding a loxP and a mutant lox2272 site flanking the origin of replication (ori) and the polyadenylation site of the aforementioned STARR-seq library carrying the homotypic and heterotypic motif combinations of 8 TFs with a 5 bp spacer (for a description of the assembly strategy, see Supplementary Methods and Supplementary Table 1). The resulting plasmid pool was transformed into One Shot OmniMax 2T1 high-efficiency competent cells (Invitrogen) and spread on Ampicillin plates to achieve ~30,000 individual clones and DNA was extracted as for the episomal STARR-seq library.To assay endogenous STARR-seq activity, we established a stable MEC1 cell line (MEC1 cells were purchased from DSMZ; ACC 497) expressing Cre recombinase under the control of a blasticidin selection marker and containing a Cre-recombinase-mediated cassette-exchange landing pad (loxP and lox2722 sites flanking a CMV-GFP). The cassette was integrated within the AXIN2 gene locus where it replaces the TSS as well as the 3 kb regulatory sequence upstream of the TSS (loxP downstream AXIN2 sequence, GCCGCCGGGCGGCCCCGAAATCCATCGCTC and lox2272 upstream AXIN2 sequence, CTGCGACTGTAGCAAGAGGGGACTGGGACT; locus described in ref. 46). Genetically modified MEC1 cells were cultured to ~50 million cells in IMDM media (Gibco), with 10% fetal bovine serum, 1% penicillin/streptomycin and 5 μg ml−1 blasticidin. Eight transfections were performed using the same procedure as for the episomal STARR-seq libraries. Transfected cells were cultured for another 10 days to allow for the removal of GFP in cells with successful cassette exchange. After 10 days, around 70 × 106 cells were sorted on either FACSAriaII or FACSAriaFusion flow cytometers (BD Biosciences), resulting in ~2 Mio GFP− cells. Sorted cells were cultured for another 8–9 days before extracting genomic DNA (gDNA) and mRNA from a total of 5 million cells in triplicates using the mini AllPrep DNA/RNA extraction kit (Qiagen). A total of 20–24 μg of total RNA per replicate was reverse transcribed using the STARR-seq-specific RT primer (Supplementary Table 1). For each microliter of Maxima H minus RT (Thermo Fisher Scientific), 3 μg input was used. The resulting cDNA was purified with a Clean and Concentrator Kit (Zymo) before amplification. The total amount of cDNA and gDNA for each replicate was amplified using the reverse primer already used for the episomal STARR-seq assay (Supplementary Table 1) and either the STARR-seq thio-splice-junction primer (cDNA) or a primer targeting the STARR-seq intron (gDNA; Supplementary Table 1). PCRs were run at 72 °C and 62 °C for a total of 20–21 cycles. Correct sizes of spliced mRNA and nonspliced gDNA products were assessed using gel electrophoresis. Illumina barcodes were added using nine additional cycles. After each amplification step, PCR products were cleaned up using AMPure XP magnetic beads (Beckman Coulter). Plasmid input libraries were amplified as described for the episomal STARR-seq assay, and plasmid, gDNA and mRNA libraries were sequenced on a NextSeq Illumina desktop sequencer at EPFL’s Gene expression Core Facility with either a 75- or a 150-cycle kit.Barcode-to-fragment mapping was executed as described for the episomal assay. gDNA and mRNA counts were normalized by sequencing depth using a ‘reads per million (RPM) sequenced’ conversion. The three no-TFBS sequence contexts were considered indiscriminately in downstream analyses. To compute endogenous activities, only barcodes with at least two RPM across all three gDNA library replicates were retained. To compute the fraction of BCs with nonzero mRNA levels, the number of unique BCs with a nonzero mRNA count in at least one of the three replicates was divided by the total number of BCs in the gDNA sample. Endogenous, fragment-level activities were computed by taking the average mRNA over gDNA ratio across the three replicates first, before averaging across respective BCs. To minimize effects due to extreme outliers, the top 5% of barcode-level activities were removed. In addition, fragments with less than ten unique expressed barcodes were excluded. Normalization of fragment-level activities was done as described for the episomal assay. BC-level endogenous activities were not normalized to account for the large number of ‘zero’ RNA counts.Prediction of DNA accessibility using EnformerWe used the pretrained Enformer model (https://tfhub.dev/deepmind/enformer/1; model head number 69) to predict DNA accessibility in GM12878 on an NVIDIA GeForce RTX 3090 GPU for each genome-integrated STARR-seq fragment, padded with the genomic sequence surrounding the landing pad (AXIN2 locus; Supplementary Methods). A two-sided Mann–Whitney test was performed to compare predicted DNA accessibility across inserts favored in episomal versus endogenous STARR-seq assays.Computing TF domain propertiesInformation on TF domains was extracted from Ensemble (EnsDb.Hsapiens.v86). DNA-binding domains were removed from the full-length sequence before predicting trans-activation domains (TADs) and low complexity domain scores (approaches described previously53,56 and summarized in Supplementary Methods). Significance between TF classes was assessed using a Fisher’s exact test (TAD presence) and a two-sided Mann–Whitney test (low complexity score), respectively.Association between TFs and epigenomic statesFastq files for H3K27ac and BRD4 ChIP–seq experiments in GM12878 cells were retrieved from ENCODE (ENCFF000ASP, ENCFF000ASU for H3K27ac) or the sequence read archive (SRA) (SRR1636861 for BRD4) and processed as described for the ATAC–seq data but considering ±650 bp around the peak center. A log-linear model was used to relate the enhancer IP signal to TFBS top scores (see Supplementary Methods for a full description). The t-statistic of the TFBS score coefficient was used as the evaluating metric.Defining independent and CM enhancer pairscaQTL enhancers were split into LOCAL leads/CM leads based on whether they were contained within the CM file (‘dag.txt’ in ref. 45). CM leads were subdivided into correlating and anticorrelating, depending on the effect direction of their linked, dependent enhancers (see ‘dag.txt’). For a detailed explanation, see Supplementary Methods.LOCAL and correlated CM leads were matched with respect to GC content and the SNP ‘potency’ (PcaQTL45) using 50 subsampling steps (Supplementary Methods). To create enhancer pairs (LOCAL lead and LOCAL neighbor; CM lead and CM dependent), we additionally required distance-matching (Supplementary Methods). For illustration purposes, we chose the subsample with the assessed enhancer property closest to the sample average.CM quality controlCM presence in isogenic cell lines was confirmed by computing the concordance of effect-size direction across SNPs present in phased enhancer pairs in GM12878 (Supplementary Methods). To establish a quantitative relationship, we used a linear model to predict the effect size of SNPs in neighbor/dependent enhancers as a function of the lead caQTL SNP, controlling for peak and SNP-to-peak-center distances and enhancer type (Supplementary Methods). To rule out bias introduced by 3D genome topology, we downloaded information on topologically associating domains in GM12878 from ENCODE (ENCFF788UTU), lifted coordinates to the hg19 genome version (BSgenome.Hsapiens.UCSC.hg19, version 1.4.3) and compared domain inclusion across the different enhancer-pair types (Fisher’s exact test). To compare the CTCF boundary signal between enhancer pairs, CTCF ChIP–seq data for GM12878 was downloaded from ENCODE and the maximum IP coverage between peaks was compared (Mann–Whitney test). CTCF motif orientation surrounding enhancer pairs was extracted using the searchSeq function within the R package TFBSTools. For a detailed description, see Supplementary Methods.Evaluating TF binding and accessibility differences in CMsTo compare TFBSs, we z-transformed top scores using the random enhancer sample as a baseline. For each TF, we computed average z scores within a given enhancer type by averaging across enhancers and across the 50 samplings (Supplementary Methods). For anticorrelated CMs, comparisons were performed against scrambled enhancer sequences.Available TF ChIP–seq data that had a linked HOCOMOCO motif (48 TFs) were downloaded from ENCODE (see Data availability section; GM12878). Processing was done in the same way as was done for enhancer accessibility scores. Differences in occupancy across LOCAL neighbors and CM dependents were computed by averaging the IP signal within enhancer types before computing the ratio between them.Enhancers were split by type (caQTL/no-effect/CM, etc.) and the caQTL genotype in GM12878 (phased genome information was used to assign genotypes for neighbor/dependent enhancers). GM12878 ATAC–seq coverage was compared across enhancer types and genotypes using a two-sided Mann–Whitney test.Assessing BETi sensitivity in CMsPaired-end RNA-seq FASTQ files belonging to replicates of either JQ1 or DMSO-treated MEC1 cells were retrieved from the SRA (SRR7815327, SRR7815329, SRR7815330, SRR7815331, SRR7815332 and SRR7815333). Paired-end reads were aligned to the hg19 (GRCh37.75) reference genome using the STAR aligner76, sorted, indexed and deduplicated with SAMtools69 and loaded into R using the ‘Rsubread’ package77. The ‘DESeq2’ package78 was used for differential gene expression analysis, keeping genes with at least ten counts across all replicates.Enhancer-to-gene mapping was done by finding the closest TSS. Distances to the TSS were compared between LOCAL leads and CM leads using a Student’s t-test. Annotation data for super-enhancers and active B cell enhancers were retrieved from the SEA62 (v3.0). Enhancers were defined as overlapping if they shared at least 200 bp. Specifics on different enhancer-type intersections and subsamplings are described in Supplementary Methods. Significance assessment between log2(FCs) in expression across groups was done using a two-sided Mann–Whitney test.Statistics and reproducibilityAll statistical analyses were carried out in R, and figures were generated using the ggplot2 package (version 3.4.2). Raw P values are visualized as NS = P > 0.05, *P < 0.05, **P < 0.01, ***P < 0.001 and ****P < 0.0001. P values < 2.2 × 10−16 are reported as ‘P < 2.2 × 10−16’, the default cutoff in R. P values are unadjusted unless otherwise indicated. Where applicable, multiple testing correction was performed and is indicated in the respective Methods or Supplementary Methods. For all comparisons across TF motif classes, the following sample sizes were given: n = 151 for unlabeled TF motifs, n = 120 for USTs, n = 18 for independent-initiator, n = 44 for context-initiator and n = 68 for context-only motifs. For the Mann–-Whitney and Student’s t-test, an unpaired, two-sided test was performed unless otherwise specified.No statistical method was used to predetermine the sample size. No data were excluded from the analyses with filtering steps specified in the respective Methods. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles