Position-dependent function of human sequence-specific transcription factors

Experimental methodsCell culture, siRNA and mRNA transfectionsU2OS, HepG2, HEK293T or Vero E6 cells were grown at 37 °C with 5% CO2 in DMEM (Cellgro) supplemented with 10% FBS (Gibco), 50 U penicillin and 50 μg streptomycin per ml (Gibco). For siRNA transfection, cells were washed once with PBS (Gibco), trypsinized for about 5 min and then washed twice with PBS (Gibco) by centrifugation at 400g for 5 min. Subsequently, about 3 million cells were resuspended in 150 µl gene pulser electroporation buffer (Bio-Rad) and siRNAs in 1× siRNA buffer (60 mM KCl, 6 mM HEPES pH 7.5, 0.2 mM MgCl2). The following siRNAs were used: M-011796-02-0005 siGENOME human YY1 (7528) siRNA, M-017924-01-0005 siGENOME human NRF1 (4899) siRNA and siGFP (CCACUACCUGAGCACCCAGU69) as a control. For transfection of the dominant-negative NRF1 mutant (AA1-304)45, the sequence was cloned into pGEM-T and mRNA was synthesized using the mMESSAGE mMACHINE T7 Transcription Kit (Thermo Fisher Scientific). siRNAs and dnNRF1 were used at a final concentration of 10 μM. The mixture was transferred to a 0.2 cm cuvette and pulsed once at 250 V for a constant 20 ms. After electroporation, 1 ml complete growth medium was added and cells grown for 24 h in a 6 cm dish. To collect RNA, the plates were washed three times using ice-cold PBS with 1 ml TRIzol reagent added. Cells were scraped and RNA was extracted as described by the manufacturer with addition of 1 μl 15 mg ml−1 GlycoBlue coprecipitant (Thermo Fisher Scientific).Mouse BMDMsThis study used total RNA originally collected from macrophages generated from C57BL/6 and SPRET strains of mice as part of a previous study. The derivation of BMDMs and their treatment with KLA for 1 h were performed as described previously50. Total RNA from these samples was used to perform csRNA-seq as described below.Western blotCells were lysed in 1× NuPAGE LDS sample buffer (Thermo Fisher Scientific), sonicated for about 30 s and then incubated at 95 °C for 5 min under 2,000 rpm agitation. The samples were then centrifuged for 5 min at 21,000g and about 10 μg total protein was loaded on a NuPAGE 4 to 12% Bis-Tris gel (NP0321BOX, Thermo Fisher Scientific). The gel was washed for 5 min in water, transferred for 90 min at 200 mA in 1× NuPAGE transfer buffer (NP0006, Thermo Fisher Scientific) and blocked for 40 min in ~1% non-fat dry milk in TBS. Original data are shown in Supplementary Fig. 4. Primary antibodies were allowed to bind at 4 °C overnight. The following antibodies were used: anti-YY1 (Santa Cruz, sc-7341, HRP YY1 (H-10), 1:200), anti-NRF1 (Abcam, ab55744, 1:1,000) and anti-β-Actin (Cell Signaling, D6A8, 8457S, rabbit monoclonal antibody, 8457, 1:2,500). Western blots were quantified using Fiji (v.1.53j) with mean grey value only.csRNA-seqcsRNA-seq was performed as described previously21. Small RNAs of around 20–60 nucleotides were size-selected from 0.4–2 µg of total RNA by denaturing gel electrophoresis. A 10% input sample was taken aside and the remainder enriched for 5′-capped RNAs. Monophosphorylated RNAs were selectively degraded by 1 h incubation with Terminator 5′-phosphate-dependent exonuclease (Lucigen). Subsequently, RNAs were 5′-dephosporylated through 90 min incubation in total with thermostable QuickCIP (NEB) in which the samples were briefly heated to 75 °C and quickly chilled on ice at the 60 min mark. Input (sRNA) and csRNA-seq libraries were prepared as described previously70 using RppH (NEB) and the NEBNext Small RNA Library Prep kit, amplified for 14 cycles and sequenced single-end for 75 cycles on the Illumina NextSeq 500 system.ChIP–seqChIP–seq was performed as described previously71. A total of 3 × 106 U2OS cells was fixed for 10 min with 1% formaldehyde/PBS, the reaction was quenched by adding 2.625 M glycine, cells washed twice with ice-cold PBS, snap-frozen in 1 million cell pellets and stored at −80 °C. For dnNRF1–HA ChIP–seq, cells were collected 12 h after electroporation with 3 µg IVT dnNRF1-HA mRNA as described above. Fixed cells were thawed on ice, resuspended in 500 µl ice-cold buffer L2 (0.5% Empigen BB, 1% SDS, 50 mM Tris/HCl pH 7.5, 1 mM EDTA, 1 × protease inhibitor cocktail) and chromatin was sheared to an average DNA size of 300–500 bp by administering 7 pulses of 10 s duration at 13 W power output with 30 s pause on wet ice using a Misonix 3000 sonicator. The lysate was diluted 2.5-fold with 750 µl ice-cold L2 dilution buffer (20 mM Tris/HCl pH 7.5, 100 mM NaCl, 0.5% Triton X-100, 2 mM EDTA, 1× protease inhibitor cocktail), 1% of the lysate was kept as ChIP input, and the remainder was used for immunoprecipitation using the following antibodies: dnNRF1-HA (2 μg anti-HA, Abcam, ab9110), NRF1 (2 μg, Abcam, ab175932) and YY1 (2 μg, ActiveMotif, 61980) and 20 μl of Dynabeads Protein A while rotating overnight at 8 rpm and 4 °C. The next day, the beads were collected on a magnet, washed twice with 150 µl each of ice-cold wash buffer I (10 mM Tris/HCl pH 7.5, 150 mM NaCl, 1% Triton X-100, 0.1% SDS, 2 mM EDTA), wash buffer III (10 mM Tris/HCl pH 7.5, 250 mM LiCl, 1% IGEPAL CA-630, 0.7% deoxycholate, 1 mM EDTA) and TET (10 mM Tris/HCl pH  7.5, 1 mM EDTA, 0.2% Tween-20). Libraries were prepared directly on the antibody/chromatin-bound beads. After the last TET wash, the beads were suspended in 25 μl TT (10 mM Tris/HCl pH 7.5, 0.05% Tween-20), and libraries were prepared using NEBNext Ultra II reagents according to the manufacturer’s protocol but with reagent volumes reduced by half, using 1 µl of 0.625 µM Bioo Nextflex DNA adapters per ligation reaction. DNA was eluted and cross-links were reversed by adding 4 μl 10% SDS, 4.5 μl 5 M NaCl, 3 μl EDTA, 1 μl proteinase K (20 mg ml−1), 20 μl water, incubating for 1 h at 55 °C, then overnight at 65 °C. DNA was cleaned up by adding 2 μl SpeedBeads 3 EDAC in 62 μl of 20% PEG 8000/1.5 M NaCl, mixing and incubating for 10 min at room temperature. SpeedBeads were collected on a magnet, washed twice by adding 150 μl 80% ethanol for 30 s each, collecting beads and aspirating the supernatant. After air-drying the SpeedBeads, DNA was eluted in 25 μl TT and the DNA contained in the eluate was amplified for 12 cycles in 50 μl PCR reactions using the NEBNext High-Fidelity 2× PCR Master Mix or the NEBNext Ultra II PCR Master Mix and 0.5 μM each of primers Solexa 1GA and Solexa 1GB. Libraries were cleaned up as described above by adding 36.5 μl 20% PEG 8000/2.5 M NaCl and 2 μl Speedbeads, two washes with 150 μl 80% ethanol for 30 s each, air-drying beads and eluting the DNA into 20 μl TT. ChIP library size distributions were estimated after 2% agarose/TBE gel electrophoresis of 2 μl of library, and library DNA amounts measured using the Qubit HS dsDNA kit on the Qubit fluorometer. ChIP input material (1% of sheared DNA) was treated with RNase for 15 min at 37 °C in EB buffer (10 mM Tris pH 8, 0.5% SDS, 5 mM EDTA, 280 mM NaCl), then digested with proteinase K for 1 h at 55 °C and cross-links were reversed at 65 °C for 30 min to overnight. DNA was cleaned up using 2 μl SpeedBeads 3 EDAC in 61 μl of 20% PEG 8000/1.5 M NaCl and washed with 80% ethanol as described above. DNA was eluted from the magnetic beads with 20 μl of TT and library preparation and amplification were performed as described for the ChIP samples. Libraries were sequenced single-end for 75 cycles to a depth of 20–25 million reads on the Illumina NextSeq 500 instrument.TSS-MPRATwist Bioscience insert library cloning. Insert library (10 ng) was PCR amplified in 50 µl mastermix (25 µl Q5 2× MM, 0.25 µl 100 µM pMPRA1-LH (5′-GGTAACCGGTCCAGCTCA), 0.25 µl 100 µM pMPRA1-RH (5′-CGTGTGCTCTTCCGATCT)) under the following conditions: 30 s at 98 °C; followed by 2× cycles of 10 s at 98 °C, 20 s at 65 °C and 15 s at 72 °C; followed by one more cycle with access primers and finally 2 min at 72 °C. The PCR product (~200 bp) was run on a 2% agarose gel, then gel-extracted and cleaned up using PureLink Quick Gel Extraction Kit (Invitrogen; K210012). The library pool was Gibson-assembled into BsaI-linearized pTSS-MPRA-Empty plasmid57. This reporter plasmid is based on the background-reduced pGL4.10 reporter backbone (Promega), with a cloning site harbouring the 18 bp pMPRA1-LH sequence followed by tandem BsaI sites for linearization and the TruSeq Read 2-compatible pMPRA1-RH sequence and a downstream landing site for reverse transcription primer RS2, and an eGFP ORF replacing the luciferase 2 gene. pMPRA1 (400 ng) was digested with BsaI in 20 µl (1 µl BsaI, 2 µl CutSmart buffer (NEB)) at 37 °C for 1 h and linearized plasmid was gel-extracted from an agarose/TBE gel using the PureLink Quick Gel Extraction Kit (Invitrogen; K210012). Amplified library was Gibson-assembled into cut plasmid using NEBuilder HiFi 2× master mix with a fivefold molar excess of the library at 50 °C for 1 h in a 4 µl total volume.Transformation. NEB 5-alpha (10 µl) chemically competent Escherichia coli (high efficiency, NEB, C2988) were transformed with 1 µl of Gibson assembly reaction, mixed and placed on ice for 30 min. Cells were subjected to heat-shock at 42 °C for 30 s, and then placed onto ice again for 5 min. A total of 950 µl of room temperature SOC was added and the cells were then incubated at 37 °C for 1 h while mixing at 250 rpm. Then, 200 µl of SOC culture was added to 200 ml of LB broth containing 100 µg ml−1 carbenicillin and agitated at 37 °C, 225 rpm for 16–18 h before isolating plasmids using the PureLink HiPure Plasmid Maxiprep Kit (Invitrogen; K210006). Note that, for higher-diversity libraries such as fragmented genomes or scrambled oligos, precipitate your assembled plasmid and use electroporation.Transfection. About 800,000 HEK293T cells were seeded into 3 ml DMEM (Cellgro, supplemented with 10% FBS (Gibco), 50 U penicillin and 50 μg streptomycin per ml (Gibco)) 24 h before transfection in a 6 cm dish and grown at 37 °C with 5% CO2. For each plate and construct, 25 µg plasmid DNA was transfected using Lipofectamine 3000 (Thermo Fisher Scientific) as described by the manufacturer. After 8 h, cells were washed three times with PBS (Gibco) and RNA was extracted using 1 ml Trizol (Thermo Fisher Scientific).5′-capped RNA enrichment. In total, 10–15 µg RNA (one-third of the total) was dissolved in 15 µl TE′T (10 mM Tris pH 8.0, 0.1 mM EDTA, 0.05% Tween-20), heated to 75 °C for 90 s, then quickly chilled on wet ice. Non-capped RNAs were dephosphorylated with Quick CIP (NEB) and DNA digested by adding 25.25 µl MM1 (double-distilled H2O + 0.05% Tween-20, 5 µl CutSmart buffer, 0.75 µl SUPERase in RNase inhibitor (20 U μl−1, Thermo Fisher Scientific), 0.5 µl RQ1 DNase (Promega), 2 µl Quick CIP). The sample was mixed well and incubated at 37 °C for 90 min. For more complete dephosphorylation of uncapped transcripts, RNA was denatured a second time in MM1 at 75 °C for 30 s in a prewarmed water bath and then quickly chilled on ice for 2 min before incubating the sample at 37 °C for another 30 min. After double cap-enrichment, RNA was purified by adding 500 µl Trizol LS, mixed and then adding 140 µl TE′T and 140 µl CHCl3 + IAA (24:1, Sigma Aldrich). The samples were vortexed vigorously and subsequently centrifuged for 10 min at 12,000g at room temperature (21 °C; allowing the CIP to move to the lower phase). After phase separation, the top layer was taken, and RNA was precipitated by mixing first with 1/10 vol 3 M NaOAc and then 1 vol 100% isopropanol. The mixture was incubated at −20 °C for at least 20 min to overnight, then RNA precipitated by centrifuging at >21,000g for 30 min at 4 °C. The supernatant was removed, the samples centrifuged once more and all of the remaining liquid was removed before washing the pellet in 400 µl 75% ethanol by inversions followed by centrifugation as before. All ethanol was completely removed, and the pellet was air-dried at room temperature for around 3–5 min.Library preparation. RNA pellets were resuspended in 5 µl TE′T, heated 75 °C for 90 s, then quickly chilled on ice. To remove 5′ caps, we next added 10 µl DecappingMM (3.25 µl double-distilled H2O + 0.05% Tween-20, 1.5 µl 10× T4 RNA ligase buffer, 4 µl PEG 8000, 0.25 µl SUPERase in RNase inhibitor and 1 µl RppH (NEB)) and incubated the samples at 37 °C for 1 h. After decapping, 5′ adapters were ligated by T4 RNA ligase 1 using 10 µl L1MM (1 µl 10× T4 RNA ligase buffer, 2 µl 10 mM ATP, 1.5 µl 10 µM 5′ adapter, 4.5 µl PEG 8000 and 1 µl T4 RNA ligase 1 (NEB)) and incubating at room temperature (21 °C) for 2 h. For better results, we next repeated the Trizol clean-up as described in the ‘5′-capped RNA enrichment’ section.RNA pellets were next resuspended in 7 µl annealing MM (1 µl 20 mM RS2 primer (5′-AGCGGATAACAATTTCACACAGGA-3′), 2 µl 700 mM KCl and 4 µl TET (10 mM Tris pH 7.5, 1 mM EDTA, 0.05% Tween-20)) and incubated at 75 °C for 90 s followed by 30 min at 56 °C and then cooled down to room temperature. Next, 13 µl reverse transcription MM (7.5 µl double-distilled H2O + 0.05% Tween-20, 1 µl reverse transcriptase buffer (homebrew, 500 mM Tris-HCl pH 8.3, 30 mM MgCl2), 2 µl 0.1 M DTT, 2 µl 10 mM dNTPs, 0.5 µl SUPERase in RNase inhibitor and 1 µl Protoscript II (NEB)) were added and samples incubate at 50 °C for 1 h.The samples were amplified (95 °C for 3 min; then 14 cycles of 95 °C for 30 s, 62 °C for 30 s, 72 °C for 45 s; and 72 °C for 3 min) at 50 µl volume (25 µl 2× Q5 MM (NEB), 2.8 µl 5 M betaine (Sigma-Aldrich), 2 µl 10 µM 3′ barcoding primer, 0.2 µl 100 µM 5′ primer). Note that the extended extension time is important to ensure that all PCR products are completely amplified. After PCR, 1 µl of 20 mg ml−1 RNase A was added and the reaction was incubated at 37 °C for 30 min. PCR reactions were purified using 1.5 vol of beads (2 µl Sera-Mag carbonylated beads (Cytiva), 34 µl 5 M NaCl, 37.5 µl 40% PEG 8000) and sequenced paired-end 50/30 on the NextSeq 500 system.Data analysisHOMER2HOMER2 enables investigation of DNA sequence fragments and TF binding sites enrichment accounting for the general nucleotide content of the input fragments (for example, total GC content) and position-dependent nucleotide biases (Extended Data Fig. 1 and Supplementary Fig. 5). While HOMER2 is used to analyse sequences and account for sequence biases near TSSs, similar to its predecessor HOMER23, the software package can be used for a wide range of data analysis. HOMER2 is available online (http://homer.ucsd.edu/homer2/; HOMER2 is fully integrated into HOMER starting with v.5).Position-dependent background selection and motif enrichment in HOMER2. To account for the influence of position-dependent sequence bias on the calculation of TF binding motif enrichment, we developed improvements in HOMER to select random genomic sequences that contain the same position-dependent sequence features present in a set of target sequences (for example, sequence anchored at thee TSS). Previously, when HOMER performed motif-enrichment calculations, random sequences from the genome were selected to construct an empirical background set such that the overall distribution of GC% per sequence in the background set matched that of the target sequence set (Extended Data Fig. 1). The primary purpose was to address sequence biases present in regulatory elements that overlap CpG islands, which have different overall sequence characteristics from other regions of the genome. HOMER2 can now select background sequences that preserve positional preferences for nucleotide composition, while still matching the overall GC% fragment distribution in the dataset (Extended Data Fig. 1). Position-dependent nucleotide composition can be considered for different k-mer lengths, that is, k = 1 (simple nucleotide frequency), k = 2 (dinucleotide frequency), k = 3 (trinucleotide frequency) and so on. k = 2 was used in this study. This selection is restricted to datasets with a fixed sequence length to unambiguously assess position-specific information relative to a defined anchor point. In addition to the sequence-selection procedure outlined below, HOMER has also been updated to generate synthetic sequences based on a Markov model describing k-mer transitions in a position-dependent manner to create a background dataset of synthetic sequences with the desired characteristics.First, target sequences (for example, ±200 bp from TSS locations) are assessed for their overall GC% content and positional k-mer content (Supplementary Fig. 5). Target sequences are then sorted by their overall GC% and segregated into n bins corresponding to increasing ranges of overall GC% content (n = 10 for this study). Sequences from the genome matching these GC% ranges are identified and putatively assigned to the appropriate GC-bin. Next, the genomic sequences assigned from each bin are then sampled to generate a set of background sequences that matches the positional k-mer frequencies of the target sequences in each corresponding GC bin. Background sequence selection in each GC-bin is performed using an iterative gradient descent approach that progressively removes sequences until the final desired number of background sequences per bin is reached. During each iteration, the overall similarity between the positional k-mer frequency in the target and background sets is computed. Each target and background sequence is then scored against these k-mer frequency differences based on the k-mers present in each sequence at each position and adding the differences in their frequencies (linear combination). Background sequences for the next iteration are then randomly sampled on the basis of the relative fraction of target sequences that have a similar overall difference score, which attempts to match the same overall distribution of the target sequences. This process is repeated until the differences in k-mer frequencies approaches zero or a maximum number of iterations is reached. Once the iterative selection process for background sequences in each GC bin is complete, the sequences are combined into a final background sequence set and the distribution of the overall GC% and position-dependent k-mer frequencies are calculated and compared to the original target sequences (Supplementary Fig. 5). These sequences can then be used to more accurately consider TF motif enrichment (below) or be exported for other applications.To calculate motif enrichment, target and background sequences are scanned for motif matches using HOMER to generate a complete table of motif occurrences. For any given interval (for example, −50 to −40 relative to the TSS), the total number of target and background motif occurrences within that range are tabulated and their enrichment is calculated using the Fisher exact test (hypergeometric distribution). In cases in which there are proportionally less motif occurrences in the target sequences compared with the background, the depletion is noted and 1 − P is reported. This is performed over all positional ranges and for each motif queried, and the resulting P values are further corrected for multiple-hypothesis testing using the Benjamini–Hochberg method. The corrected log-transformed P values are then reported, with comparisons containing depleted values reported as −1 × log[P] to reflect depletion (versus 1 × log[P] for enrichment). De novo motif enrichment is performed by applying the original HOMER search algorithm using background sequences generated by HOMER2.Motif analysis. To unbiasedly identify the most strongly enriched or depleted TF binding sites associated with transcription initiation (Fig. 1 and Extended Data Figs. 2 and 3), TSRs from U2OS cells were analysed from −150 to +100 bp relative to the TSS using HOMER2 positionally matched sequences from random genomic regions (GCbins = 10, kmer = 2). Maps of known motif enrichment were calculated for all 463 TF motifs in the HOMER known motif database for each strand separately and at each bp using 3 bp windows (Fig. 1d). Enrichment at human LCL TSSs was calculated on both strands every 10 bp using 30 bp windows (Fig. 5a) to directly compare with the analysis based on genetic variants (Fig. 5a). Motif heat maps were created by clustering the log-transformed Padj values using the correlation coefficient as a distance metric (Cluster v.3.0)72 and visualizing the resulting heat map using Java TreeView73 (v.1.1.6r4). Motif occurrences, including histograms showing the density of binding sites relative to the TSS (reported as motifs per bp per TSS), and the average nucleotide frequency was calculated using HOMER’s annotatePeaks.pl tool and visualized with Java TreeView, Excel (v.16.83) or R (v.4.2.2). TSSs were assigned as a promoter TSS if their position was on the same strand and within 200 bp of a GENCODE annotated TSS. Promoter antisense TSSs were defined as those on the opposite strand in the range of −400 to +200 relative to a GENCODE-annotated TSS. Promoter-distal (for example, enhancer) TSSs were defined as those that are found greater than 1 kb from a GENCODE-annotated TSS. Spectral analysis of TF binding sites was performed on TF binding sites found between −120 and −40 bp relative to the TSS, corresponding to the region where many TFs appear to exhibit cyclical patterns of positional preference. The power spectrum (Fourier analysis) was calculated for periods from 0 to 50 bp in 0.1 bp increments on each TF’s strand-specific binding profile. The resulting power spectra were normalized to their maximum value to facilitate comparison. To segregate TSSs on the basis of the presence of initiator core promoter elements, the genomic sequence adjacent to each TSS was scanned for a strand and position-specific match to the sequence IUPAC motif BBCA + 1BW (where the A + 1 defines the initiating nucleotide)74. TSSs with a match were considered to be Inr-containing TSSs.Analysis of TSSs and TF binding sites in the context of natural genetic variation. To assess how variation in TF binding site sequences relate to changes in TSS activity, we developed a framework for natural genetic variation analysis within HOMER2 that was inspired by MAGGIE51. First, variants in cis near TSSs with significant changes in activity are found (for example, <200 bp) and the alleles associated with higher activity are assigned as ‘active’, while their corresponding alternative alleles are assigned as ‘inactive’. If a variant overlaps a given TF binding site, the change in motif log odds score is calculated for that motif (active − inactive) and a distribution of motif score changes is created for all variants impacting that motif at sites within a given distance interval from the TSS. In MAGGIE’s original formulation, the null assumption underlying the nonparametric rank-sum significance calculations was that changes in motif score are independent of TSS activity, implying that the average of the distribution of motif scores should be zero. However, variants found near TSSs with differential activity do not follow a uniform pattern and may impact other sequence features that influence transcription initiation in a position-dependent manner (for example, core promoter elements) (Fig. 1a and Extended Data Fig. 5e,f). This implies that the expected changes in log odds scores for a given motif at a given distance from the TSS may follow a different distribution (that is, !=0).To more accurately assess how genetic variation impacts TF binding sites and TSS activity as a function of distance to the TSS, HOMER2 attempts to model the expected distribution of changes in motif log odds scores given the full distribution of variants observed relative to the TSS. This analysis is limited to single-nucleotide variants (SNVs/SNPs), which are evaluated independently from one another. First, a saturation mutagenesis scan is performed on each sequence to identify all of the positions where a match to a given motif may occur, and all of the potential differences in log odds scores as function of the variant and position are recorded. Then, for a given interval, an expected distribution of motif score changes is constructed taking the changes observed in the saturation mutagenesis analysis and then scaling their expected frequency by the total number of variants of each type (that is, A to C) observed at that position relative to the TSS. This expected distribution is then compared to the observed changes in motif scores from the actual variants and their sequences using nonparametric rank-sum tests (Mann–Whitney U-tests) to calculate the significance of the difference. This analysis is analogous to randomizing the sequences containing each variant while preserving the position and nucleotide identity of the variant relative to the TSS. After all motifs are evaluated at all intervals, the resulting P values are then corrected for multiple-hypothesis testing using the Benjamini–Hochberg method. Average changes in motif score at each position and overlap with GWAS-annotated variants are also reported.For this study, the analysis was performed for all 463 motifs in the HOMER2 known-TF-binding-site library using 30 bp overlapping windows evaluated every 10 bp from −150 to +100 bp relative to the TSS. Larger windows improve the sensitivity by capturing more binding sites of a given TF motif with DNA variants between strains to increase the sensitivity, while smaller regions improve the resolution of the analysis. When analysing differences between strains of mice, active alleles were defined by TSSs that were significantly differentially regulated between strains (see below). For analysing human variants, active alleles were defined on the basis of a positive slope from significant tssQTLs (|slope| > 0.1, P < 0.25).As an alternative approach, we performed a second analysis by directly comparing the sequences found in each mouse strain’s genome assembly using the original MAGGIE program (v.1.1.1, https://github.com/zeyang-shen/maggie; Supplementary Fig. 2c). This approach differs from that above in that it can assess structural variation and indels in addition to SNVs, but does not model the position-dependent changes in expected motif score differences due to the arbitrary types of sequence variation considered. For this alternative analysis, sequences from −150 to +100 bp relative to the differentially regulated TSS (>1.5 shrunken fold change in one mouse strain versus the other) were analysed. We applied MAGGIE multiple times using an overlapping windowed approach to analyse genomic sequences associated with specific distance intervals from the TSS, similar to our approach above. To perform the MAGGIE calculation for each windowed region and TF binding site, sequences corresponding to a given region relative to the TSS were extracted from either the C57BL/6/mm10 genome or from regions in the same distance range relative to the homologous TSS position in the SPRET genome. These regions were then scanned using HOMER to identify TSSs associated with a match to the given motif in at least one of the two mouse strains. These sequences were then analysed using MAGGIE to identify pairs for which strain specific mutations were associated with changes in TSS activity. The significance was reported as the log[P] reported by Maggie, which was then signed to indicate whether the association was more strongly associated with increasing (negative log[P] values) or decreasing (positive log[P] values) TSS activity. TF-binding-site enrichment heat maps were generated by combining signed log[P] value results across all TF binding site and TSS-motif distances and then clustering the values using the correlation coefficient (Cluster 3.0) and visualizing the resulting heat map using Java TreeView.TF–TF spacing and transcription initiation analysis. To analyse patterns of transcription initiation near pairs of TF binding sites (Fig. 4f), non-redundant binding sites for the first TF binding sites were first identified by scanning all TSRs from −300 bp to +300 bp using HOMER2. These sites were then scanned a second time from −300 bp to +300 to identify instances of the second TF binding sites, and each region containing both TF binding sites was then sorted based on the position of the second TF binding sites relative to the first. Note that, if multiple instances of one of the binding sites are found in the vicinity of the other sites, these regions will be represented multiple times in the list. The sorted regions, centred on the first TF binding site, were then used to generate TF initiation level heat maps using HOMER’s annotatePeaks.pl program using the parameter ‘-ghist’.MEIRLOP score-based motif analysis. To analyse how well each TF binding site associates with the activity level of TSS, we used MEIRLOP (v.0.0.16)75, analysing motif occurrences from −150 to +50 bp relative to the TSS and associating them with the total count of csRNA-seq reads (log2). MEIRLOP assesses the dinucleotide content of each sequence and models them as covariates when performing logistic regression. Statistically significant enrichment coefficients (Padj < 0.05) were reported along with their confidence intervals (Extended Data Fig. 2b; https://github.com/npdeloss/meirlop).MEPP positional enrichment scoring. To analyse how the spatial distribution of a TF binding site associates with changes in TSS activity, we used Motif Enrichment Positional Profiles (MEPP, v.0.0.1)76. For a given motif PWM (for example, NRF1 motif), MEPP assesses the positional enrichment of the motif relative to a set of scored sequences (for example, the log2-transformed fold change between the control siRNA and NRF1 siRNA conditions) that are anchored by a key feature (for example, the TSS). MEPP first calculates the positions of the TF binding site across all regions, generating a heat map to visualize the locations and PWM log-transformed odds scores (Fig. 1e (middle)). Positions are indicated relative to the centre of the PWM motif. At each motif position surrounding the key anchor feature, we calculate the partial Pearson correlation of a sequence’s score with the motif’s PWM log-odds score at that position (while controlling for GC content as a covariate in the calculation). The positional correlation is then presented as a profile below the heat map (Fig. 1e (bottom); https://github.com/npdeloss/mepp).csRNA-seq analysisGenome originating TSS location and activity levels were determined using csRNA-seq and generally analysed using HOMER21,23. Additional information, including analysis tutorials are available online (http://homer.ucsd.edu/homer/ngs/csRNAseq/index.html).csRNA-seq (small capped RNAs, ~20–60 nucleotides) and total small RNA-seq input sequencing reads were trimmed of their adapter sequences using HOMER (‘homerTools trim -3 AGATCGGAAGAGCACACGTCT -mis 2 -minMatchLength 4 -min 20’) and aligned to the appropriate genome (GRCh38/hg38, GRCm38/mm10) using STAR (v.2.7.10a)77 with the default parameters. Only reads with a single, unique alignment (MAPQ ≥ 10) were considered in the downstream analysis. Furthermore, reads with spliced or soft clipped alignments were discarded to ensure accurate TSSs. The same analysis strategy was also used to reanalyse previously published Start-seq and PRO-cap TSS profiling data to ensure the data were processed in a uniform and consistent manner, although different adapter sequences were trimmed according to each published protocol.Two separate transcription initiation analysis strategies were used in this study. In most cases, individual, single-nucleotide TSS positions were independently analysed. For a subset of analyses, we analysed transcription initiation levels in the context of TSRs, which comprise several closely spaced individual TSS. Individual TSS locations are useful for characterizing spacing relationships at single-bp resolution, whereas TSRs are more useful for describing the overall transcription activity at whole regulatory elements.To analyse TSSs at the single-nucleotide resolution, the aligned position of the 5′ nucleotide of each csRNA-seq read was used to create a map of putative TSS locations in the genome. To ensure that we use high-quality TSSs that could be reliably quantified across different conditions, only TSS locations with at least 7 reads per 107 total aligned reads across all compared replicates and conditions (for example, control siRNA, NRF1 siRNA and so on) were retained for further analysis21. Furthermore, any TSS that had higher normalized read density in the small RNA input sequencing was discarded as a likely false-positive TSS location. These sites often include miRNAs and other high-abundance RNA species that are not entirely depleted in the csRNA-seq cap enrichment protocol. To quantify the change in TSS levels between conditions, a unified map of confident TSS positions is first determined across the set of experiments to be compared. Then, the TSS activity levels are assessed for each replicate and each experimental condition by first counting the raw read coverage across each TSS and all experiments and normalizing the dataset using DESeq2’s rlog variance stabilizing transform (v.1.38.3)78. Changes in transcriptional activity were then reported as the log2-transformed fold change representing the difference between averaged rlog transformed activity levels across conditions (similar to a shrunken fold change). DESeq2 was used to identify significantly differentially regulated TSS, defined as TSS exhibiting a change of at least 1.5 fold and Padj < 0.05, unless otherwise noted.TSRs, representing loci with significant transcription initiation activity from one or more individual TSSs on the same strand from the same regulatory element (that is, peaks in csRNA-seq), were defined using HOMER’s findcsRNATSR.pl tool, which uses short input RNA-seq, traditional total RNA-seq and annotated gene locations to find regions of highly active TSSs and then eliminate loci with csRNA-seq signals arising from non-initiating, high-abundance RNAs that nonetheless are captured and sequenced by the method (further details are available in a previous study21). Replicate experiments were first pooled to form meta-experiments for each condition before identifying TSRs. Annotation information, including gene assignments, promoter distal, stable transcript and bidirectional annotations are provided by findcsRNATSS.pl. To identify differentially regulated TSRs, TSRs identified in each condition were first pooled (union) to identify a combined set of TSRs represented in the dataset using HOMER’s mergePeaks tool using the option ‘-strand’. The resulting combined TSRs were then quantified across all individual replicate samples by counting the 5′ ends of reads aligned at each TSR on the correct strand. The raw read count table was then analysed using DESeq2 to calculate normalized rlog-transformed activity levels and identify differentially regulated TSRs, similar to the analysis of TSSs78.In all cases, normalized genome browser visualization tracks for csRNA-seq data were generated using HOMER’s makeUCSCfile program using the ‘-tss’ option and visualized using either the UCSC Genome Browser79 or IGV80. Annotation of TSS/TSR locations to the nearest gene was performed using HOMER’s annotatePeaks.pl program using GENCODE as the reference annotation.Additional information about csRNA-seq analysis and tips for analysing TSS data is available at the HOMER website (http://homer.ucsd.edu/homer/ngs/csRNAseq/index.html).Analysis of TSSs across two strains of miceTo analyse how changes in genomic sequence impact TSS activity from two different strains of mice, we first took steps to ensure that TSS positions were conserved and detectable in both mouse strains to avoid analysing TSSs that may exhibit differential activity due to technical/analytical reasons, or TSSs found in non-homologous DNA. This was accomplished by ensuring that all csRNA-seq reads used in the analysis could be aligned to a single, unique location in the genomes of both mouse strains. Furthermore, the location that each csRNA-seq read aligns to in each genome must correspond to a homologous position in the full genome alignment, indicating that they represent the equivalent TSS positions in each strain. This latter filter helps to eliminate TSSs mapping to positions in repetitive or duplicated regions that may not be resolved in one or both genome assemblies.To identify valid TSS positions for the natural genetic variation analysis, csRNA-seq and small input RNA-seq reads were first trimmed to remove adapter sequences. Reads from each mouse experiment (regardless of the strain of origin) were aligned to both the C57BL/6 (GRCm38/mm10) and SPRET (GCA_001624865.1/SPRET_EiJ_v1) genomes using STAR with the default parameters. Only reads that aligned to a single, unique location (MAPQ ≥ 10) were considered further. Next, TSS positions representing the 5′ end of the reads were mapped to the other mouse strain’s genome using UCSC’s liftOver tool and the corresponding C57BL/6/SPRET liftOver files (http://hgdownload.cse.ucsc.edu/goldenpath/mm10/liftOver/mm10ToGCA_001624865.1_SPRET_EiJ_v1.over.chain.gz, https://hgdownload.soe.ucsc.edu/goldenPath/GCA_001624865.1_SPRET_EiJ_v1/liftOver/GCA_001624865.1_SPRET_EiJ_v1ToMm10.over.chain.gz). If the liftOver calculation yielded the same TSS location as the alignment from the other mouse strain, the read was retained for downstream analysis. Confident TSS locations, including DESeq2 rlog-normalized values and log2-transformed fold changes were then calculated based on the alignment positions reported in the mm10 genome as described in the sections above. Strain-specific differentially regulated TSSs used in the analysis of natural genetic variation were determined by DESeq2 using Padj of 0.25, resulting in 431,310 variant–TSS pairs for analysis.Analysis of tssQTLs from LCL PRO-Cap data Variant file merging and filtering. Per-chromosome VCF files containing genotyping data for the samples analysed previously63 were downloaded from the 1000 Genome project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/). Using bcftools81, these VCFs were then filtered for samples and variants observed from 67 individuals corresponding to PRO-cap samples from Gene Expression Omnibus (GEO) accession GSE110638. The variants in these VCFs were then normalized and named using their location and allelic data using bcftools. The per-chromosome VCFs were then merged, while trimming unobserved alternate alleles, removing sites without called genotypes and requiring minor allele counts of at least 1. These were further filtered to include only variants that were flagged as passing all upstream quality checks and were called with a minimum depth of 10. The resulting VCF was converted into PLINK format using plink2 (v.2.00a2.3LM)82, retaining only SNPs with less than 50% of genotype calls missing, and a minor allele frequency greater than 0.05. A set of genotype principal component analysis (PCA) covariates was also generated using plink2.LCL PRO-Cap data processing. To mitigate allele-specific alignment effects, a masked genome was created that set bases at the filtered SNP coordinates to Ns. We then trimmed adapter sequences from reads using fastp and aligned them to the masked genome using STAR using the default parameters. The resulting alignments were then processed using HOMER. Tag directories were then separated on the basis of whether they belonged to PRO-cap or PRO-seq experiments. To call TSSs, PRO-cap and PRO-seq reads were processed in the same manner as csRNA-seq and total small RNA-seq, respectively, to identify a unified set of human TSSs. We then obtained rlog-normalized from raw counts quantifying coverage of the 5′ ends of reads from each sample, using HOMER annotatePeaks.pl. To avoid bias from extreme outliers, we retained only TSSs with a minimum count of 10 reads in 10 samples. To control for broad genome-wide expression variance, we obtained a set of 50 expression PCA covariates.tssQTL calling. To call QTLs from TSS data, we used TensorQTL (v.1.0.3)83 to determine the link between filtered SNP genotypes and TSS expression phenotypes, while controlling for covariance from sex, genotype PCA and TSS expression PCA. TensorQTL was run in both ‘cis’ and ‘cis_nominal’ modes, with a cis window of 300 bp. We then analysed each cis-nominal QTL SNP within the framework of our natural genetic variation analysis, limiting our analysis to tssQTLs with P < 0.25 and |slope| > 0.1, leading to a total of 194,746 variant–TSS combinations used in the analysis.ChIP–seq analysisChIP–seq reads were aligned to the appropriate human genome (GRCh38/hg38) using STAR with the default parameters77. Only reads with a single, unique alignment (MAPQ ≥ 10) were considered in the downstream analysis. ChIP–seq peaks were determined using HOMER’s findPeaks program using ‘-style factor’. Quantification of ChIP–seq reads associated with peaks, annotation to the nearest annotated gene TSS, calculation of TF binding site presence (−100 to +100 relative to the peak centre) and visualization of normalized read pileups for the genome browser were all conducted using HOMER. The same analysis strategy used for our dnNRF1, NRF1 and YY1 ChIP–seq data was also used to reanalyse TF ChIP–seq data from ENCODE84 to ensure that the data were processed in a uniform and consistent manner.TSS-MPRA analysisThree different DNA insert designs were used for TSS-MPRA in this study. For each design, 400–500 DNA inserts were designed consisting of 155 bp of query DNA (described below) and redundantly coupled with 4 or 5 independent 11 bp barcodes optimized for their molecular properties and diversity85 to generate a total of 2,000 DNA inserts per design. Genome-encoded TSR sequences queried by TSS-MPRA were designed to capture the sequence from −113 to +42 bp relative to the primary TSS to capture most of the upstream TF binding sites and core promoter region. Key sequences used in the MPRA design are reported in Supplementary Table 2, and full TSS-MPRA design files and sequences are available at the GEO (GSE199431).To process TSS-MPRA results, raw RNA and DNA sequencing reads, corresponding to the RNA transcripts and input DNA library, respectively, were trimmed for the 5′ adapter sequence GGTAACCGGTCCAGCTCA on the R1 read using cutadapt v.3.4. The trimmed reads were aligned to the reporter library using STAR (v.2.7.10a)77, specifying an option to preclude soft-clipping at the 5′ end of R1 (STAR –outSAMattributes All –genomeDir library/tfsweep.STARIndex –runThreadN 12 –readFilesIn file.fastq –outFileNamePrefix star/Sweep_1. –alignEndsType Extend5pOfRead1 –outSAMtype BAM SortedByCoordinate). For DNA plasmid control samples, the uniquely aligned read pairs were counted to later scale transcriptional output. For RNA samples, uniquely aligned read pairs were further processed to identify exact TSSs, yielding a matrix of start sites per sequence position. Any alignments showing mismatches at their 5′ ends were not counted and reporter sequences with fewer than 50 total DNA alignments were also ignored. Specific DNA insert details and analysis approaches tailored for each TSS-MPRA design are described below.TF-binding-site insertion analysis. A total of 13 TSRs was selected from the human genome (Supplementary Table 2) and DNA inserts were designed to introduce 7 TF binding sites at positions −50, −20 and +25 bp relative to the primary TSS (Sp1, NRF1, NFY, YY1, p53, CTCF or a control sequence; Fig. 3b,c). Binding-site insertion replaced the endogenous sequence at each location to maintain the relative spacing of regulatory element DNA outside of the TF binding site insertion. For RNA samples, the transcriptional output of each sequence was summarized by counting alignments with start sites near the designated promoter region (±7 bp from the TSS). These values were scaled based on plasmid DNA levels and transformed as described above. Overall levels were reported as the log2-transformed ratio relative to the control binding site insertion that does not match any known TF binding sites.TF-binding-site sweep analysis. To unbiasedly assess the position-dependent impact of TF binding sites on transcription initiation levels, we first designed a synthetic promoter sequence that lacked matches to any of the known motifs in the HOMER TF motif library (Supplementary Table 2). TF binding sites corresponding to Sp1, NFY, NRF1 and YY1 were then inserted at 2 bp intervals along the length of the sequence (Fig. 4a–c). NRF1 was additionally inserted at the same 2 bp intervals into either the WT TOB2 enhancer or a mutant version lacking endogenous NRF1 binding sites (Fig. 4e). Binding-site insertion replaced the endogenous sequence at each location to maintain the relative spacing of regulatory element DNA outside of the TF binding site insertion. To analyse the impact of each TF binding site sweep, a scaling factor for each insert sequence was determined by calculating min(10,000/plasmid, 100). After multiplying the start-site counts by the scaling factor, a pseudocount of 1 was added and the values were log2 transformed. Replicates were then merged by averaging. Reference coverage was defined for each promoter–TF binding site pair by calculating the average position-wise signal across all of the sequences (Extended Data Fig. 7). The difference between the merged signal and reference coverage was smoothed for visualization using the R loess function with span parameter set to 0.1.TF-binding-site mutation analysis. In total, 133 TSRs were randomly selected and TF binding sites matching 20 different motifs were mutated to monitor their impact on transcription initiation patterns. In addition to the wild-type TSR sequence [−113, +42], a separate insert was designed for each motif where one or more TF binding sites were found. Sequences corresponding to the TF binding site were replaced with the same control sequence in each case starting at the 5′ end and continuing the length of the binding site (control sequence: TAACTGTAATACCTCCTGAAGTC). Only motifs with matches to at least 20 different TSRs were used in the analysis (Sp1, NFY, NRF1, YY1 and ETS; Fig. 5e and Extended Data Fig. 10d). DNA-scaled and log-transformed start site values were calculated as with the above site sweep analysis. For each motif, start positions were classified as enriched or depleted based on an overrepresentation analysis in genomic contexts (Padj < 0.01, data from U2OS motif enrichment analysis; Fig. 1d). TSS shifts were determined by finding the weighted mean TSS position per insert, then subtracting the mean position per mutant from the mean position of the relevant control.Reproducibility analysis. The reproducibility of MPRA results was assessed by comparing (1) the variation in initiation activity levels among different barcode replicates for the four TSRs displayed in Fig. 3b (Extended Data Fig. 6a); (2) comparing summary heat maps of the TSSs and their normalized activity levels captured by TSS-MPRA for a 2 bp incremental sweep of TF binding site sweeps (Sp1, NRF1, NFY and YY1) from −100 to +40 using four different barcode sets (Extended Data Fig. 7a–e); and (3) comparing TSS activity levels for a given DNA fragment across at least two biological replicates and between independent barcodes for each TSS-MPRA library (Supplementary Figs. 6 and 7).Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles