Gapped-kmer sequence modeling robustly identifies regulatory vocabularies and distal enhancers conserved between evolutionarily distant mammals

Enhancer and promoter regulatory vocabularies are conserved, yet enhancers rapidly evolveEnhancers are distal to transcription start sites (TSS) and harbor binding sites for TFs. The cell-specific expression of these TFs leads to cell-specific enhancer chromatin accessibility and transcriptional regulation. Over the past decades, the ENCODE consortium has generated thousands of DNase-seq experiments10,21,22, across diverse human and mouse cell/tissues, and this comprehensive library of experiments has allowed us to robustly and systematically identify enhancer elements. Chromatin-accessible peaks broadly fall into enhancer and promoter classes, which have different sequence features and conservation properties. To robustly define these classes, we use two genomic features: distance to the nearest TSS, and the cell specificity of DNase I hypersensitivity (Fig. 1A). We quantified the cell specificity of a DNase I hypersensitive site (DHS) as the fraction of all biosamples in the ENCODE database in which the DHS is inaccessible (Nhuman = 1270; Nmouse = 153). To discretize these classes, we classified all DHSs farther than 2 kilobases from the nearest TSS as distal (if not, proximal) and DHSs with cell specificity higher than 0.7 as cell-specific (if not, constitutive). This partitions DHSs into four classes: distal cell-specific, proximal constitutive, distal constitutive, and proximal cell-specific DHSs (section “Methods”).Fig. 1: Cell-specific enhancer and promoter gkm-SVM regulatory vocabularies are conserved across mammals, while enhancers rapidly evolve.A B-lymphocyte DHSs (N = 55,715) distance to nearest TSS and cell-specificity across 1270 biosamples. B Classification of B-lymphocyte DHSs by their distances to nearest TSS (proximal: <2 kb, distal: > 2 kb) and cell-specificity (cell-specific: DHS in <30% of all biosamples, constitutive: otherwise); average epigenetic signals around DHS peak centers by DHS class. C Pairwise comparisons of kmer-weight vectors, derived from enhancers (cell-specific distal DHS), for a pair of similar samples: human and mouse adult brain (Nkmer = 2,097,152; Pearson Corr. = 0.74). D Pairwise comparisons of gkm-SVM kmer-weight vectors, by Pearson correlation, across various human embryonic biosamples (NESC = 11, NThymus = 9, NStomach = 15, NHeart = 19, NKidney = 32, NBrain = 14) trained on enhancers and promoters. E Human–mouse interspecies DHS prediction accuracy (AUROC) across various cell tissues (avg. between mouse element prediction using model trained on human elements and vice versa). F Schematic of enhancer mapping and conservation level of human fibroblast enhancer/promoter regulatory vocabularies (gkm-SVM interspecies enhancer prediction AUROC) and enhancer/promoter CREs (fraction of conserved enhancers mappable by LiftOver) in chimpanzee, gorilla, orangutan, rhesus, and mouse. G Distribution of DNase accessibility in human fibroblast enhancers and promoters (top 10,000 in each DHS class by total DNase-seq reads mapped) and distribution of DNase signal at primate and mouse loci mapped from human enhancers and promoters. Signals are normalized as fold changes from average fibroblast enhancers and promoter accessibilities in respective species (top 10,000 in each DHS class by DNase-seq read mapped). Boxplots represent quartiles and median, while the green dots represent the mean. Source data are provided as a Source Data file.We observed distinct biochemical signatures in these four element classes. Distal cell-specific DHSs show strong markers of enhancer activity such as ChIP-seq signal for H3K4me1 and lack markers of promoter activity (POLR2A, H3K4me3) (Fig. 1B, Supplementary Figs. 1, 2) (section “Methods”; ENCODE experiment file IDs listed in the Supplementary Data). By contrast, proximal constitutive DHSs have the highest level of chromatin accessibility among the four classes and display clear signatures of promoter activity (POLR2A, H3K4me3) and depletion of enhancer marks (H3K4me1). These classification criteria allowed us to robustly define enhancer elements without the need for diverse histone ChIP-seq experiments, which are currently unavailable for many biosamples assayed with the DNase-seq experiments. Further, many distal constitutive DHSs appear to be CTCF binding sites (a known regulator of chromatin topological organization), and proximal cell-specific DHSs show mixed signatures of enhancers and promoters, which emphasizes the utility of the two criteria (TSS distance; cell-specificity of chromatin accessibility) for precisely defining enhancer elements (Fig. 1A, B). For the rest of the paper, we will refer to distal cell-specific DHSs as enhancers and proximal constitutive DHSs as promoters.Enhancer regulatory vocabularies, obtained through gkm-SVM training on enhancers, are cell-specific. Gkm-SVM is a sequence-based machine learning method that effectively distinguishes enhancers from inactive genomic elements by learning the weighted combination of gapped-kmers that predicts enhancers24. Gkm-SVM training assigns higher weights to gapped-kmers enriched in enhancers, and this information can be mapped to kmer weights, where kmers comprised of predictive gapped-kmers—or enhancer vocabularies—are assigned higher weights. The biological relevance of enhancer regulatory vocabularies has been demonstrated by their utility in predicting functional impacts of enhancer sequence variants25,26,28,29,30. All human and mouse gkm-SVM models used in this study are publicly available in the ENCODE portal (encodeproject.org), and their aliases and experimental inputs are listed in the Supplementary Data. Supplementary Fig. 3 describes how to access the gkm-SVM models through the portal.As enhancer kmer-weight vectors encode DNA binding motifs of core TF regulators, they can be used to quantify how similar enhancer regulatory vocabularies are by cell/tissue and across species. For example, gkm-SVM models trained on human and mouse adult brain enhancers detect the same TFBS, with highly similar enhancer regulatory vocabularies (Fig. 1C; R = 0.74). One of the top predictive kmers (CACCAGATGGC) in both human and mouse brains is a TFBS for neurogenic TFs NEUROG1/2 and ATOH131,32,33. Another predictive kmer for both human and mouse brains (CCATGGCAACC) is bound by RFX family TFs, of which RFX3/RFX5/RFX7 are highly expressed in the human brain34, and their mutations are linked to intellectual and behavioral abnormalities35. On the other hand, kmers associated with TFs highly expressed in non-neural cell/tissues, such as AAAGAGGAAGT (SPI family; blood cell development36) and GGACTTTGACC (HNF4A; liver/pancreas37), have low weights in both human and mouse brain enhancer models. Cell/tissues of distinct identity have low enhancer kmer weight vector similarity (Supplementary Fig. 4; Rbrain vs monocyte, Rspinal cord vs macrophage, Rbrain vs liver = 0.085, 0.11, 0.19). Across a wider range of cell/tissues, biosamples of the same cell/tissue identity consistently show high kmer-weight correlation (mean R = 0.72) while pairs of kmer weights from distinct cell/tissues show lower correlation (mean R = 0.31) (Fig. 1D). By contrast, kmer weights obtained from promoters show low cell-specificity, having high kmer-weight correlation for all pairs of cell/tissues (same tissue mean: 0.80; distinct tissue mean: 0.75). This is consistent with past observations that enhancers are bound by TFs with cell-specific expression while promoter-binding TFs are relatively less cell-specific38.In addition to weight vector similarity, conservation of enhancer regulatory vocabulary can be demonstrated by showing that models trained on human can predict enhancers in mouse, and vice versa. We trained gkm-SVM enhancer models for a set of human and mouse cell/tissues (brain, retina, B-cell, T-cell, intestine, stomach), and scored enhancers in the other species (i.e., train on human enhancers, predict on mouse enhancers & train on mouse, predict on human, and report the average AUROC) (Fig. 1E, section “Methods”). For matched cell/tissues, the accuracy of interspecies enhancer prediction is high (N = 6; mean AUROC = 0.91), while the prediction accuracy between distinct cell/tissues is low (N = 30; mean AUROC = 0.65). We verified that the high interspecies enhancer prediction accuracies for matched cell/tissues are not inflated through potential train–test set leakage from sequence conservation by showing that two train–test splitting methods, random and syntenic splits (designed to prevent leakage through orthology), show no difference in resulting AUROCs (section “Methods”; Supplementary Fig. 5C). On the other hand, interspecies prediction accuracies for promoters are high for both mismatched (e.g., human brain & mouse B-cell; N = 30; mean AUROC = 0.93) and matched cell/tissues (N = 6; mean AUROC = 0.94). These results are consistent with observations that expression of core TFs39,40 and their DNA binding affinities17 are well conserved between human and mouse, and with previous studies showing that machine learning models trained on human can predict enhancers in mouse41,42.Although enhancer vocabularies are highly conserved, this observation is insufficient to map which specific orthologous enhancers are conserved between the two species, as all enhancers receive similarly high model scores. In addition, the evaluation performed above requires enhancer sets in both species. To clearly demonstrate the challenge of mapping evolutionarily related enhancers, we will first use a conventional mapping method (LASTZ/LiftOver) and compute the conservation rate of human fibroblast enhancers and promoters using a set of DHS data in fibroblasts from diverse mammals43 (chimpanzee, gorilla, orangutan, rhesus, and mouse10, in increasing divergence from human) (section “Methods”; Fig. 1F). To assess conservation rate, we compute the fraction of enhancers in one species that map to chromatin-accessible elements in the other species, and average the two reciprocal directions. In spite of the conserved regulatory vocabulary, about 40% of enhancers are conserved between humans and chimpanzees, and this value rapidly decreases to 11% in human and mouse as evolutionary distance to human increases (72% decrease; Fig. 1F). Promoter conservation rate also decreases from 89% between human and chimpanzee to 65% between human and mouse (but at a slower rate, 27% decrease). In contrast, regulatory vocabularies of both enhancers and promoters as quantified by gkm-SVM are constant across all species (Fig. 1F, human–mouse enhancer AUROC: 0.86; promoter: 0.89; human–chimpanzee enhancer AUROC: 0.85; promoter: 0.92). As an alternative metric, we can count the DNase read signal in each species’ orthologous loci relative to average elements in that class, as shown in Fig. 1G for human fibroblast enhancers and promoters, confirming the rapid reduction in enhancer conservation rate. Orthologous chimpanzee loci mapped from human enhancers and promoters respectively are on average 66% and 98% as accessible as average chimpanzee enhancers and promoters. These signals decrease dramatically to 21% (enhancer) and 86% (promoter) when we map human to mouse, further underscoring the rapid evolution of enhancers.We used enhancer regulatory vocabulary to identify the most similar pairs of human and mouse cell/tissues for further quantitative assessment of conservation (Fig. 2). We generated a comprehensive list of 45 human/mouse cell/tissue pairs matched by gkm-SVM enhancer vocabularies (Supplementary Data; section “Methods”) and show that the cell-specific regulatory vocabulary is conserved across all 45 cell/tissues (Supplementary Fig. 6). We will also use this set of 45 pairs of cell/tissues to evaluate our novel genome alignment method, gkm-align, against conventional methods (Figs. 3, 4).Fig. 2: Enhancer conservation levels are highly variable across cell types.A Human–mouse enhancer (distal cell-specific DHS) and promoter (proximal constitutive DHS) conservation rates (mappability to DHS by LASTZ/LiftOver genome alignment/mapping algorithms) in orthologous syntenic intergenic loci. Mean of human-to-mouse and mouse-to-human mappings across the 45 human–mouse cell–tissue pairs (A: adult, E: embryonic); sorted by enhancer mappability (identical orders for H, I). B Human–mouse conservation rate for enhancers, promoters, and CTCF ChIP-seq peaks for B-cell, brain, ESC, and heart. C Human–mouse enhancer conservation rate by alignment (mappability by LASTZ/LiftOver) vs human–mouse enhancer regulatory vocabulary conservation rate (correlation of gkm-SVM kmer weights) for pairs of orthologous gkm-SVM matched tissues (e.g., human and mouse brain; N = 45) and mismatched cell/tissues (e.g., human T-cell and mouse brain; N = 1980) (R = 0.78) D Schematics of human–mouse syntenic intergenic loci (HE human enhancer, ME mouse enhancer, HG human gene, MG mouse gene); comparison of human and mouse enhancer numbers in matched syntenic intergenic loci (N = 12,704) for embryonic brain (syntenic enhancer number constraint R = 0.91) and adult liver (R = 0.67) E Syntenic enhancer number constraint vs human–mouse enhancer conservation rate (N = 45, R = 0.78; linear regression line and 95% confidence interval). F Schematic describing how orthologous and paralogous enhancers are defined (HE human enhancer, ME mouse enhancer, G gene; line indicates sequence homology); dots represent 5000 human enhancers w/ highest DNase signal. Gapped-kmer sequence similarity with top-matched mouse enhancers vs with top-matched human enhancers. Enhancers in red-shaded regions are classified as paralogous enhancers; enhancers in blue-shaded regions are classified as orthologous enhancers. G Ratio of orthologous to paralogous enhancers across the 45 cell/tissue pairs (avg. of human and mouse) vs human–mouse enhancer conservation rate (by genome alignment) H Total fraction of enhancer base pairs annotated by each class of transposable elements (avg. of human and mouse). I Fraction of conserved enhancers and overlap with TEs (repeat = enhancers with more than 50% base pair overlap with type I TEs; non-repeat = enhancers with zero overlap with any repeat annotations). Source data are provided as a Source Data file.Fig. 3: gkm-align algorithm identifies conserved enhancers by finding the optimal alignment path of maximum gapped-kmer similarity.A Sequence similarity between a mouse (M) enhancer and a human (H) enhancer is quantified by their similarity in gapped-kmer compositions (gapped-kmer similarity, or gkm-sim). B Schematic describing computation of pairwise gkm-similarity of all pairs of sliding windows in syntenic genomic loci of the two species. The pairwise similarity values are encoded in a gapped-kmer similarity matrix G. C Schematic describing how whole-genome alignment is performed using the GNA12 inversion locus as an example (dots: short sequence matches; colors: groups of short matches in syntenic blocks; boxes: pairs of human/mouse syntenic loci from which gkm-similarity matrices derive). Visualization of gapped-kmer similarity matrix (G) in FADS syntenic locus D without gkm-SVM repeat masking and E with masking. F Identification of colinear series of conserved elements using matrix G. G Alignment of the HBB Locus Control Region (dot size: gkm-similarity; color: gkm-SVM prediction score at corresponding human locus using gkm-SVM model trained on mouse embryonic liver enhancers; highlights: CREs); HE: human element; ME: mouse element. H Average DNase accessibility and H3K27ac/H3K4me1/H3K4me3 histone ChIP-seq signals at mouse loci mapped from human enhancers (aggregated across nine distinct cell/tissues) using gkm-align. Signals are normalized as fold change from genomic average. I Number of human–mouse conserved enhancers that are identifiable uniquely by LASTZ/LiftOver (x-axis) and gkm-align (y-axis) for each of the 45 cell/tissue pairs. Gkm-align identifies conserved enhancers missed by LASTZ/LiftOver in all tissues. Relative to LASTZ/LiftOver, gkm-align discovers 197 novel enhancers on average per cell/tissue and total 6591 novel enhancers across all 45 cell/tissues. Source data are provided as a Source Data file.Fig. 4: gkm-align identifies more novel conserved enhancers and robustly predicts functional conservation when combined with cell-specific information.A Schematic describing how cell-specific gkm-SVM enhancer prediction model is incorporated into gkm-align for cell-specific weighted alignment. B Number of B-cell human enhancers mappable to mouse B-cell enhancers using LASTZ/ LiftOver (gray dashed line) and gkm-align weighted by gkm-SVM enhancer models trained on B-cell (red), myeloid progenitor cell (green), and thymus enhancers (blue) with varying weights. C Percent change in the number of identifiable human/mouse enhancer pairs by gkm-align, relative to LASTZ/LiftOver, across 45 cell–tissue pairs using cell-type matched gkm-SVM model weighting. Boxes show quartiles, with a line at the median; whiskers extend up to 1.5 times the interquartile range. Line plots for a subset of cell/tissues. Dotted line: same but using random syntenic mapping instead of gkm-align. D Number of human–mouse conserved enhancers that are identifiable uniquely by LASTZ/ LiftOver (x-axis) and gkm-SVM weighted cell-specific gkm-align (y-axis) for each of the 45 cell/tissue pairs. Relative to LASTZ/LiftOver, weighted gkm-align discovers 610 novel enhancers on average per cell/tissue and total 23,660 novel enhancers across all 45 cell/tissues (using the most optimal c for each tissue). E Schematic describing regression model for ranking enhancer mapping to mouse (mDNase) in terms of human enhancer features (qDNase = DNase signal at query human enhancer; gkm-sim = gapped-kmer sequence similarity between human and mouse element; PH(M) = human gkm-SVM score of mapped mouse element). F Predicting DNase-seq signal at mouse loci mapped from human enhancers (mDNase) using combinations of features described in (E) across the 45 human/mouse cell/tissue pairs. Boxes show quartiles, with a line at the median; whiskers extend up to 1.5 times the interquartile range. Source data are provided as a Source Data file.Enhancer conservation is highly variable across cell typesIn the 45 pairs of human and mouse cell/tissues, we observed intriguingly high cell/tissue-specific variability in enhancer conservation. As in Fig. 1F, we defined the conservation rate of human enhancers as the fraction of human enhancers that map to mouse DHS of the matched cell/tissue (e.g., human brain/mouse brain), constraining the mapping by LASTZ/LiftOver to syntenic intergenic loci (section “Methods”). Conservation rates of mouse enhancers were defined similarly, and human–mouse conservation levels were defined as the average of the two reciprocal directions. Promoters showed consistently high conservation rate across the 45 tissues (mean 67%) while enhancers showed a highly variable conservation rate, ranging from 6.7% (embryonic liver) to 31% (embryonic brain) (Fig. 2A). This cell-specific pattern of conservation persists even when we limit our quantification to enhancers with the highest DNase signals (Supplementary Fig. 7). Such lack of conservation is not observed in CTCF ChIP-seq peaks, which are also often distal to TSSs (conservation rate for brain, heart, B-cell, ESC = 82%, 79%, 72%, 76%; Fig. 2B), suggesting that many CTCF loops and topologically associated domains are conserved16,44,45,46. The strong and somewhat counter-intuitive tissue specificity of enhancer conservation will be explored extensively below.We observed low enhancer conservation rates in some tissues in spite of the fact that their regulatory vocabulary is highly similar between human and mouse. To explore this systematically, we computed the similarity of enhancer regulatory vocabularies (measured as Pearson Corr. of enhancer kmer-weight vectors), for both matched and mismatched pairs of human and mouse cell/tissue pairs (e.g., matched: human brain & mouse brain; mismatched: human brain & mouse muscle; Nmatched = 45; Nmismatched = 452 − 45 = 1980; Fig. 2C). Overall, conservation rates of enhancers and similarity of enhancer regulatory vocabularies correlate highly (R = 0.78, Fig. 2C), indicating that human and mouse cell/tissues that share similar sets of core TF regulators also tend to share a higher number of orthologous enhancers. While enhancer regulatory vocabularies were overall highly similar for all matched human/mouse cell/tissue pairs (relative to the mismatched pairs), interestingly, their enhancer conservation rates varied widely (vertical spread of black points, Fig. 2C). For example, the human/mouse adult liver pair had an almost identical similarity of enhancer regulatory vocabulary (R = 0.72) as human/mouse embryonic brains (R = 0.71), but the adult liver enhancer conservation rate was 9.4% (<1/3 of the embryonic brain), even lower than the enhancer conservation rate between a mismatched pair of human embryonic muscle and mouse embryonic brain (12.7%). This implies that some cell/tissues, with conserved core TF regulators and DNA binding specificities, have experienced more incidence of enhancer turnover than other cell/tissues.To eliminate the possibility that the highly cell/tissue-specific rate of enhancer conservation is a bias of the LASTZ/LiftOver alignment/mapping algorithms, we performed an orthogonal analysis. Using 12,455 syntenic intergenic loci of human and mouse derived from 15,500 orthologous protein-coding genes22 (Supplementary Data), we simply counted the number of human and mouse enhancers located in each of the matched syntenic intergenic loci (Fig. 2D; section “Methods”). We compare the correlation between the number of human and mouse enhancers in respective syntenic intergenic loci, which we will refer to as “syntenic enhancer number constraint,” and it imposes an upper limit for mappability of human/mouse enhancers in the matched syntenic loci. If the number of enhancers in syntenic intervals is not conserved, there is no way the enhancers can be conserved at the sequence level unless they arose through duplication. Embryonic human/mouse brain, which had the highest rate of enhancer conservation, also showed the highest level of syntenic enhancer number constraint (R = 0.93; Fig. 2D), while adult human/mouse liver had lower syntenic enhancer number constraint (R = 0.67), with occasionally drastically different numbers of enhancers in matched syntenic intergenic loci. This is consistent with reports of species-specific rewiring of transcription in the liver47,48 and a relatively slower rate of transcriptomic divergence of the brain across mammals49. The lack of syntenic enhancer number constraint appears in cell/tissues with low enhancer conservation level (Fig. 2E), and hints that the lower rate of conserved enhancers in some cell/tissues, as predicted by genome alignment, is an inherent property of the regulatory landscape.This apparent lack of syntenic enhancer number constraint is largely driven by species-specific enhancer duplication, where cell/tissues with lower level of enhancer conservation tend to have a higher proportion of paralogous enhancers (see section “Methods”). To estimate the proportion of orthologous and paralogous enhancers in each human cell/tissue (in the context of human–mouse common ancestry), we labeled a human enhancer as paralogous if it has high sequence homology with another human enhancer but lacks sequence homology with any mouse enhancer, and similarly labeled it as orthologous if it has high sequence homology with a mouse enhancer but lacks homology with any other human enhancers (Fig. 2F; section “Methods”). Based on this criterion, of the 5000 human embryonic brain enhancers with the highest DNase I accessibility, 873 and 100 were identified as orthologous and paralogous enhancers (ratio: 8.73). In contrast, the adult liver had 320 and 905 orthologous and paralogous enhancers (ratio: 0.35). These estimated ratios of orthologous to paralogous enhancers, averaged between human and mouse, closely matched with the syntenic enhancer number constraint and with enhancer conservation rate for the 45 cell/tissue pairs (Fig. 2G; Supplementary Fig. 8), indicating that enhancer duplication events have been a significant contributor to the divergence in enhancer landscapes.These duplication events are largely driven by TE. The paralogous enhancers show significant enrichment of LTR retrotransposons across diverse cell/tissues (Supplementary Fig. 9; section “Methods”), and we observed a general trend that the cell/tissue pairs with low enhancer conservation level tend to have high enrichment of TE (Fig. 2H—bars in the same order as Fig. 2A; Supplementary Fig. 15A; R = −0.88). Quantifying TE enrichment as the fraction of total enhancer base pairs that overlap with a TE, we observed that embryonic brain enhancers, averaged between human and mouse, had <10% enrichment of type I TE (LTR/LINE/SINE) while some cell/tissues, such as ESC, had TE enrichment as high as 33%. Overall, TE enrichment in enhancers appears lower than its genome-wide coverage (>40%), with SINE elements especially depleted in enhancers of most cell/tissues (Supplementary Fig. 10). However, interestingly LTR elements appear to be highly enriched in enhancers across multiple cell/tissues, and their enrichment grows in enhancers with the strongest DNase I accessibility (Supplementary Fig. 11). Although we do observe clear signals of DNase I accessibility in LINE elements (at its 5′ end) for multiple cell/tissues (Supplementary Fig. 13), these generate weak DNase I peaks, and LINE elements are depleted in enhancers of most cell/tissues (Supplementary Fig. 12). Like LTR, LINE enrichment increases with increasing DNase I accessibility, surpassing the genomic average for subsets of top 1000 enhancers of human colon and ESC with the highest level of DNase I accessibility (Supplementary Fig. 12). By contrast, SINE elements are more depleted in enhancers with higher DNase I accessibility (Supplementary Fig. 10). These observations are consistent with a recent report that also found significantly higher enrichment of LTR than SINE in distal enhancers50. This TE-specific and cell/tissue-specific variation in TE-enhancer association suggests that TEs may have a functional role in shaping the enhancer landscape51,52,53,54,55, but it is difficult to separate function from their naturally increased tendency to transpose into accessible regions.Most enhancers with overlapping TE annotations are species-specific (Fig. 2I), and the increase in TE enrichment in enhancers explains much of the decrease in enhancer conservation across cell/tissues. This is not limited to weaker DHS, and in fact TE enrichment in enhancers grows with DNase I accessibility. We show this by repeating the analysis of Fig. 2I using only the top 1000 enhancers with the highest DNase signal (Supplementary Fig. 16). In summary, enrichment of TE-associated enhancers contributes heavily to the observed cell/tissue-dependent variability in enhancer conservation (orange bars in Fig. 2I and Supplementary Fig. 16 grow with decreasing conservation). However, intriguingly, we also observe a decreasing trend of conservation in non-TE-associated enhancers as TE enrichment in cell/tissues increases (gray bars in Fig. 2I and Supplementary Fig. 16 shrink with decreasing conservation; Supplementary Fig. 15B; R = −0.84). We find it curious that the increased enhancer duplication driven by TEs is correlated with the reduction in the conservation level of non-TE-associated enhancers. It is possible that TE-driven functional redundancy allows more rapid evolutionary turnover in these tissues.
gkm-align algorithm finds the alignment path of maximum gapped-kmer similarityIdentification of conserved enhancers in evolutionarily distant mammals, such as mice, is made difficult by their rapid evolution, and limitations of genome alignment algorithms may underestimate conservation. We do not know how significantly LASTZ/LiftOver alignment inaccuracy contributes to the low rates of conservation shown in Fig. 2. To address this issue, we next present a novel genome alignment algorithm that differs from previous methods by using gapped-kmer sequence features that more readily capture the functional elements of enhancer sequences.Enhancers contain degenerate clusters of TFBS, and enhancer mutagenesis studies have shown that enhancer function is strongly affected by mutations within binding sites and robust to mutations between binding sites30,56. This modular architecture more readily tolerates insertions/deletions between TFBS and small structural variations. To exploit this modular structure, gkm-align uses a sequence similarity metric that compares a pair of sequences (e.g., width of 300 base pairs) by their gapped-kmer composition (Fig. 3A; Supplementary Notes 1–5 for algorithmic details). Alternative sequence similarity metrics based on kmer composition have previously been used to detect evolutionarily related sequences57,58,59 (typically using 6-mers), but these methods have not been applied to whole-genome alignments. We chose to generate whole-genome alignments using gapped-kmers because they more accurately model flexible combinations of TF-binding sites24,25,28. Gapped-kmers contain a fixed number of gaps, which represent any nucleotide, and compactly model degenerate positions in TFBS. There exist \(\left({ l \atop k }\right)4^{k}\) gapped-kmers with size l and k non-gapped positions (e.g., N = 5,406,720 for l, k = 11, 7), and the gapped-kmer similarity (gkm-sim) for a pair of sequences is computed as the cosine similarity of these vectors, each encoding the counts of gapped-kmers in the respective sequence.To align a pair of human/mouse loci, we first compute a gkm-similarity matrix (G) of all pairs of sliding window subsequences of the human and mouse loci (Fig. 3B; Supplementary Note 2). The size of this matrix will depend on the locus size; for example, human/mouse loci of 20 kilobases (e.g., FADS loci) have ~1000 subsequences of 300 base pair windows sliding by 20 base pairs, and the gkm-similarity matrix (G) of dimension 1000 × 1000 encodes all pairwise gkm-similarities of the human and mouse subsequences (as shown in Fig. 3D). Exons of matched orthologous genes in human and mouse show highest levels of sequence similarity, but regions of low complexity (e.g., tandem repeats) also show high interspecies sequence similarity due to their prevalence and uniform sequence composition (identifiable as a row of horizontal dots in Fig. 3D). To remove these repetitive sequence matches, we train gkm-SVM to detect and mask sequence patterns that are ubiquitous across the human and mouse genomes (Fig. 3E; Supplementary Fig. 14; Supplementary Note 5). We observed that the optimal masking threshold that maximizes the mapping rate from human enhancers to mouse enhancers masks about 10% of the human and mouse genomes (Supplementary Fig. 17). We then compute G using the masked sequences, to which we apply a variant of Smith-Waterman algorithm to identify an optimal alignment path that encodes how human/mouse loci diverged (Fig. 3F; Supplementary Note 3). This method of alignment is extended genome-wide by utilizing orthologous gene annotations22 and short sequence matches19,60 (Fig. 3C; Supplementary Note 4). The gkm-align software package can be downloaded from https://github.com/oh-jinwoo94/gkm-align.We next demonstrate the gkm-align algorithm at the well-studied hemoglobin beta (HBB) locus control region (LCR)61. These loci in human and mouse each contain 4–5 enhancers, and the human enhancers have shown to be capable of regulating mouse HBB expression through transgenic mouse experiments62. We aligned the HBB LCRs of human and mouse using gkm-align and mapped the five mouse enhancers to human (Fig. 3G). The five mouse enhancers (labeled as ME1, …, ME5) all have strong DNase I accessibility and EP300 binding in mouse embryonic liver (where HBB is active) and are bound by GATA163 in erythroblasts; however, only four of the five human loci mapped from these mouse enhancers (labeled as HE1, …, HE5) show strong marks of enhancers. HE5 has weak GATA1 binding in erythroblast and low DNase I accessibility in embryonic liver. Further, inhibiting HE5 with CRISPRi has the weakest effect on downregulating HBE1 expression among the five putative human enhancers mapped from mouse64,65 (Supplementary Fig. 18; section “Methods”). It is likely that HE5 may have accumulated mutations leading to loss of regulatory activity. This loss of regulatory activity is also predictable by gapped-kmer sequence similarity metrics. As will be justified further below, Supplementary Fig. 18 shows that the geometric mean of gapped-kmer sequence similarity and interspecies gkm-SVM enhancer prediction is consistent with the CRISPRi effect at HBB enhancers.We applied gkm-align genome wide across the 45 cell/tissue pairs and identified many novel conserved enhancers which are not predicted by LASTZ/LiftOver (either predicted to be deleted or which LiftOver maps to inactive regions). Overall, the gkm-align predicted mouse enhancers show clear marks of enhancer activity (Fig. 3H): strong DNase/H3K27ac/H3K4me1 signals and weaker H3K4me3 (averaged across nine cell/tissues for which the mouse histone ChIP-seq data are available; the cell/tissue identities are shown in the Supplementary Data). For all the 45 cell/tissue pairs, gkm-align mapped a higher number of human enhancers to mouse enhancers than LASTZ/LiftOver, with the increase in enhancer mappability ranging from 1% (embryonic limb) to 22% (hematopoietic multipotent progenitor cells) (Figs. 3I and 4C). For the cell/tissue pair of human hippocampus astrocyte and mouse Müller cells (both of which are glial cells), gkm-align successfully mapped 791 human enhancers to mouse enhancers, which are incorrectly mapped by LASTZ/LiftOver (either deleted or mapped to inactive mouse regions). These novel conserved enhancers show clear markers of enhancer activity (Supplementary Fig. 19). Conversely, only 222 human enhancers were correctly mappable uniquely by LASTZ/LiftOver but incorrectly mappable by gkm-align. In total, 8559 human glial enhancers were mapped to mouse enhancers by both methods. We estimate that only 1.68% of these conserved enhancer mappings have resulted by chance (Supplementary Fig. 20; section “Methods”). Together across the 45 cell/tissues, gkm-align identified 6591 novel conserved enhancers. This greatly increases the number of human enhancers which can be functionally tested for disease relevance in mouse models.
gkm-align identifies additional novel conserved enhancers when combined with cell-specific vocabularyAlthough gkm-align outperforms LASTZ/LiftOver, the sequence similarity metric does not explicitly make use of cell/tissue-specific regulatory vocabulary. Gkm-SVM enhancer regulatory vocabularies, encoding TFBS motifs, are well conserved between human and mouse (Fig. 1E, F), and they can be incorporated into gkm-align both to improve discovery of conserved enhancers and to quantify their predicted functional conservation. This additional information leads to an expanded catalog of human enhancers testable through mouse models, ranked by the likelihood of conserved regulatory function.We incorporate cell-specific gkm-SVM regulatory vocabularies into gkm-align following a simple and intuitive model: if a pair of human and mouse enhancers, denoted as HE and ME, are orthologous, then they should have similar DNA compositions (i.e., general sequence conservation), and should both contain conserved TFBS motifs relevant to the shared cellular context (i.e., functional sequence conservation) (Fig. 4A). General sequence conservation (G) is quantified using gkm-similarity, as previously described. Functional sequence conservation (F) is computed using interspecies gkm-SVM prediction scores, which we normalize to vary between 0 and 1 for interpretability (Supplementary Note 5). Denoting \({P}_{H}({ME})\) as normalized prediction score of a mouse element by a gkm-SVM model trained on human enhancers, we can interpret \({P}_{H}({ME})\) as the probability that ME can function as an enhancer in the orthologous human cellular context. \({P}_{M}\left({HE}\right)\) is defined similarly (M: mouse; HE: human enhancer). Then, functional sequence conservation, computed as \(F={P}_{H}({ME})\cdot \,{P}_{M}({HE})\), can be interpreted as the probability that ME and HE can both function as enhancers interchangeably in human and mouse cellular contexts. For cell-specific weighted alignment, we combine the two measures of enhancer conservation into \({G}_{c}={G}^{1-c}\cdot {F}^{c}\,(0\le c\le 1)\), which adjusts the alignment path toward human/mouse element pairs with both similar sequence composition (G) and functional similarity in a common cellular context (F). As expected, the resulting alignments increasingly diverge from the cell-independent (c = 0) alignment paths as the degree of cell-specific weighting (\(c\)) increases (Supplementary Fig. 21A), but we also observe that enhancer mappings are largely consistent among different cell-type specific models (Supplementary Fig. 21B).Cell-specific weighted alignment by gkm-align identifies the highest number of conserved enhancers when it is combined with gkm-SVM enhancer prediction model trained on enhancers of relevant cell/tissue type. For example, LASTZ/LiftOver and gkm-align each identify 1325 and 1529 conserved human B-cell enhancers, but if gkm-align is combined with B-cell trained gkm-SVM enhancer prediction models, the number of identifiable conserved enhancers increases up to 1818 at cell-specific enhancer model weighting parameter (c = 0.8) (a 37% increase from LASTZ/LiftOver) (Fig. 4B). The identification rate also increases when gkm-align is combined with gkm-SVM models of similar cell types (with overlapping TFs), such as myeloid progenitor cells and thymus, each with peaks at 1663 (c = 0.5) and 1605 (c = 0.4) conserved enhancers. Similarly, for colon enhancers, LASTZ/LiftOver, unweighted gkm-align (c = 0), and cell-specific gkm-align (c = 0.75) each identify 1125, 1221, and 1352 conserved enhancers, which corresponds to a 7.9% and 20% increase over LASTZ/LiftOver for unweighted (c = 0) and weighted (c = 0.75) gkm-align, respectively (Supplementary Fig. 22). As expected, such improvement is smaller when gkm-align is combined with less relevant gkm-SVM models (e.g., trained on proximal DHS; Supplementary Fig. 23). Cell-specific weighting using enhancer-trained gkm-SVM models improves the identification rate of conserved enhancers for all pairs of 45 cell/tissues (Fig. 4C). At c = 0.9, we observe up to an 80% increase in conserved enhancer discovery over LASTZ/LiftOver for monocytes, with 16 cell/tissues with >20% increase for c = 0.7 and c = 0.8. A subset of cell/tissues exhibited limited improvement through gkm-SVM weighting (e.g., brain), but their identification rates remained higher than both unweighted gkm-align and LASTZ/LiftOver at c = 0.5. Across the 45 cell/tissues, weighted cell-specific gkm-align discovers several hundred novel enhancers in every tissue and 23,660 total novel conserved enhancers across all 45 cell/tissues (Fig. 4D). We confirmed that such improvement using gkm-SVM interspecies enhancer prediction is not a result of train–test set leakage from sequence conservation (Supplementary Fig. 24; section “Methods”; consistent with Supplementary Fig. 5C), and further confirmed that using gapped-kmers (l = 11, k = 7) instead of kmers (l = 6, k = 6) for computing gapped-kmer similarity and for gkm-SVM weighting leads to higher conserved enhancer mapping (Supplementary Fig. 25; consistent with Supplementary Fig. 5D). Despite the increased performance by gkm-align with or without gkm-SVM cell-specific weighting, the cell-specific pattern of enhancer conservation we showed in Fig. 2A is maintained (Supplementary Fig. 26). Lastly, it should be noted that, although cell-specific alignment discovers a higher number of conserved enhancers compared to unweighted gkm-align, there are also handful of enhancers identifiable by unweighted gkm-alignment but missed by cell-specific gkm-alignment (Supplementary Fig. 27). These tend to be conserved enhancers with high general sequence similarities but with more degenerate TFBS. For this reason, both cell-independent and cell-specific gkm-align should be used in parallel to maximize the span of identifiable conserved enhancers, and we provide genome-wide enhancer mappings for both methods.We reproduced the above results using an independent dataset66, where enhancers in diverse non-human mammals in four tissues (brain, muscle, liver, testis) were identified using H3K27ac and H3K4me1 histone marks. The four tissues are also included among our list of 45 cell/tissues, and we evaluated gkm-align by mapping our enhancers (distal cell-specific DHS) in these tissues to the histone-defined enhancers. Consistent with the above results (Fig. 4B, C), gkm-align maps higher numbers of conserved enhancers than LASTZ/LiftOver across four tissues, where gkm-align’s performance maximally improves when tissue-matched gkm-SVM enhancer models are incorporated (Supplementary Fig. 28). Using the same dataset, we also show that gkm-align also maps higher numbers of conserved human enhancers in rhesus macaque for all the four tissues (Supplementary Fig. 29). For this analysis, we used gkm-SVM models trained on our mouse enhancer sets to align human and macaque, which highlights the applicability of gkm-SVM models in different species due to conservation of enhancer vocabularies (Fig. 1E, F).Cell-specific information from gkm-SVM enhancer prediction models can also be combined with the gapped-kmer based sequence similarity metric (gkm-sim) to quantify functional conservation of orthologous enhancer pairs discovered through gkm-align. The strength (DHS signal) of mouse loci mapped from human enhancers tends to correlate with strength of the human enhancers, but a fraction of the identified orthologous pairs lacks such conservation of activity due to sequence divergence (Fig. 1G). To rank predictions, we explored different ways to predict functional conservation between orthologous human and mouse enhancer pairs identified by gkm-align (Fig. 4E). We observed that the DNase signal of a query human sequence (qDNase) was correlated with the DNase signal of the mapped mouse ortholog (mDNase) with median correlation of 0.29 (min: 0.13; max: 0.46), and the product of qDNase and gkm-similarity between query human enhancer (HE) and the mapped mouse ortholog (ME) leads to increased median correlation of 0.41 (min: 0.23; max: 0.60) (Fig. 4F). When this product is further multiplied by enhancer prediction of ME by human-trained gkm-SVM (\(0\le {P}_{H}({ME})\le 1\)), median correlation further improves to 0.46 (min: 0.25; max: 0.61). A mapped mouse ortholog has a high value for this triple product if it exhibits a gapped-kmer composition similar to that of a human enhancer ortholog with high DNase activity and contains conserved TFBS motifs. Training a regression model with these combinatorial features leads to median correlation of 0.55 (min: 0.32; max: 0.68) (Fig. 4F; section “Methods”). All of our human/mouse conserved enhancer mappings are ranked with these gapped-kmer-based conservation scores, which we believe will facilitate downstream experimental testing by providing confidence ranking scores for functional conservation in mice.Many of the novel enhancers identified by gkm-align are supported by additional evidence of conserved function. For example, gkm-align predicts a conserved enhancer in OTU Deubiquitinase 7A (OTUD7A) which is highly expressed in both human and mouse brains67 (Supplementary Fig. 30) and is associated with a wide range of neurological diseases as such schizophrenia and epilepsy68. OTUD7A knockout leads to morphological deformation of cortical neurons and frequent seizure-like events in mice69,70. We identified orthologous pairs of putative OTUD7A enhancers in intron 1 of human and mouse OTUD7A (Fig. 5A, B; enhancers: yellow-highlighted; hg38/chr15:31740273–31740573; mm10/chr7:63554547–63554847). Both human and mouse elements exhibit strong DNase I accessibility and H3K27ac histone modification across biological samples related to the nervous system (Fig. 5A, B). The two enhancers appear to have three clusters of conserved DNA base pairs with high local conservation in gapped-kmer composition (local conservation rate represented as the logo heights in Fig. 5C), and one of the clusters located at the centers of the human and mouse enhancers contains a NEUROG/ATOH1 binding motif (GCAGATGG), which is identified among the top brain enhancer kmer weights for both human and mouse as shown in Fig. 1C. This part of the enhancer has the largest delta-SVM25 score for gkm-SVM models trained on both human and mouse brains (visualized using shaded line plots in Fig. 5C; section “Methods”), indicating that it is a core TF-binding site conserved between human and mouse. Despite the clear conserved biochemical signatures and binding motif, this enhancer is predicted to be deleted in mouse by LASTZ/LiftOver (Supplementary Fig. 31).Fig. 5: Examples of novel enhancers from the expanded catalog of human/mouse orthologous enhancers.A Genome browser visualization of mouse and B human OTUD7A loci. Yellow boxes indicate the identified conserved enhancers. C Visualization of conserved binding sites in human OTUD7A intronic enhancer by sequence conservation with the orthologous mouse enhancer. Logo height represents the local gapped-kmer sequence conservation score with mouse, and line plots indicate TF-binding prediction by delta-SVM models trained on human or mouse brain enhancers. D Visualization of DNase accessibility of orthologous enhancers from the five distinct cell/tissues (green: human, blue: mouse; position relative to DHS center of human/mouse orthologous enhancers; signal: fold change from genomic average).To show further examples of the top conserved enhancers in Fig. 5D, we ranked enhancers that have the strongest combined (i) DNase I accessibility, (ii) gkm-similarity, and (iii) interspecies gkm-SVM prediction, using the regression score described in Fig. 4F. This combined regression score increases the likelihood of functional conservation as shown in Supplementary Fig. 18 for HBB CRISPRi. We ranked enhancers collected from five diverse human and mouse cell/tissues (brain, kidney, stomach, muscle, B-cell), and identified the top 1% conserved enhancers with highest regression score. Among these top orthologous enhancers, we selected a subset of enhancers in the vicinity of orthologous genes with cell/tissue-specific expression67 (Fig. 5D; Supplementary Fig. 30). These genes include KCND3 (brain; voltage-gated potassium channel subunit), PAX2 (kidney; TF associated with renal malformation71), GATA6 (stomach; definite endoderm TF16), MYOD1 (muscle; TF associated with myopathy72), and IKZF3 (B-cell; TF mutated in leukemia73,74). For each of the 45 cell/tissue pairs, we generated a table of ranked orthologous human-to-mouse enhancer pairs. In addition to providing an expanded catalog of conserved distal enhancer elements, the ranking can be used to prioritize elements for functional characterization. We uploaded our catalog of conserved enhancers to beerlab.org/gkmalign/.

Gapped-kmer sequence modeling robustly identifies regulatory vocabularies and distal enhancers conserved between evolutionarily distant mammals

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Hot Topics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Popular Articles

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis