Population variability in X-chromosome inactivation across 10 mammalian species

Reference aligned RNA-sequencing data enables scalable modeling of XCI ratiosWe use bulk RNA-sequencing (RNA-seq) data to measure the X-linked allelic expression of a sampled tissue by computing allele-specific expression ratios of heterozygous single nucleotide polymorphisms (SNPs). The parental proportion of X-linked allelic reads are expected to follow a binomial distribution dependent on the number of sampled reads and the XCI ratio of the tissue (see methods). The binomial distribution is an appropriate model when the parental identity of sequencing reads is known, which is not the case when aligning to a reference genome. A reference genome will contain SNPs from both parents, making the parental identity of aligned reads ambiguous and producing reference allelic expression ratios that represent expression of both parental X-alleles (Fig. 1B).Analogous to folding a book on its side closed, we fold the distribution of reference allelic-expression ratios around 0.50 so that values an equal amount above and below 0.5 are in the same bin. This allows us to aggregate data across both alleles and enable a robust estimate of the XCI ratio magnitude for the bulk RNA-seq sample (Fig. 1B). We fit folded-normal distributions to the reference allelic expression ratios of multiple SNPs per sample, which serves as a continuous approximation of the underlying sequencing depth-dependent mixture of folded-binomial distributions per SNP. The mean of the fitted distribution is the estimate of the XCI ratio for the sample (Fig. 1B). We also incorporate specific steps to address confounding factors that can impact measured X-linked allelic expression, namely excluding SNPs with persistent reference bias across samples and chromosomal bins that exhibit probable escape from XCI40,41 (Supplementary Figs. 1, 2, see methods). Of note, the rat population exhibits a large collection of reference biased SNPs when compared to the other species, likely due to the highly inbred nature of laboratory rat strains. We circumvent this expected issue in the mouse population by leveraging two studies42,43 that sampled Diversity Outbred (DO)44 mice, evidenced by the lack of reference-biased SNPs in the mouse population compared to the other species. Additionally, it is important to note our approach detects SNPs present only within RNA molecules, so we will miss variants in non-transcribed proximal regulatory elements, such as the well-described XCE-interval in mice33. With regards to escape from XCI, we find the strongest signals of escape near chromosomal ends across all species (Supplementary Fig. 2), suggesting escape within pseudo-autosomal regions is conserved across mammals40,45. Previously28, we validated our SNP filtering and XCI modeling approach using phased RNA-seq data (where haplotype information is known for each variant) from the EN-TEx consortium46, achieving nearly perfect agreement in XCI ratio estimates for samples with folded XCI ratios of 0.60 or higher, demonstrating the accuracy of our approach.By calling SNPs from RNA-seq reads and employing folded distributions to model reference-aligned allelic expression, we can estimate the magnitude of XCI in any female mammalian bulk RNA-seq sample. We source female annotated bulk RNA-seq samples of 9 non-human mammalian species from the SRA database (Fig. 1C), additionally including cross-tissue human samples from the GTEx dataset. As sex annotations were not available on SRA for the two DO mouse studies, we annotate the sex of the mouse samples by thresholding on the total number of reads aligned to the Y-chromosome (Supplementary Fig. 3). After processing, the number of samples with a minimum of 10 well-powered SNPs for estimating XCI ratios are 130 macaca (mean of 28 SNPs ±17 SD), 275 horse (mean of 54 SNPs ±36 SD), 291 dog (mean of 29 SNPs ±13 SD), 369 rat (mean of 28 SNPs ±16 SD), 388 mouse (mean of 87 SNPs ±46 SD), 399 goat (mean of 34 SNPs ±14 SD), 654 pig (mean of 50 SNPs ±28 SD), 784 sheep (mean of 81 SNPs ±43 SD), 1364 cow (mean of 33 SNPs ±19 SD), and 4877 human (mean of 56 SNPs ±23 SD, 314 total individuals) samples (Fig. 1C, Supplementary Fig. 1). Aggregating reference SNP allelic expression ratios for samples with similar estimated XCI ratios (0.05 bins) clearly reveals the expected haplotype expression distributions, demonstrating the applicability of folded models (Supplementary Fig. 4). Following sample-level XCI ratio modeling, we then generate population-level distributions by unfolding the distribution of folded XCI ratio estimates around 0.50, analogous to opening a closed book (Fig. 1D).As an additional control to ensure the allelic variability we report from X-linked SNPs is specific to XCI, we estimate autosomal allelic imbalances for all samples using the same pipeline and approach as for the X-chromosome analysis (Supplementary Fig. 5, see methods). Comparing allelic imbalances across the two autosomes closest in size to the X-chromosome reveals the vast majority of samples across all species are biallelically balanced for autosomal expression, as expected (Supplementary Fig. 5). Several species (pig, cow, goat, rat, sheep, and dog) exhibit small subsets of samples that are consistently imbalanced across the two autosomes and the X-chromosome, indicative of a global influence on allelic-expression independent of XCI (Supplementary Fig. 5). These samples with global allelic imbalances are excluded from all downstream analysis, ensuring the population distributions of XCI ratios reflect variability specific to XCI.Models of embryonic stochasticity explain adult population XCI variabilityAfter generating population distributions of XCI ratios for the 10 mammalian species, we next explore how well models of embryonic stochasticity explain the observed adult XCI ratio variability. The initial variability in XCI ratios among mammalian embryos is dependent on the number of cells present during XCI (Fig. 1A), where adult variability can be modeled to infer embryonic cell counts.When estimating embryonic cell counts from XCI variability in adult tissues, it is important to note that adult tissues represent only the embryonic lineage of the blastocyst, not the extra-embryonic lineages. This positions XCI variability of adult tissue samples as informative for the number of cells present within the last common lineage decision for all adult cells, i.e. the number of cells present within the epiblast of the mammalian blastocyst. If XCI occurs after epiblast specification, XCI ratio variability is determined by the number of epiblast cells at the time of XCI. If XCI occurs before epiblast specification, the variability is influenced by both the initial stochasticity of XCI and the stochasticity of cell sampling during epiblast lineage specification. Without cross-tissue sampling of both extra-embryonic and embryonic tissues, the temporal ordering of XCI among these lineage events cannot be resolved. Therefore, estimating cell counts based solely on XCI variability in adult tissues provides an estimate of the number of cells present in the epiblast at the time of XCI.Figure 2A presents the unfolded population distributions of XCI ratios in the 10 mammalian species we sampled, ranging from the least variable (macaca) to most variable (dog). We fit normal distributions as continuous approximations to the underlying binomial distribution that defines the relationship between cell counts and XCI ratio variability (Fig. 1A, D, see methods). We focus on the tails of the distributions for our model fitting (colored in portions of the distributions, unfolded estimates ≤0.40 and ≥0.60, Fig. 2A), for two reasons. Our analysis of autosomal allelic ratios (Supplementary Fig. 5) highlights that samples with no expected allelic imbalance produce folded skew estimates that vary between 0.5 and 0.6 and our previous work28 using phased data indicated model misspecification around the point of folding (0.50). Fitting to the tails of the empirical distribution is therefore a more accurate representation of variability specific to XCI.At a broad level, population XCI ratio variability varies substantially across the sampled mammalian species. Our estimates for the number of epiblast cells present at the time of XCI include 65 (macaca), 31 (rat), 23 (pig), 16 (goat), 15 (horse), 14 (sheep), 14 (cow), 13 (human), 12 (mouse), and 8 (dog) cells, with associated 95% confidence intervals presented in Fig. 2B. Importantly, species with similar numbers of detected SNPs per sample and total sample size exhibit variable cell number estimates, indicating it is unlikely our estimates are driven by technical effects across the species (Supplementary Fig. 6). We additionally down sample all species to the smallest sample size present (130 macaca samples) and achieve virtually identical cell number estimates, demonstrating variation in cell number estimates across species is not driven by sample size differences (Supplementary Fig. 6). The error between the empirical XCI ratio distributions and the normal fitted distributions is strikingly small, with a mean of 0.00588 sum-squared error (±0.00965 SD) across the species (Supplementary Fig. 6). This shows models of embryonic stochasticity can explain observed XCI ratio variability in adult populations exceptionally well.For the least and most variable species (macaca and dog), the estimated autosomal imbalances offer additional context for the reported XCI population variability. The reported X-linked variability in macaca is in excess to the reported autosomal allelic variability, which itself is highly consistent across species (Supplementary Fig. 5). This demonstrates the X-linked population variability for macaca, while strikingly small, still varies beyond the extremely consistent autosomal variability present across species and is specific to the X-chromosome, representing informative variability for estimating cell counts. On the other hand, the dog population is the only one that contains samples with strong allelic imbalances on only one autosome, where autosomal imbalances in all other species are global (Supplementary Fig. 5). This is suggestive of broader genomic incompatibilities within the dog population. The reported X-linked population variability in dog is likely a combination of XCI and broader allelic incompatibilities, positioning our estimate of 8 cells as a likely underestimate due to excess variability outside of XCI.Modeling XCI ratio variability across numerous species allows comparisons in light of evolution for determining generalizable or species-specific characteristics in XCI. Broadly, we demonstrate XCI ratios are variable in each species we assess, revealing variability in XCI ratios itself as a conserved characteristic of XCI. The exact variance in XCI ratios varies across the species, with differences in the timing of XCI and/or differences in cell counts for embryonic/extra-embryonic lineage specification as two putative explanations. We compare our estimated cell counts to the evolutionary relationships among the species we assess (Fig. 2B), suggesting that variability in these early embryonic events can be recent evolutionary adaptations. This is highlighted by the large differences in cell counts between macaca and humans, as well as between rats and mice. When viewed through the lens of cell divisions (log2 of the estimated cell counts, Fig. 2B), the differences in XCI ratio variability among the species can be explained by differences in a range of only 3 cell divisions, a narrow developmental window. This demonstrates even slight changes in the timing of XCI or cell counts for embryonic/extra-embryonic lineage specification across mammalian species can produce large differences in population XCI ratio variability, as explained through the inherent stochasticity of XCI.XCI ratios are not associated with X-linked heterozygosityAfter determining stochastic models can explain population XCI ratio variability across mammalian species, we turn to testing whether we can identify any genetic correlates with XCI ratios. Our approach leveraging natural genetic variation to quantify XCI ratios enables us to assess a large catalog of genetic variants for associations with XCI ratios across mammalian species (10,735 macaca SNPs, 12,024 rat SNPs, 28,339 mouse SNPS, 23,603 pig SNPs, 16,123 goat SNPs, 10,281 horse SNPs, 53,505 sheep SNPs, 18,509 cow SNPs, 16,168 human SNPs, and 10,050 dog SNPs). One putative genetic contribution to XCI ratio variability is allelic selection during development, where increased X-linked heterozygosity (i.e., genetic distance), is more likely to produce selective pressures between the two X-alleles. It follows that samples with higher X-linked heterozygosity would be expected to exhibit more extreme XCI ratios.We score X-linked heterozygosity per sample as the ratio of the detected SNPs within a sample to the number of unique SNPs identified across all samples, relative for each species (Fig. 3A). This quantification also serves as a measure of inbreeding, with decreased heterozygosity associated with a higher degree of inbreeding47. The trend in heterozygosity across species is as expected, with rats (likely laboratory strains) as the most inbred (Fig. 3A). Next, we examine the correlations between sample heterozygosity and the estimated XCI ratio, as well as the estimated allelic variability across SNPs in each sample (mean and standard deviation of the fitted folded-normal distribution per sample, Fig. 3B). Across all species, X-linked heterozygosity shows a near-zero correlation with the estimated XCI ratio, indicating a lack of association between X-linked genetic heterozygosity and XCI ratio variability (Fig. 3B). However, we observe moderate correlations between sample heterozygosity and the estimated variability in SNP allelic ratios in three species: rat (corr: 0.576), macaca (corr: 0.459), and cow (corr: 0.364), notably the most inbred species (Fig. 3A, Supplementary Fig. 7). The increased variability in allelic expression present only within the most inbred species could potentially reflect gene-specific regulatory events between parental haplotypes48 rather than a direct genetic effect on XCI.Low frequency variants exhibit moderate associations with XCI ratiosAfter investigating relationships between genetic variation and XCI ratios at a broad level across the whole X-chromosome, we next asked if individual variants might be associated with extreme XCI ratios. Variants that affect the expression and/or function of the genetic elements that control XCI can result in highly skewed XCI ratios, as documented in human studies15. This can also occur in other X-linked genes, if the resulting differential in gene activity exerts a selective pressure across the X-alleles, as documented in disease cases14,16. We test the association between XCI ratios and individual variants for all variants detected in each species with a minimum of 10 samples, quantified through the area-under-the-receiver-operating-curve statistic (AUROC). For each species, we rank the samples based on their estimated XCI ratio and score the placement of samples carrying a given variant within the ranked list (Fig. 4A). If all the samples with that variant are at the top of the ranked list, the XCI ratio can be said to have perfectly predicted the presence of that variant, quantified with an AUROC of exactly 1. An AUROC of 0.50 indicates the XCI ratio performs no better than random chance for predicting the presence of the variant.The distribution of AUROCs for each species show striking similarities to a null comparison (Fig. 4B, see methods), indicating a pervasive lack of association between XCI ratios and individual variants. However, a small subset of variants in each species exhibits moderate associations (AUROCs ≥0.75 and FDR-corrected p-value ≤0.05). By comparing each variant’s AUROC with its frequency in the species, we find that the variants with moderate associations occur at low frequencies within the sampled populations (Fig. 4C, Supplementary Fig. 8). We investigate whether this relationship is simply due to a lack in power with bootstrap simulations, demonstrating moderate AUROCs (≥0.75) are robust to their small sample sizes (Supplementary Fig. 8). Figure 4D displays these variants along with their gene annotations for each species. Notably, we observe no statistically significant variant-XCI ratio associations in the GTEx human population when performing either a tissue-specific or donor-specific analysis, as well as only considering the sample per donor with the highest sequencing depth (Supplementary Fig. 9). While the GTEx dataset is comprised of thousands of tissue samples, only 314 female individuals are present in our final dataset. We test the effect of a small population size by down sampling the cow data to 300 samples and scoring variant-XCI ratio associations (Supplementary Fig. 9). All of the cow variants that we originally identified as significantly associated with XCI ratios are no longer detected in the down sampled data, in line with the observation that variants with associations to XCI ratios occur at low frequencies within mammalian populations. Increased population sampling is likely required to identify further genetic associations with XCI ratios.Several genes with moderate AUROCs have prior evidence for escaping XCI in humans49, bringing into question their associations with extreme XCI ratios in our analysis. To explore further, we compare the estimated XCI ratios of samples to the allelic ratios of all detected variants for genes with at least one variant significantly associated with XCI ratios. We report several examples across species where the allelic expression of individual variants from these putative escape genes does in fact exhibit the expected balanced biallelic expression of escape from XCI while also being enriched in samples with increased XCI ratios (Supplementary Fig. 9). A gene that escapes XCI will be biallelically expressed; this suggests the variant-specific association we detect within these XCI escapers likely reflects a haplotype-effect, where the variant is linked to a haplotype influencing XCI ratios, rather than an effect from the gene/variant itself. Further analysis with phased data to assess potential haplotype effects may help identify genetic associations with XCI ratios. Overall, our assessments of chromosome-wide genetic variability and individual variants do not reveal genetic associations robust enough to explain population XCI ratio variability across all 10 mammalian species.Putative mouse XCE-Xist-haplotypes exhibit highly variable XCI ratiosOne of the most well-documented instances of a genetic association with XCI ratios is the XCE-haplotypes in laboratory mouse strains, where a preferential ordering of allelic inactivation exists across haplotypes33. The DO mice we utilize are expected to be genetically diverse combinations of various lab strains and it is highly likely a mix of XCE-haplotypes are present within this population and may have an impact on XCI ratios. Such a haplotype-specific effect would be missed in our previous AUROC variant-specific analysis of XCI ratios. Since we only sample variants present within RNA molecules and the XCE-interval is a proximal non-transcribed regulatory element of Xist33, we reason variants present within Xist are likely linked to XCE-haplotypes and may be informative for identifying putative XCE-Xist-haplotypes. We identify 4 putative XCE-Xist-haplotypes as determined by groups of samples with shared Xist variants (Supplementary Fig. 10). As a general observation across haplotypes and the two studies we sample from, XCI ratios of samples with the same haplotype are highly variable (Supplementary Fig. 10), suggesting XCI ratios are not definitively determined by Xist genotypes within the DO mice population. The haplotype with the seemingly largest effect, evidenced by 2 samples with highly skewed XCI ratios (0.799 and 0.782) in the one study that collected striatal tissue, conversely exhibits XCI ratios ranging from balanced to moderately skewed in the second study, which collected pancreatic tissue (Supplementary Fig. 10). While this may be indicative of a tissue-specific effect of a particular XCE-Xist-haplotype, far greater sample sizes with higher genetic resolution to confirm haplotypes are needed for validation. In general, the variability in XCI ratios within putative XCE-Xist-haplotypes suggests non-genetic contributions to XCI variability of DO mice.

Hot Topics

Related Articles