Probabilistic inference of epigenetic age acceleration from cellular dynamics

Statics and reproducibilitySample sizes for model inference and validation were determined by the cohort sizes. Randomization and blinding were not applicable during data collection because the study used observational data from population-based cohorts. No participants were excluded in model training and validation. For each continuous phenotype association study, outlier values that were more than 3.5 × s.d. away from the mean were removed.Randomization and blinding were not applicable during data collection because the study used observational data from population-based cohorts. Statistical tests for model coefficients assumed a normal distribution but this was not formally tested.Generation ScotlandOverviewGS is a family-based cohort consisting of individuals, aged 18–99 years, living across Scotland. Recruitment took place between 2006 and 2011. The cohort encompasses 5,573 families with a median family size of three (interquartile range of 2–5 members; excluding 1,400 singletons without any relatives in the study). The following analyses were conducted using different subsets of Generation Scotland (n = 15,900): GSset1 (n = 4,450) the batch-corrected subset GSset2, (n = 2,578, unrelated to each other and to GSset1 and processed in separate experimental batches) and the batch-corrected subset GSset3 (n = 8,872, processed in separate experimental batches).Full details on the cohort and baseline data collection have been described previously36,37.DNA methylationGenome-wide DNA methylation was measured from blood samples using the Illumina Infinium HumanMethylationEPIC BeadChip at >850,000 CpG sites. The methylation profiling was carried out in three sets, here referred to as set 1, set 2 and set 3. Set 1 consisted of 4,450 unrelated individuals, who were also unrelated to individuals in set 2. Set 2 consisted of 2,578 individuals who were genetically unrelated to each other at a relatedness threshold <0.025. Set 3 consists of 8,872 individuals with some relatedness to sets 1 and 2 (see GWAS section for GRM approach). Poor performing probes, X/Y chromosome probes and participants with unreliable self-report data or potential XXY genotype were excluded. Full details of DNA methylation quality control steps are detailed under Supplementary Information Methods 6.1.GenotypingGS samples were genotyped using the Illumina Human OmniExpressExome-8 v.1.0 and 8 v.1.2 BeadChips and processed using the Illumina Genome Studio software v.2011 (Illumina). Quality control steps are outlined in full under Supplementary Information. Duplicate samples, samples with genotype call rate < 0.98 or outlier values on principal-component analysis of genotype data were excluded. SNPs with a call rate < 0.98, minor allele frequency ≤ 0.01 and Hardy–Weinberg equilibrium test with P ≤ 1 × 10−6 were also removed. Genotype data were imputed using the Haplotype Research Consortium dataset and ~24 million variants were available for analyses38,39. There were 15,871 individuals with genotype and methylation data. Full details of genotyping quality control steps are detailed in Supplementary Information Methods 6.2.Phenotyping and health record linkageGS participants self-reported health and lifestyle data at the study baseline, including lifetime and family history of approximately 20 disease states. Over 98% of GS participants consented to allow access to electronic health records via primary and secondary care records (Readv2 and ICD codes). Data are available prospectively from the time of blood draw, yielding up to 15 years of linkage. Information on mortality and cause of death are updated via linkage to the National Health Service Central Register, which is provided by the National Records of Scotland (data correct as of March 2022).EthicsAll components of GS received ethical approval from the NHS Tayside Committee on Medical Research Ethics (REC reference no. 05/S1401/89). GS has also been granted Research Tissue Bank status by the East of Scotland Research Ethics Service (REC reference no. 20-ES-0021), providing generic ethical approval for a wide range of uses within medical research.Associations of other epigenetic clock CpGs with age and smokingCpG sites were considered naCpGs linear regressions of the form meth ~ age showed a coefficient of determination R2 < 0.1. To compute associations between methylation levels and smoking, we created the ‘weighted smoking’ variable, which combined information on both how much a participant smoked at any point of in its life, \({norm\_smoke}=(1+\mathrm {{pac}{ks}/{year}})\) and the current smoking status, curr_smoke, categorically defined by: (1) currently smoking; (2) stopped within 12 months; (3) stopped more than 12 months ago; or (4) never smoked. More precisely we assumed that the effect of smoking is proportional to norm_smoke and fades exponentially with their smoking status, that is \({weighted\_smoke}={norm\_smoke}/\exp ({curr\_smoke})\). In binary definitions of smoking, considering the non-normal distribution of weighted_smoke, we used a cutoff of 0.25 in weighted_smoke levels, which corresponds to an 85th percentile cutoff from the distribution of weighted_smoke values, to define heavy smokers and nonheavy smokers. When calculating smoking associations for the figures, we used this binary definition of smoking and reported the coefficient of smoking from linear regressions for each CpG of the form: \({meth} \sim {weighted\_smoke}+{age}+{sex}\).Associations of epigenetic clock CpGs with age and alcohol intakeTo assess associations between methylation levels and alcohol use, we transformed the weekly alcohol intake by taking the log of the value plus one. This process allows to recover a normal distribution, given the exponential nature of alcohol use. In binary definitions of alcohol intake, we used a cutoff of an informed mean of 2.3. This number is the mean alcohol intake of participants that reported intake of more than 1 unit per week. We discarded the participants reporting less than 1 unit per week to compute the mean.Associations between naCpGs included in Horvath and tissue of originTo assess the associations between methylation levels and tissue of origin in CpG sites included in Horvath’s clock, we analyzed 3,717 samples from six tissues (cerebellum, breast, liver, kidney, whole blood and saliva) extracted from the EWAS Datahub18.First, we used the combined dataset to compute linear regressions of the form meth ~ age for all CpGs included in Horvath’s epigenetic clock. We found that 80% (n = 288) of all sites included in Horvath (n = 354) showed an r2 < 0.1 and were considered naCpGs.We then checked for correlations between methylation levels and tissue of origin using a one-way analysis of variance. All naCpG sites showed differential methylation levels across tissues (P < 1 × 10−5).Bootstrapping linear regression modelsTo show the decline in associations with increasing training size cohorts, we selected random training cohorts with set sizes and test sets (n = 890 or 20% of GSset1) from GSset1. We trained LASSO models with fivefold cross validation to allow optimization of the selected CpG sites and regularization levels.We then computed acceleration as the residual from the model prediction and chronological age and inferred its association with weight_smoke on the test set.We bootstrapped linear regression models with increasing proportions of smokers (categorically defined as weight_smoke > 0.25) similarly. Here training sets had a fixed size (n = 700) limited by the number of smokers.Variance in methylation levels across age binsTo evaluate the increase in variance of methylation levels with age in GSset1, we first selected all sites for which methylation levels correlated with age (R2 > 0.1). Then we split all participants between 18 and 80 years across five equally spaced age bins (each bin showed an age range of 12.4 years). To promote a fair comparison, and avoid under or over estimation, of variance across bins we randomly selected 294 participants from each bin, the lowest number of participants in any bin, which ensured equal representation of samples in bins.Spearman selection of CpG sitesTo avoid numerical errors during model fitting in GSset1 we discarded CpG sites with NaN methylation values (n = 6) and replaced methylation levels of 0 and 1 by 0.0001 and 0.9999, respectively. Next, we fitted linear regression.We fitted linear regression models of the form meth ~ age on the remaining sites (n = 773,854). To maximize the Spearman correlation, ρ, between methylation changes and chronological age in sites used for model training, and to avoid the presence of naCpG sites in our model, we filtered all CpG sites showing a low coefficient of determination, ρ2 < 0.2. The remaining sites (n = 1,870) were taken forward for model comparison.Biological modeling of methylation dynamics at a single CpG site levelWe now present an overview of the key results necessary to understand the derivation and interpretation of the proposed mathematical model of methylation dynamics. An exhaustive description and precise mathematical derivations can be found in the Supplementary Methods.To model the evolution of methylation dynamics with time, in a single CpG site, we considered a minimal model of transitions between two states. The proposed model directly relates to the early work of Markov in 1905 and was used for the first time to model the dynamics of methylation16 and more recently in the context of aging17. In this model, cells can be in either an unmethylated (U) or methylated (M) state and can transition from one state to the other at rates VU and VM (Fig. 2b).The mean evolution of this system as a function of time, \(t\), is given by$$\mu(t)=\eta +{e}^{-\omega t}(p-\eta ),$$where \(\omega ={\nu }_{\mathrm {U}}+{\nu }_{\mathrm {M}}\) corresponds to the total rate of transitions, \(p\) the proportion of methylation cells at initial time t = 0 and \(\eta ={\nu }_{U}/\omega\) to the proportion transitions associated with a gain in methylation (Fig. 1a). Notice that the definition of speed of epigenetic aging at a single CpG site, \(\omega\), emerges naturally, while the directionality of these changes is independent from the speed, given by the initial state of the system \(p\) and the final state \(\eta\).The evolution of the variance of this system can be similarly described as a function of time,$$\begin{array}{l}{\sigma }^{2}(t)=\frac{{\eta }_{U}{\eta }_{M}}{S}+{e}^{-\omega t}\frac{{{\eta }_{U}}^{2}(1-p)+p{{\eta }_{M}}^{2}-{\eta }_{U}{\eta }_{M}}{S}\\\qquad\qquad+\,{e}^{-2\omega t}\left(\frac{c}{{S}^{2}}-\frac{{{\eta }_{U}}^{2}(1-p)+p{{\eta }_{M}}^{2}}{S}\right),\end{array}$$where S corresponds to the system size and c is the initial variance at the cell level.Although the distribution for the evolution in time of this system is not analytically solvable, it is well approximated by a Beta distribution conditional on the mean and variance derived above. We can therefore model the probability of seeing a methylation level mi in site i in an individual aged t by$$P({m}_{i}{\rm{|}}t)={Bet}{a}_{p.d.f.}({m}_{i},\mu_i(t),\,{\sigma_i }^{2}(t)),$$where Beta p.d.f. denotes the probability density function of a Beta distribution and the mean and variance are adapted to each site. We use this probability definition to find the optimal parameters for each site using all cohort methylation observations. That is, the parameters that maximize$$P({M}_{i})=\prod _{({{m}_{i}}^{j},{t}^{j})\in {M}_{i}}{Bet}{a}_{p.d.f.}({{m}^{j}}_{i},\mu_i({t}^{j}),\,{\sigma_i }^{2}({t}^{j})),$$where \({M}_{i}=({{m}_{i}}^{j},{t}^{j})\) corresponds to the pair of methylation value and age of each individual in the cohort. We approximated the full posterior distribution, using a Markov chain Monte Carlo algorithm and extracted the maximum a posteriori (MAP) values of the parameters for each site using the PyMC Python package.To ease interpretation of our results, we take the log2 transformation of the original acceleration parameter, centering the cohort average at 0, and report the MAP estimates for both acceleration and bias.Model comparisonSimilarly, we fitted a probabilistic linear model of the form$$P({M}_{i})=\prod _{({{m}_{i}}^{j},{t}^{j})\in {M}_{i}}{Nor}{m}_{p.d.f.}({m}_{i},{a_i t}+b_i,{c_i t}+d_i),$$and inferred the posterior distribution and performed a model comparison using the PyMC Python package40. Model comparison approximated the expected log-predictive density (ELPD) of both models on each site using an approximation of leave-one-out cross validation (LOO-CV) based on the Pareto-smoothed importance sampling. The higher ELPD value on each site highlights the favored model to explain the observed dynamics. As the evolution of methylation across CpG sites is assumed to be independent of each other, we compute the overall ELPD for either model by summing the reported value on all fitted sites.Saturation filteringFurther quality control measures were taken to ensure that only sites that capture the full expected dynamics of methylation changes with time are retained for further modeling (Extended Data Fig. 2a).We observed that the methylation dynamics are constrained between 0 and 1. This reduces the possibility to observe deviations from the mean in sites approaching these boundaries. We therefore dropped sites for which the 95% CI predicted by our model reached a threshold of either 0.005 and 0.95, at either birth or 90 years. A total of 204 sites were observed to display this phenomenon.Further, we noticed that the above-mentioned phenomenon might occur at different thresholds, due to different batch-correction processes; however, in these sites, we should see a plateau in the evolution of the methylation dynamics. We therefore computed the derivative of the mean methylation dynamics for each site, \(m{\prime} (t)=\omega {e}^{-\omega t}(\eta -p)\), and valued it at t = 90. This value is a measure of our distance to steady state at old age, \({ds}=m{\prime} (90)\), at old age. We then filtered all sites with low distances to steady state, ds < 0.001 (n = 654).Overall, a total of 1,160 sites were taken forward for acceleration and bias modeling.Acceleration and bias modelFirst, we define the probability of observing a methylation pattern in an individual by comparing its methylation levels to those obtained from cohort fitting on each CpG site. That is, if an individual of age t has methylation levels \(\{{m}_{i}{\}}_{i\in I}\) on a set of CpG sites \(I\), we define its probability of observation as$$P(\{{m}_{i}\}_{i\in I},t)={\prod }_{i\in I}\quad{Bet}{a}_{p.d.f.}({m}_{i},\mu_i(t),\,{\sigma_i }^{2}(t)),$$where \(m(t)\) and \({\sigma }^{2}(t)\) are defined as above, using the MAP values for all site parameters in the model obtained from the cohort fitting. We then consider two extra parameters α and β in our model. Mathematically, we multiply the speed of reactions VU and VM uniformly across all sites by a factor α. Notice that this translates in a proportional increase of the total speed of reaction, which becomes \(\alpha {\omega }_{i}\) for each CpG site i. Parameter α therefore represents a uniform increase in the average speed of epigenetic aging across all measured sites. Analogously, parameter β modifies uniformly all sites shifting the mean intercept, but not disturbing the expected speed of change. These two parameters modify the mean evolution of methylation on for each site as follows:$$\mu_i(t,\alpha ,\beta )=\eta_i +{e}^{-\alpha \omega_i t}(p_i-\eta_i )+\beta .$$Details on how these parameters modify the evolution of the variance can be found in the Supplementary Methods Section 4. We can then compute the probability of observing an individual conditional on a given acceleration and bias$$P(\{{m}_{i}\}_{i\in I},t{\rm{|}}\alpha ,\beta )={\prod }_{i\in I}\quad{Bet}{a}_{p.d.f.}({m}_{i},\mu_i(t,\alpha ,\beta ),\,{\sigma_i }^{2}(t,\alpha ,\beta )).$$We then infer the posterior distribution of parameters α and β and compute the MAP values that maximize the probability of observing the methylation levels in an individual across sites using the PyMC Python package.DownsamplingTo determine the optimal amount of CpG sites included for the inference of the acceleration and bias of each individual we conducted a downsampling experiment. We inferred acceleration and bias using increasing numbers of CpG sites, ordered decreasingly according to their absolute correlation with age. We then computed the absolute difference between the inferred values for each fit to infer the stability of our predictions as a function of the sites included in the model. We found that there are no benefits in including more than 250 sites (Extended Data Fig. 1b).Order of magnitude difference analysisTo quantify the differences in the predicted age accelerations between a model with masked data and a model trained with all data, we computed the differences between the order of magnitude of the absolute change between model predictions and the order of magnitude of accelerations predicted by the unmasked model,$${OMD}=\log \left|{ac}{c}_{{mask}}-{ac}{c}_{{full}}\right|-\log \left|{ac}{c}_{{full}}\right|.$$Negative values of the order of magnitude suggest that the differences between the mask model and the full model are negligible.Global hypo–hyper methylation testWe tested the robustness of our accelerated person model by transforming the training GSset1 cohort, by applying a global hypo or hyper methylation to all the data points. Then we inferred the acceleration and bias values on this transformed dataset using the site parameters inferred from the not transformed dataset (Extended Data Fig. 1d). As a benchmark, we also predicted the age acceleration using the Horvath epigenetic clock which shows little robustness against these type of methylation transformations.Site batch correctionWhen applying our model to external datasets, we do not retrain the biological model on the new data. Instead, we applied an offset to the evolution of the mean \(m(t)\), for each site, inferred using our reference dataset GSset1. Mathematically, for each site we find the offset, \({o}_{i}\), which maximizes the probability of observing$$P({M}_{i})={\prod }_{({{m}_{i}}^{j},{t}^{j})\in {M}_{i}}\quad{Bet}{a}_{p.d.f.}(m,\mu_i(t)+{o}_{i},\,{\sigma_i }^{2}(t)).$$The full mathematical description of our algorithm can be found in Supplementary Methods section 5. We benchmarked our results on the Hannum dataset by comparing a fully retrained model on this dataset and the model trained on GSset1 applied both with and without batch correction. We found that offset does not have a clear directionality across sites and should therefore be applied separately for each CpG site considered in the model.Down syndrome analysisWe used the control group in the Down syndrome dataset24 to fit our batch-correction algorithm and inferred acceleration and bias parameters for the control (n = 58) and disease (n = 29) groups. We then fitted a logistic regression model of the form \({disease} \sim {scale}({acceleration})\).Association analysisWe utilized the linked health data records of GSset1, GSset2 and GSset3 (n = 15,900) to perform association studies grouped into three categories associated with the nature of the studied traits: continuous, disease and mortality.For each continuous phenotype, outlier values that were more than 3.5 × s.d. away from the mean were removed before analysis. We also applied a log transformation to the body mass index to normalize the data. Finally, we used log transformations on alcohol consumption to reduce skewness in their distributions. This was carried out by applying a log(units + 1) transformation. We quantified the continuous trait of smoking using our developed weighted_smoke parameter (described above).Statistical analysisLinear regression models of the form \({scale}({phenotype}) \sim\) \({scale}(\alpha )+{scale}(\beta )+{age}+{sex}\) were used to examine the association between continuous traits and the acceleration and bias measurement. Linear regression models of the form \({scale}({phenotype}) \sim\) \({scale}({clock})+{age}+{sex}\) were used to examine the association between continuous traits and the four epigenetic clocks. Scaling is performed to standardize the data to a mean of zero and a variance of one. Logistic regression was used to test the association between categorical disease phenotypes and acceleration and bias. We fitted a Cox proportional hazards regression to examine whether our measures of epigenetic biological aging were associated with incidences of all-cause mortality. All of the results were corrected using the FDR method.GWAS and functional analysisGenotype–phenotype association analyses were performed using a linear mixed model GWAS implemented in fastGWA GCTA41. A sparse GRM approach is used to adjust for sample relatedness. Overall, 15,871 overlapping individuals with nonmissing genotype and phenotype data were included. Variants with MAF < 0.01 or missingness rate > 0.1 were excluded from the analysis. Preliminary functional annotation was performed using FUMA42 and COJO41. Genomic risk loci were defined around significant variants (P < 5 × 10−8) and included all variants correlated (R2 > 0.1) with the lead variant. We used LDtrait43 to search for phenotypes associated with SNPs in linkage disequilibrium with the four lead SNPs (R2 > 0.1) and within 1 Mb. LDtrait reports association data from the GWAS catalog25. All coordinates in this study were based on human reference genome assembly GRCh37/hg19 (see website in Framework Implementation). Gene annotations were based on GENCODE annotation release 39 (see website listed in ‘Framework implementation’).Framework implementationThe inference of the posterior distribution of model parameters was implemented in Python version 3.9 with dependencies on PyMC 5.0.2 41, Numpy 1.24.1 45, Anndata 0.8.0 46 and Pandas version 1.5.3 47. Cox proportional hazards regression was done using the survival package version 3.4.0 48, while linear and logistic regression were done using the stats base package under R base version 4.2.2. GWAS summary statistics was generated using GCTA version 1.93.2. Functional analysis of the GWAS results was done using FUMA version 1.5.1. Human reference genome assembly used for GWAS: http://www.ncbi.nlm.nih.gov/assembly/2758/. Gencode annotation researle 39: https://www.gencodegenes.org/human/release_39.html.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles