Exploring the correlation between DNA methylation and biological age using an interpretable machine learning framework

DNA methylation dataThe DNA methylation data for this study contained a total of 10,296 samples, of which 7,833 were healthy samples, and 2,463 were diseased samples; of these samples, 8,233 samples contained biological age data and were defined as training samples for constructing machine learning models. Each training sample contains 50,000 methylation site data, 1 item of gender data, and 1 item of biological age data, and DNA methylation data is measured by the methylation 450 K platform. Of the 8233 samples used to construct the machine learning model, 6266 were healthy samples, and 1967 were diseased samples; in terms of gender distribution, 4409 (53.55%) samples were male samples, and 3824 (46.45%) samples were female samples. Biological age had a minimum value of 0 years, a maximum value of 114 years, a median value of 56 years, a mean value of 53.6597 years, and a standard deviation of 25.8249 years.Data preprocessing resultsFirst, the statistics showed that 23,688,484 methylation sites data were missing, accounting for 5.7544% of the total data volume (total data volume was 411658233), and all the vacant data were filled with 0. Secondly, we performed data coding on the gender data: 4409 male category data were coded as 0, and 3824 female category data were coded as 1. Subsequently, we performed a normality test on the 50,000 DNA methylation site data. The results showed that none of the 50,000 DNA methylation site data fulfilled the normal distribution (all P-values were less than 1%). Finally, we plotted the distribution histograms and kernel density estimates for the sex and biological age data. See Fig. 1a and b.Fig. 1(a) Density estimate of biological age distribution; (b) Histogram of gender distribution; (c) Regression plots, residual plots for training and testing of feature selection models XGBoost, LightGBM, and CatBoost models; (d) Plot of importance of top 20 features of XGBoost model; (e) Plot of importance of top 20 features of LightGBM model; (f) CatBoost model top 20 feature importance plot; (g) Best model (LithtGBM) top 20 feature SHAP values.Feature selection resultsThe XGBoost, LightGBM, and CatBoost models were used for the wraparound strategy for feature selection, and 50,000 methylation sites and gender data were used as feature data. 80% (6586 samples) of the data were used as training sets, and 20% (1647 samples) were used as test sets. The model training MAE and testing MAE are shown in Table 1.Table 1 Model training MAE and test MAE.Table 1 shows that the XGBoost model has the lowest training MAE (0.0539), the LightGBM model has the highest training MAE(0.6815), the LightGBM model has the lowest testing MAE(3.3041), and the CatBoost model has the highest testing MAE(4.0189). When analyzing the test MAE, LightGBM obtained the lowest MAE.The regression and residual plots of the predicted biological age versus actual age are shown in Fig. 1c. As can be seen in Fig. 1c, in training set, the XGBoost model predicted results with the actual biological age R2 value of 0.9997, which is the best performance, and the LightGBM model predicted results with the actual biological age R2 of 0.9978, which is the worst performance; in the test dataset, the LightGBM model predicted results with the actual results R2 of 0.9543, while the CatBoost model predicted results with actual biological age R2 of 0.9327, which is relatively poor performance. Regarding the prediction results’ stability, the XGBoost model has the smallest average residuals of 2.8035e-06 in the training dataset. In contrast, the LightGBM model has the most significant average residuals of 0.0037. In the test dataset, the XGBoost model had the smallest average residual of -0.0193, while the LightGBM model had the most significant average residual of 0.1005.Considering the model training MAE, testing MAE, training dataset prediction result vs. real biological age R2, testing dataset prediction result vs. real biological age R2, training dataset prediction result vs. actual biological age average residuals, and testing dataset prediction result vs. actual biological age average residuals together, the top 20 methylation sites screened by the LightGBM model were selected as the feature data for constructing the biological age prediction model. Figure 1d and e, and 1f show the top 20 methylated sites output by the XGBoost, LightGBM, and CatBoost models based on feature importance, respectively, and Fig. 1g shows the top 20 methylated sites output by the LightGBM model based on feature importance SHAP.Statistical analysis resultsThe 20 methylation data obtained by feature selection were first tested for normality using the Scipy library regular test21 function, and the test results are shown in Table 1 of the supplementary document. Figure s1 shows the histogram of data distribution and kernel density estimation for the 20 methylation sites. Based on the analysis of normality tests, histograms, and kernel density estimation plots, not all methylation data conformed to a normal distribution. Therefore, the Spearman correlation coefficient was used to assess the correlation between the 20 methylation data and biological age. The results of the correlation analysis are shown in Table 2, and Fig. 2a demonstrates the heat map of the Spearman correlation coefficient.Table 2 20 Spearman correlation coefficients of methylation data with biological age.Table 2; Fig. 2a correlate the 20 methylation sites and biological age. We selected the features with an absolute correlation coefficient value greater than 0.45 as the final features for constructing the machine learning model, and 15 groups of methylation data were obtained. To understand the distribution of individual methylation data as well as to detect further and process abnormal data to improve data quality, we calculated the descriptive statistical features (mean, standard deviation, minimum, first quartile, median, third quartile, and maximum) of the 15 groups of methylation data (Table 2 of the supplementary document) and plotted box plots (Fig. 2b) and scatter plots (Figure s2) of the 15 groups of methylation data.Fig. 2(a) Heat map of Spearman’s correlation coefficient; (b) Box plot of 15 methylation data; (c) Box plot of methylated data after data normalization and scaling; (d) Box plots of biological age in healthy and diseased groups; (e) Box plots of biological age in the diseased group.From Table 2 of the supplementary document, Figure s2, and Fig. 2b, it can be seen that the distribution consistency of the 15 groups of methylation data is relatively poor, and there are outliers and outliers in most methylation data. Further observation of the box plots shows that there are generally outliers in the methylation data of the 15 groups, a result that matches the conclusions of the descriptive statistical analysis and the regression scatterplot; meanwhile, the box plots show that there is a significant difference in the distribution range of the methylation data of the 15 groups. To ensure the data quality and accelerate the convergence of the machine learning model, we first replaced the outliers and anomalies with the median according to the regression scatter plot (Fig. s1 of the supplementary document) and the box plot (Fig. 2b), and then standardized and normalized the data by using the StandardScaler22 and MinMaxScaler23 functions of the Sklearn library to scale the data between − 1 and + 1. The feature scatter plots and box plots after anomaly data replacement, normalization, and scaling are shown in Figure s3 and Fig. 2c.The results of the Mann-Whitney U test (statistic value: 3327968.5, P value: 0.00) and the Kolmogorov-Smirnov test (statistic value: 0.325113, P value: 0.00) demonstrated a statistically significant discrepancy in biological age between the healthy and diseased groups. Figure 2b illustrates the distribution of biological age between the healthy and diseased groups. The Kruskal-Wallis H-test (statistic: 1751.851222, p-value: 0.00) demonstrated that there was also a statistically significant difference in biological age between the various disease groups. Figure 2e depicts the distribution of biological age among the diseases.Machine learning model training resultsFollowing the processing of anomalous data, which included normalization and scaling, we selected XGBoost, LightGBM, CatBoost models, and deep neural networks as machine learning models for training purposes. The models mentioned above were trained using 10-fold cross-validation. XGBoost, LightGBM, and CatBoost models are all models built up based on tree structure, and the feature importance can be output after training to understand the degree of contribution of 15 sets of methylation data to biological age prediction24,25,26; the methylation sites that contribute more to biological age prediction can be used as a reference for the biological and medical neighborhoods to study DNA methylation and cellular and tissue aging; deep neural networks have a solid nonlinear fitting ability, which can fully explore the nonlinear relationship between the data27. The fully connected neural network was constructed using TensorFlow, using MAE as the loss function, initialized learning rate of 0.001, early stop and checkpoint techniques were applied to monitor the loss function and MAE changes in real-time during the training process, and the performance scheduling strategy was used to adjust the learning rate automatically. Different training rounds using the above parameters are shown in Figure s4. Testing 2000, 1000, and 800 rounds of training found that the MAE curves all showed a rising trend in the late stage of training, while when 300 rounds of training were chosen, the MAE curves did not rise in the late stage of training. Therefore, the number of training rounds for the deep neural network was finally determined to be 300.Table 3 shows the results of machine learning model training using 15 sets of methylation data and 1 item of gender data, and the deep neural network training MAE curve is shown in Fig. 3a.Table 3 Machine learning model training results.From Table 3, it can be seen that the XGBoost model has the lowest training MAE in the training dataset (0.0189), and the deep neural network has the highest training MAE (7.2615); in the testing dataset, the XGBoost model has the lowest testing MAE (3.6412), and the deep neural network has the highest testing MAE (5.4531). Therefore, the XGBoost model performs the best biological age prediction performance, and the deep learning model could perform better.The scatterplots and residual plots of the regression of the model prediction results with the actual biological age are shown in Fig. 3b. As can be seen from Fig. 3b: in the training dataset, the CatBoost model prediction results have a high degree of agreement with biological age (R2:0.9974), the profound neural network prediction results have a significant difference with biological age (R2:0.9117), and the average residuals for the CatBoost model are the lowest (-0.0001), and the average residuals for the deep neural network are the highest (0.2167). In the test dataset, the XGBoost model predicted relatively good agreement with biological age (R2: 0.9441), the deep neural network predicted poor agreement with biological age (R2:0.9124), the XGBoost model had the lowest average residual (0.0939), and the deep neural network had the highest average residual (0.3777). Considering the R2 values and average residuals of the model predictions with biological age, the XGBoost model showed high accuracy compared to the relatively poor performance of the deep neural network. Ultimately, the XGBoost model was parameterized to identify the optimal hyperparameters, utilizing a grid search with 10-fold cross-validation. After the grid search, the XGBoost model’s training MAE was 0.7689, the validation MAE was 3.6953, and the test MAE was 3.6089.The SHAP value integrates the effect of a given biological variable by itself and the impact of the variable’s interaction with other biological parameters. For a given individual (local interpretation), the sum of the SHAP values of all model variables represents the individual’s deviation from the mean of the actual age predicted by the entire data set. Finally, we used SHAP to calculate the SHAP values for the 15 methylation data sets in the XGBoost model. We plotted SHAP summary and bar stacked plots to visualize how much the 15 methylation data sets contributed to the biological age prediction, as in Fig. 3c. To ascertain the biological significance of the genes associated with the 15 groups of methylation sites, we initially queried the associated genes by methylation site CG number in the EWAS database (Table 3 of the Supplementary file). Following the database query, the final 12 groups of genes were subjected to a GO enrichment analysis (Fig. 3d, Figure s5, Figure s6) and a KEGG analysis (Fig. 3e, Figure s7) to gain insight into the cellular components, biological processes, molecular functions and pathways of the related genes.KEGG and GO analyses were implemented using clusterProfiler v4.10.1 in an R4.3.3 environment, and the following 12 proteins were ultimately used for KEGG and GO analyses: ARHGEF33、ABHD14A-ACY1、CCND3、AMER3、PCBP4、ELOVL2、ZNF518B、FHL2、ABHD14B、TRIM59、ABHD14A、ACKR1. A KEGG analysis revealed that the genes associated with 15 groups of methylation sites were primarily involved in arginine biosynthesis, fatty acid elongation, unsaturated fatty acid biosynthesis, 2-Oxocarboxylic acid metabolism, malaria, fatty acid metabolism, amino acid biosynthesis, and the p53 signaling pathway, alpha − linolenic acid metabolic process. A Gene Ontology (GO) enrichment analysis revealed that two of the 15 related genes were involved in nucleoside diphosphate metabolism, ribonucleoside bisphosphate metabolism, purine nucleoside diphosphate metabolism, alpha-linolenic acid metabolism, and ventricular cardiomyocyte differentiation. The genes were found to be involved in several processes, including cardiomyocyte development, fatty acid elongation, very long-chain fatty acid biosynthesis, ventricular cardiomyocyte differentiation, positive regulation of the activity of the cell-cycle-dependent protein serine/threonine kinases, and cardiac trabeculae formation. Each of the genes was found to be involved in one of these processes.Fig. 3(a) Deep neural network training MAE plot; (b) Machine learning model and deep neural network training and testing regression plots, residual plots; (c) SHAP plot of methylated features of XGBoost model; (d) GO enrichment analysis plot; e: KEGG analysis plot, KEGG data were obtained using the KEGG website enrichment28.

Hot Topics

Related Articles