Profiling the transcriptomic age of single-cells in humans

Description of the input datasetsThis study used publicly available datasets, all of which were produced by single-cell RNA sequencing of human cells (Supplementary Table 1). The collection and processing of samples were done by the original authors of the studies, this study relies on the available gene expression datasets. The main dataset we used was The Asian Immune Diversity Atlas (AIDA, https://explore.data.humancellatlas.org/projects/f0f89c14-7460-4bab-9d42-22228a91f185)23, which comprises 1,058,909 peripheral blood mononuclear cells (PMBCs) in total from 508 healthy, human donors. The age of the donors ranges from 19 to 75 years, with a mean of 42 years and a median of 41 years. Cell type annotation was done by the AIDA team in multiple steps in a hierarchical manner and was available for the dataset. Clustering and annotation of cells were performed on four levels, the marker genes identifying each cell type are available in the Supplementary Table S2 of the preprint of the AIDA dataset23.The study of Yazar et al. presents a dataset of 1,248,980 PBMCs from 981 healthy, human donors with ages ranging from 19 to 97 years29 (referred to as “eQTL dataset”).Yoshida et al. analyzed single-cell samples from healthy, COVID-19 infected and post-COVID-19 patients30. The COVID-19 set includes donors with RT-qPCR test positive for SARS-CoV-2 and in asymptomatic, mild, moderate, or severe conditions. The post-COVID-19 set consists of donors recovered from COVID-19, including symptomatic patients as well. Although Yoshida et al. carried out multi-omic analyses, here we only used the provided PBMC dataset, which consists of 422,220 cells from 75 donors (referred to as “Yoshida et al. dataset”), and we focus on the healthy and COVID-19 samples. The dataset does not contain information about the exact chronological age of donors, but they were assigned to age groups. It contains both pediatric and adult samples, the developmental stage of the donors ranges from the newborn to the elderly (aged adult) stage. Liu et al. collected and analyzed PBMC samples from healthy and COVID-19 infected patients31. The COVID-19 set includes donors with positive nasopharyngeal swab and/or positive serology for SARS-CoV-2 infection31 and in moderate, severe or critical conditions. The dataset consists of 372,081 cells from 46 donors with age ranging from 24 to 84 years (referred to as “Liu et al. dataset”). The study of Stephenson et al. presents a dataset of 647,366 PBMCs from healthy and COVID-19 infected, non-COVID-19 and IV-LPS patients32 (referred to as “Stephenson et al. dataset”). The COVID-19 group includes patients in asymptomatic, mild, moderate, severe and critical conditions. In this study, we focus on the healthy and COVID-19 cases. The dataset does not contain information about the exact chronological age of every donor, but they were assigned to age groups from the 3rd to the 10th decade.Sikkema et al. presented a large-scale, integrated single-cell reference atlas of the human lung33. Here, we only used data of cells derived from the lung parenchyma of healthy donors, which dataset consists of 333,468 cells from 66 donors with age ranging from 15 to 76 years (referred to as “Lung dataset”).Single-cell RNA sequencing of 124 human pre-implantation embryos and embryonic stem cells was done by Yan et al.35 (referred to as “Yan et al. dataset”). The following development stages were present in the dataset (number of cells in parentheses): oocyte (3), zygote (3), 2-cell embryo (6), 4-cell embryo (12), 8-cell embryo (20), morulae (16), late blastocyst (30), hESC passage 0 (8), hESC passage 10 (26). The study of Petropoulos et al. presents a dataset consisting of 1529 human preimplantation embryo cells36 (referred to as “Petropoulos et al. dataset”). The following development stages (embryonic days) were present in the dataset (number of cells in parentheses): embryonic day 3 (81), embryonic day 4 (190), embryonic day 5 (377), embryonic day 6 (415), embryonic day 7 (466). Finally, the dataset of Meistermann et al. contains 150 human preimplantation embryo cells37 (referred to as “Meistermann et al. dataset”). The following development stages (branches) were present in the dataset (number of cells in parentheses): pre-morula (1), morula (30), early blastocyst (5), inner cell mass (9), early trophectoderm (62), epiblast (36), primitive endoderm (1), TE.NR2F2- (5), TE.NR2F2+ (1).The AIDA, eQTL, Yoshida et al., Liu et al., Stephenson et al. and Lung datasets were downloaded from https://cellxgene.cziscience.com/. The Yan et al. data was retrieved from the Gene Expression Omnibus48, the Petropoulos et al. dataset from BioStudies49 and the Meistermann et al. dataset from Mendeley Data (https://data.mendeley.com).Data pre-processingEach single-cell PBMC and lung dataset (AIDA23, eQTL dataset29, Yoshida et al. dataset30, Liu et al. dataset31, Stephenson et al. dataset32 and Lung33) were downloaded as h5ad files containing AnnData objects. The raw gene expression counts were accessed either through the main layer of the object or through the .raw layer when transformed counts were assigned to the main. In the case of each dataset, the raw, gene-by-cell count matrix was considered as the base data for our analyses. Then, raw counts were transformed by a log-normalization step, i.e. for each cell, each gene expression count value was divided by the total expression count of the given cell, multiplied by 10,000, and transformed with log1p transformation. Formally: \({\hat{c}}_{ij}=\log (1+[\frac{{c}_{ij}}{{c}_{i}}\cdot 10,000])\), where cij is the raw expression count of gene j in cell i, ci is the total expression count in cell i, \(\log\) indicates the natural logarithm and \({\hat{c}}_{ij}\) denotes the resulting log-normalized count.The embryogenesis data of Yan et al.35 were downloaded as individual txt files for each of the 124 cells, containing the raw expression count of the detected genes. These files were processed to obtain a cell-by-gene matrix containing the corresponding gene expression values. Missing values in this matrix, i.e. expression counts of genes that were not detected in a given cell were treated as 0 values. Then the resulting raw count matrix was log-normalized in the same way as the other datasets described above. The Petropoulos et al. dataset36 were accessed as a single txt file containing the gene by cell count matrix of raw expression values. After transposing, it was log-normalized in the same way as described above. The dataset of Meistermann et al.37 was accessed through the Supplementary data repository of Radley et al.50. The tsv file containing the raw gene expression values was filtered to obtain the data of the cells introduced in the study of Meistermann et al. The resulting count matrix was then log-normalized. The development stages (branches) present in this dataset were adjusted to have similar stages as the other embryogenesis datasets namely, a blastocyst stage has been created and all cells of the following branches were assigned to it: inner cell mass, early trophectoderm, epiblast, primitive endoderm, TE.NR2F2-, TE.NR2F2+. Furthermore, the pre-morula stage has been removed from the dataset due to the insufficient number of samples (1 cell).Calculation of transcriptional noiseThe variability of cells, i.e. transcriptional noise in the AIDA dataset was calculated for each (donor, cell type, gene) triplet. Standard deviation (std) and interquartile range (IQR) of the log-normalized gene expressions of a given gene were calculated (for each donor and cell type). One-sample, two-sided t-tests were calculated with each metric separately to test if the transcriptional noise of a given gene in a given cell type significantly differs from 0. In addition, Spearman’s correlation between chronological age and transcriptional noise was measured for each cell type and gene.Gene set enrichment analysisGene set enrichment analysis was performed with the GSEA software51,52, for each of the top 5 best-performing single-cell clocks, separately. The data of normalized expression values of cells of the given cell type was used as the expression dataset, the REACTOME_CELLULAR_SENESCENCE gene set from the Molecular Signatures Database (MSignDB) as the gene set and the predicted age of the cells as phenotype labels. The genes in the expression data were mapped to the cellular senescence gene set prior to the analysis, consequently the analysis was run with the “No_Collapse” setting. The number of permutations was set to 100 and the metric for ranking genes to Spearman correlation, otherwise the default parameter setting was used.Age prediction modelsChronological age prediction models (i.e., aging clocks) were developed using the AIDA dataset. Single-cell models were developed for each cell type, separately, resulting in cell-type-specific aging clocks. Pseudo-bulk data-based models were trained on cells averaged by donor, cell-type-specific pseudo-bulk models were trained on cells averaged by donor and cell type. We used the categorization of cell types created and assigned to the cells by the original authors of the dataset, and available in the retrieved AnnData object. For each cell type, ElasticNet models were trained on the log-normalized gene expressions of single cells to predict chronological age. 5-fold cross-validation (CV) on the donor level was used to prevent overfitting, i.e. a model was trained on cells from \(\frac{4}{5}\) of the donors, and tested on the remaining cells (coming from the remaining \(\frac{1}{5}\) of the donors), thus there was no overlapping of the train and test set of a given model, even on the donor level. The cross-validation resulted in five models (for each cell type and model type) and the predicted age of a given cell was considered to be the prediction of the model that used it as a test sample. The ElacticNet model was trained by using the default parameter settings of the scikit-learn implementation of the model (l1_ratio = 0.5, alpha = 1). We have also experimented with the glmnet implementation of the ElasticNet model, which fine-tunes the alpha parameter (lambda with the glmnet terminology), however, as we observed a similar performance along with a much higher computational demand, we relied on the default scikit-learn model in this study.Using the single-cell predictions, we additionally calculated donor-level age predictions by averaging the predicted age of cells of a given donor. The averaging was done for each cell type, separately, resulting in cell-type-specific age predictions for each donor. The average of the age predictions of all cells of a given donor resulted in the general (non-cell-type-specific) age prediction of the donor.Besides the single-cell aging clocks, cell-type proportion-based models were also developed motivated by the work of Zhu et al.22. For this purpose, the proportion of each cell type compared to the total number of cells of the donor was calculated and used to predict chronological age. The glmnet implementation of the ElasticNet algorithm was trained on the cell type proportion values using 5-fold cross-validation. We have also experimented with Partial Least Square Regression and XGBoost models, but due to their similar or weaker performance compared to ElasticNet, we excluded these results from the study.Combined chronological age predictions were calculated based on the single-cell gene expression clocks and cell-type proportion clocks. To predict the donor age in this way, the average of the donor-level prediction given by a cell-type-specific single-cell clock and the predicted age given by a cell type proportion-based clock was calculated. Formally:$${\,{\mbox{PredAge}}\,}_{{{\rm{combined}}}}^{d}=\frac{{{\mbox{avg}}}_{i}\left({\,{\mbox{PredAge}}\,}_{{{\rm{scE}}}}^{d,i}\right)+{\,{\mbox{PredAge}}\,}_{{{\rm{P}}}}^{d}}{2},$$
(1)
where \({\,{\mbox{PredAge}}\,}_{{{\rm{scE}}}}^{d,i}\) is the predicted age of cell i of donor d given by a cell-type-specific single-cell clock, \({\,{\mbox{PredAge}}\,}_{{{\rm{P}}}}^{d}\) is the predicted age of donor d given by a cell type proportion clock and \({\,{\mbox{PredAge}}\,}_{{{\rm{combined}}}}^{d}\) denotes the resulting combined predicted age of donor d.For the evaluation of the described aging clocks, we measured the Mean Absolute Error (MAE) and the Pearson correlation of the chronological and predicted age.Model validation on external datasetsFor the external validation of the developed aging clocks, we used the eQTL dataset (Yazar et al.29), the healthy samples of the Yoshida et al. dataset30, the Liu et al. dataset31, and the Stephenson et al. dataset32. In the case of all datasets cell-type-specific single-cell clocks were applied to the common cell types, based on the categorization of cell types available for each dataset. Additionally, CD8-positive alpha-beta memory T cell clock was applied to the narrower types, i.e. to the central and effector memory CD8-positive alpha-beta T cell cells in the case of the eQTL dataset, the central, effector, and terminally differentiated effector memory CD8-positive alpha-beta T cells in the case of the Yoshida et al. dataset, and the effector memory CD8-positive, alpha-beta T cell in the case of the Stephenson et al. dataset. In the case of the Stephenson et al. dataset the innate lymphoid cell clock was applied to the group 2 innate lymphoid cell and ILC1 cell types, the plasma cell clock to the IgM, IgA and IgG plasma cells, and the memory B cell clock to the class switched and unswitched memory B cells. Moreover, the CD8-positive, alpha-beta cytotoxic T cell clock was applied to the effector CD8-positive, alpha-beta T cell-typed cells in this dataset. The CD14-positive monocyte clock was applied to the classical, the CD14-low, CD16-positive monocyte clock to the non-classical monocyte cells and the monocyte clock to the intermediate monocytes (Yoshida et al. and Liu et al. datasets). In the case of the Liu et al. dataset both the effector and central memory CD4-positive, alpha-beta T cell clocks were applied to the CD4-positive, alpha-beta memory T cells. Consequently, 23 cell-type-specific clocks were applied to the eQTL dataset, 21 to the Yoshida et al. dataset, 20 to the Liu et al. dataset and 26 to the Stephenson et al. dataset. In all datasets the primary ID of genes was their Ensembl ID, thus it was possible to directly map the genes to clock features in the external validation datasets. During the application of clocks, missing values were handled with average imputation, that is the expression value of a gene that was not presented in the validation set but presented among the clock features was imputed with the average (log-normalized) expression value of the gene in the training (AIDA) dataset, where the average was taken over all cells of the given cell type. In all cases, all of the five clocks of the cross-validation of the training set (AIDA) were applied to the external datasets, and for each cell, the average of the five predictions was considered as the final predicted age of the cell.The application of the cell-type proportion clocks was done similarly to the single-cell clocks, namely, all five fitted models of the cross-validation were applied to the external validation data, and the average of their prediction was considered to be the final predicted age of a donor. Missing value imputation was done with the average proportion of a given cell type in the training set (AIDA), where the average was taken over all donors in AIDA. Proportions of narrower cell types mentioned above were summarized thus average imputation was not performed in these cases. Combined chronological age predictions were calculated on the external validation sets, according to Equation (1).For the further validation of the proposed single-cell aging clocks, we used the Lung dataset (Sikkema et al.33). Similarly to the application of the clocks to PBMC data, the cell-type-specific clocks were applied to the common cell types. In the case of higher-level categorization of cell types in the Lung dataset, all the corresponding narrower cell-type-specific clocks were applied to the cells and their predictions were averaged for each cell (e.g. B cell, memory B cell, naive B cell and mature B cell clocks were all applied to the B cells in the Lung dataset). In total, predicted age for cells of 11 cell types were calculated.The evaluation of the clocks on the external datasets was done based on MAE and Pearson correlation of the chronological and predicted ages when exact ages were assigned to donors (in the case of the eQTL, Liu et al. and Lung datasets). When only age groups were available (in the case of the Yoshida et al., and the Stephenson et al. datasets), we calculated Spearman correlation coefficients between the age group and the predicted age variables, as well as t-tests for the comparison of two groups. In addition to the exact ages in the eQTL and Liu et al. datasets, we assigned the donors to two age categories, motivated by the groups of the Yoshida et al. dataset, to adult (18 ≤ age < 65) and aged adult (65 ≤ age) groups, and the clock predictions were compared between these age groups.Application of clocks to additional datasetsThe created single-cell clocks were applied to the COVID-19-infected samples of the Yoshida et al., Liu et al., and Stephenson et al. datasets as well. The ElasticNet-based clocks were applied to the samples the same way as in the case of the healthy samples, described above in the Model validation on external datasets section. Predicted age of samples in the different severity groups and from healthy individuals were compared to assess the effect of SARS-CoV-2 infection on biological age captured by the proposed aging clocks. In order to account the age distribution differences of the different groups we included chronological age as a confounder variable in the comparison of the groups. In the Yoshida et al. dataset more precise chronological age was available to a few donors, but to properly compare the different groups, every donor was assigned either to the adult (18 ≤ age < 65) or aged adult (65 ≤ age) groups. Similarly, in the case of the Stephenson et al. dataset, every donor was assigned to an age group from 3rd to 10th decade.In the case of the three embryogenesis datasets, we could apply only the single-cell clocks. Multiple ElasticNet-based cell-type-specific clocks were applied to the samples in a similar way as described above, namely, missing values were imputed with the corresponding average expression value in the training (AIDA) dataset, and the average prediction of the five CV clocks was considered as the final predicted age of a sample (see the Model validation on external datasets section for more details). In contrast to the other datasets used in this study, where genes were identified with their Ensembl ID, here they were represented by their name (in all three datasets). To map the genes to clock features two approaches were applied. Direct mapping was used if a gene could be mapped directly by its name to the gene set of the training dataset (AIDA) where gene names were also available. Otherwise, using the mapping function of UniProt (https://www.uniprot.org), gene names were mapped to Ensembl IDs and matched with the gene set of AIDA through these IDs. Duplicates, i.e. when more than one gene name were mapped to an Ensembl ID, were checked manually to make sure that the correct name was assigned to the ID.Explanation of changes in predicted agesIn the case of the changes in predicted ages shown by the single-cell clocks consistent over all three COVID-19 datasets (Yoshida et al., Liu et al., Stephenson et al.), we searched for the genes supporting these changes. For a given cell type, the genes having non-zero regression coefficient in all five CV clocks were selected, and their normalized expression in each dataset were further examined. The different severity groups were compared based on the expression of a given gene, similarly as described above, accounting for the age distribution differences by including age as a confounder variable. A gene was selected to support the decrease in predicted ages between two groups, if its expression decreases too, while the gene is upregulated with aging according to the clock (i.e. shows opposing pattern to aging), and similarly if its expression increases. Based on the same considerations, a gene was selected to support the increase in predicted ages between two groups, if its expression increases too and the gene is also upregulated with aging according to the clock (i.e. shows the same pattern as during aging), and similarly if its expression decreases.In the case of the embryogenesis analyses, we searched for clock feature genes supporting the decrease of predicted ages at the morulae stage. Since all the clocks show a similar aging (and rejuvenation) process during early development, some common genes are assumed to drive this process. Genes having the 10 highest (in absolute value) regression coefficient in all five CV clocks for more than one cell-type-specific single-cell clocks have been selected for further examination. A gene is assumed to be supporting the rejuvenation process, if its expression decreases at morulae, but increases during aging, and similarly, if its expression increases at morulae, but decreases with aging.StatisticsThe significance of the differences between the two different groups in the external validation analyses was calculated by two-sample, two-sided Student’s t-tests. In the case of the COVID-19-related analyses, the significance of the difference between two groups was calculated by a generalized linear model fitted to the predicted age/gene expression with severity as the independent variable and age as a confounder variable. The significance of the differences between the two different groups in the embryogenesis analyses was calculated by two-sample, two-sided Welch’s t-tests. ****: p ≤ 0.0001, ***: 0.0001 < p ≤ 0.001, **: 0.001 < p ≤ 0.01, *: 0.01 < p ≤ 0.05, ns:0.05 < p.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles