Proteomic aging clock predicts mortality and risk of common age-related diseases in diverse populations

Study participantsThe UKB is a prospective cohort study with extensive genetic and phenotype data available for 502,505 individuals resident in the United Kingdom who were recruited between 2006 and 201040. The full UKB protocol is available online (https://www.ukbiobank.ac.uk/media/gnkeyh2q/study-rationale.pdf). We restricted our UKB sample to those participants with Olink Explore data available at baseline who were randomly sampled from the main UKB population (n = 45,441).The CKB is a prospective cohort study of 512,724 adults aged 30–79 years who were recruited from ten geographically diverse (five rural and five urban) areas across China between 2004 and 2008. Details on the CKB study design and methods have been previously reported41. We restricted our CKB sample to those participants with Olink Explore data available at baseline in a nested case–cohort study of IHD and who were genetically unrelated to each other (n = 3,977).The FinnGen study is a public–private partnership research project that has collected and analyzed genome and health data from 500,000 Finnish biobank donors to understand the genetic basis of diseases42. FinnGen includes nine Finnish biobanks, research institutes, universities and university hospitals, 13 international pharmaceutical industry partners and the Finnish Biobank Cooperative (FINBB). The project utilizes data from the nationwide longitudinal health register collected since 1969 from every resident in Finland. In FinnGen, we restricted our analyses to those participants with Olink Explore data available and passing proteomic data quality control (n = 1,990).Proteomic profilingProteomic profiling in the UKB, CKB and FinnGen was carried out for protein analytes measured via the Olink Explore 3072 platform that links four Olink panels (Cardiometabolic, Inflammation, Neurology and Oncology). For all cohorts, the preprocessed Olink data were provided in the arbitrary NPX unit on a log2 scale. In the UKB, the random subsample of proteomics participants (n = 45,441) were selected by removing those in batches 0 and 7. Randomized participants selected for proteomic profiling in the UKB have been shown previously to be highly representative of the wider UKB population43. UKB Olink data are provided as Normalized Protein eXpression (NPX) values on a log2 scale, with details on sample selection, processing and quality control documented online.In the CKB, stored baseline plasma samples from participants were retrieved, thawed and subaliquoted into multiple aliquots, with one (100 µl) aliquot used to make two sets of 96-well plates (40 µl per well). Both sets of plates were shipped on dry ice, one to the Olink Bioscience Laboratory at Uppsala (batch one, 1,463 unique proteins) and the other shipped to the Olink Laboratory in Boston (batch two, 1,460 unique proteins), for proteomic analysis using a multiplex proximity extension assay, with each batch covering all 3,977 samples. Samples were plated in the order they were retrieved from long-term storage at the Wolfson Laboratory in Oxford and normalized using both an internal control (extension control) and an inter-plate control and then transformed using a predetermined correction factor. The limit of detection (LOD) was determined using negative control samples (buffer without antigen). A sample was flagged as having a quality control warning if the incubation control deviated more than a predetermined value (±0.3) from the median value of all samples on the plate (but values below LOD were included in the analyses).In the FinnGen study, blood samples were collected from healthy individuals and EDTA-plasma aliquots (230 µl) were processed and stored at −80 °C within 4 h. Plasma aliquots were subsequently thawed and plated in 96-well plates (120 µl per well) as per Olink’s instructions. Samples were shipped on dry ice to the Olink Bioscience Laboratory (Uppsala) for proteomic analysis using the 3,072 multiplex proximity extension assay. Samples were sent in three batches and to minimize any batch effects, bridging samples were added according to Olink’s recommendations. In addition, plates were normalized using both an internal control (extension control) and an inter-plate control and then transformed using a predetermined correction factor. The LOD was determined using negative control samples (buffer without antigen). A sample was flagged as having a quality control warning if the incubation control deviated more than a predetermined value (±0.3) from the median value of all samples on the plate (but values below LOD were included in the analyses).We excluded from analysis any proteins not available in all three cohorts, as well as an additional three proteins that were missing in over 10% of the UKB sample (CTSS, PCOLCE and NPM1), leaving a total of 2,897 proteins for analysis. After missing data imputation (see below), proteomic data were normalized separately within each cohort by first rescaling values to be between 0 and 1 using MinMaxScaler() from scikit-learn and then centering on the median.OutcomesUKB aging biomarkers were measured using baseline nonfasting blood serum samples as previously described44. Biomarkers were previously adjusted for technical variation by the UKB, with sample processing (https://biobank.ndph.ox.ac.uk/showcase/showcase/docs/serum_biochemistry.pdf) and quality control (https://biobank.ndph.ox.ac.uk/showcase/ukb/docs/biomarker_issues.pdf) procedures described on the UKB website. Field IDs for all biomarkers and measures of physical and cognitive function are shown in Supplementary Table 18. Poor self-rated health, slow walking pace, self-rated facial aging, feeling tired/lethargic every day and frequent insomnia were all binary dummy variables coded as all other responses versus responses for ‘Poor’ (overall health rating; field ID 2178), ‘Slow pace’ (usual walking pace; field ID 924), ‘Older than you are’ (facial aging; field ID 1757), ‘Nearly every day’ (frequency of tiredness/lethargy in last 2 weeks; field ID 2080) and ‘Usually’ (sleeplessness/insomnia; field ID 1200), respectively. Sleeping 10+ hours per day was coded as a binary variable using the continuous measure of self-reported sleep duration (field ID 160). Systolic and diastolic blood pressure were averaged across both automated readings. Standardized lung function (FEV1) was calculated by dividing the FEV1 best measure (field ID 20150) by standing height squared (field ID 50). Hand grip strength variables (field ID 46,47) were divided by weight (field ID 21002) to normalize according to body mass. Frailty index was calculated using the algorithm previously developed for UKB data by Williams et al.21. Components of the frailty index are shown in Supplementary Table 19. Leukocyte telomere length was measured as the ratio of telomere repeat copy number (T) relative to that of a single copy gene (S; HBB, which encodes human hemoglobin subunit β)45. This T:S ratio was adjusted for technical variation and then both log-transformed and z-standardized using the distribution of all individuals with a telomere length measurement.Detailed information about the linkage procedure (https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=115559) with national registries for mortality and cause of death information in the UKB is available online. Mortality data were accessed from the UKB data portal on 23 May 2023, with a censoring date of 30 November 2022 for all participants (12–16 years of follow-up).Data used to define prevalent and incident chronic diseases in the UKB are outlined in Supplementary Table 20. In the UKB, incident cancer diagnoses were ascertained using International Classification of Diseases (ICD) diagnosis codes and corresponding dates of diagnosis from linked cancer and mortality register data. Incident diagnoses for all other diseases were ascertained using ICD diagnosis codes and corresponding dates of diagnosis taken from linked hospital inpatient, primary care and death register data. Primary care read codes were converted to corresponding ICD diagnosis codes using the lookup table provided by the UKB. Linked hospital inpatient, primary care and cancer register data were accessed from the UKB data portal on 23 May 2023, with a censoring date of 31 October 2022; 31 July 2021 or 28 February 2018 for participants recruited in England, Scotland or Wales, respectively (8–16 years of follow-up).In the CKB, information about incident disease and cause-specific mortality was obtained by electronic linkage, via the unique national identification number, to established local mortality (cause-specific) and morbidity (for stroke, IHD, cancer and diabetes) registries and to the health insurance system that records any hospitalization episodes and procedures41,46. All disease diagnoses were coded using the ICD-10, blinded to any baseline information, and participants were followed up to death, loss-to-follow-up or 1 January 2019. ICD-10 codes used to define diseases studied in the CKB are shown in Supplementary Table 21.Missing data imputationMissing values for all nonproteomics UKB data were imputed using the R package missRanger47, which combines random forest imputation with predictive mean matching. We imputed a single dataset using a maximum of ten iterations and 200 trees. All other random forest hyperparameters were left at default values. The imputation dataset included all baseline variables available in the UKB as predictors for imputation, excluding variables with any nested response patterns. Responses of ‘do not know’ were set to ‘NA’ and imputed. Responses of ‘prefer not to answer’ were not imputed and set to NA in the final analysis dataset. Age and incident health outcomes were not imputed in the UKB. CKB data had no missing values to impute.Protein expression values were imputed in the UKB and FinnGen cohort using the miceforest package in Python. All proteins except those missing in >30% of participants were used as predictors for imputation of each protein. We imputed a single dataset using a maximum of five iterations. All other parameters were left at default values.Calculation of chronological age measuresIn the UKB, age at recruitment (field ID 21022) is only provided as a whole integer value. We derived a more accurate estimate by taking month of birth (field ID 52) and year of birth (field ID 34) and creating an approximate date of birth for each participant as the first day of their birth month and year. Age at recruitment as a decimal value was then calculated as the number of days between each participant’s recruitment date (field ID 53) and approximate birth date divided by 365.25. Age at the first imaging follow-up (2014+) and the repeat imaging follow-up (2019+) were then calculated by taking the number of days between the date of each participant’s follow-up visit and their initial recruitment date divided by 365.25 and adding this to age at recruitment as a decimal value. Recruitment age in the CKB is already provided as a decimal value.Model benchmarkingWe compared the performance of six different machine-learning models (LASSO, elastic net, LightGBM and three neural network architectures: multilayer perceptron, a residual feedforward network (ResNet) and a retrieval-augmented neural network for tabular data (TabR)) for using plasma proteomic data to predict age. For each model, we trained a regression model using all 2,897 Olink protein expression variables as input to predict chronological age. All models were trained using fivefold cross-validation in the UKB training data (n = 31,808) and were tested against the UKB holdout test set (n = 13,633), as well as independent validation sets from the CKB and FinnGen cohorts. We found that LightGBM provided the second-best model accuracy among the UKB test set, but showed markedly better performance in the independent validation sets (Supplementary Fig. 1).LASSO and elastic net models were calculated using the scikit-learn package in Python. For the LASSO model, we tuned the alpha parameter using the LassoCV function and an alpha parameter space of [1 × 10−15, 1 × 10−10, 1 × 10−8, 1 × 10−5, 1 × 10−4, 1 × 10−3, 1 × 10−2, 1, 5, 10, 50 and 100]. Elastic net models were tuned for both alpha (using the same parameter space) and L1 ratio drawn from the following possible values: [0.1, 0.5, 0.7, 0.9, 0.95, 0.99 and 1].The LightGBM model hyperparameters were tuned via fivefold cross-validation using the Optuna module in Python48, with parameters tested across 200 trials and optimized to maximize the average R2 of the models across all folds.The neural network architectures tested in this analysis were selected from a list of architectures that performed well on a variety of tabular datasets. The architectures considered were (1) a multilayer perceptron; (2) ResNet; and (3) TabR. All neural network model hyperparameters were tuned via fivefold cross-validation using Optuna across 100 trials and optimized to maximize the average R2 of the models across all folds.Calculation of ProtAgeUsing gradient boosting (LightGBM) as our selected model type, we initially ran models trained separately on males and females; however, the male- and female-only models showed similar age prediction performance to a model with both sexes (Supplementary Fig. 8a–c) and protein-predicted age from the sex-specific models were nearly perfectly correlated with protein-predicted age from the model using both sexes (Supplementary Fig. 8d,e). We further found that when looking at the most important proteins in each sex-specific model, there was a large consistency across males and females. Specifically, 11 of the top 20 most important proteins for predicting age according to SHAP values were shared across males and females and all 11 shared proteins showed consistent directions of effect for males and females (Supplementary Fig. 9a,b; ELN, EDA2R, LTBP2, NEFL, CXCL17, SCARF2, CDCP1, GFAP, GDF15, PODXL2 and PTPRR). We therefore calculated our proteomic age clock in both sexes combined to improve the generalizability of the findings.To calculate proteomic age, we first split all UKB participants (n = 45,441) into 70:30 train–test splits. In the training data (n = 31,808), we trained a model to predict age at recruitment using all 2,897 proteins in a single LightGBM18 model. First, model hyperparameters were tuned via fivefold cross-validation using the Optuna module in Python48, with parameters tested across 200 trials and optimized to maximize the average R2 of the models across all folds. We then carried out Boruta feature selection via the SHAP-hypetune module. Boruta feature selection works by making random permutations of all features in the model (called shadow features), which are essentially random noise19. In our use of Boruta, at each iterative step these shadow features were generated and a model was run with all features and all shadow features. We then removed all features that did not have a mean of the absolute SHAP value that was higher than all random shadow features. The selection processes ended when there were no features remaining that did not perform better than all shadow features. This procedure identifies all features relevant to the outcome that have a greater influence on prediction than random noise. When running Boruta, we used 200 trials and a threshold of 100% to compare shadow and real features (meaning that a real feature is selected if it performs better than 100% of shadow features). Third, we re-tuned model hyperparameters for a new model with the subset of selected proteins using the same procedure as before. Both tuned LightGBM models before and after feature selection were checked for overfitting and validated by performing fivefold cross-validation in the combined train set and testing the performance of the model against the holdout UKB test set. Across all analysis steps, LightGBM models were run with 5,000 estimators, 20 early stopping rounds and using R2 as a custom evaluation metric to identify the model that explained the maximum variation in age (according to R2).Once the final model with Boruta-selected APs was trained in the UKB, we calculated protein-predicted age (ProtAge) for the entire UKB cohort (n = 45,441) using fivefold cross-validation. Within each fold, a LightGBM model was trained using the final hyperparameters and predicted age values were generated for the test set of that fold. We then combined the predicted age values from each of the folds to create a measure of ProtAge for the entire sample. ProtAge was calculated in the CKB and FinnGen by using the trained UKB model to predict values in those datasets. Finally, we calculated proteomic aging gap (ProtAgeGap) separately in each cohort by taking the difference of ProtAge minus chronological age at recruitment separately in each cohort.Recursive feature elimination using SHAPFor our recursive feature elimination analysis, we started from the 204 Boruta-selected proteins. In each step, we trained a model using fivefold cross-validation in the UKB training data and then within each fold calculated the model R2 and the contribution of each protein to the model as the mean of the absolute SHAP values across all participants for that protein. R2 values were averaged across all five folds for each model. We then removed the protein with the smallest mean of the absolute SHAP values across the folds and computed a new model, eliminating features recursively using this method until we reached a model with only five proteins. If at any step of this process a different protein was identified as the least important in the different cross-validation folds, we chose the protein ranked the lowest across the greatest number of folds to remove. We identified 20 proteins as the smallest number of proteins that provide adequate prediction of chronological age, as fewer than 20 proteins resulted in a dramatic drop in model performance (Supplementary Fig. 3d). We re-tuned hyperparameters for this 20-protein model (ProtAge20) using Optuna according to the methods described above, and we also calculated the proteomic age gap according to these top 20 proteins (ProtAgeGap20) using fivefold cross-validation in the entire UKB cohort (n = 45,441) using the methods described above.Statistical analysisAll statistical analyses were carried out using Python v.3.6 and R v.4.2.2. All associations between ProtAgeGap and aging biomarkers and physical/cognitive function measures in the UKB were tested using linear/logistic regression using the statsmodels module49. All models were adjusted for age, sex, Townsend deprivation index, assessment center, self-reported ethnicity (Black, white, Asian, mixed and other), IPAQ activity group (low, moderate and high) and smoking status (never, previous and current). P values were corrected for multiple comparisons via the FDR using the Benjamini–Hochberg method50.All associations between ProtAgeGap and incident outcomes (mortality and 26 diseases) were tested using Cox proportional hazards models using the lifelines module51. Survival outcomes were defined using follow-up time to event and the binary incident event indicator. For all incident disease outcomes, prevalent cases were excluded from the dataset before models were run. For all incident outcome Cox modeling in the UKB, three successive models were tested with increasing numbers of covariates. Model 1 included adjustment for age at recruitment and sex. Model 2 included all model 1 covariates, plus Townsend deprivation index (field ID 22189), assessment center (field ID 54), physical activity (IPAQ activity group; field ID 22032) and smoking status (field ID 20116). Model 3 included all model 3 covariates plus BMI (field ID 21001) and prevalent hypertension (defined in Supplementary Table 20). P values were corrected for multiple comparisons via FDR.Functional enrichments (GO biological processes, GO molecular function, KEGG and Reactome) and PPI networks were downloaded from STRING (v.12) using the STRING API in Python. For functional enrichment analyses, we used all proteins included in the Olink Explore 3072 platform as the statistical background (except for 19 Olink proteins that could not be mapped to STRING IDs. None of the proteins that could not be mapped were included in our final Boruta-selected proteins). We only considered PPIs from STRING at a high level of confidence (>0.7) from the coexpression data.SHAP interaction values from the trained LightGBM ProtAge model were retrieved using the SHAP module20,52. SHAP-based PPI networks were generated by first taking the mean of the absolute value of each protein–protein SHAP interaction score across all samples. We then used an interaction threshold of 0.0083 and removed all interactions below this threshold, which yielded a subset of variables similar in number to the node degree >2 threshold used for the STRING PPI network. Both SHAP-based and STRING53-based PPI networks were visualized and plotted using the NetworkX module54.Cumulative incidence curves and survival tables for deciles of ProtAgeGap were calculated using KaplanMeierFitter from the lifelines module. As our data were right-censored, we plotted cumulative events against age at recruitment on the x axis. All plots were generated using matplotlib55 and seaborn56. The total fold risk of disease according to the top and bottom 5% of the ProtAgeGap was calculated by raising the HR for the disease by the total number of years comparison (12.3 years average ProtAgeGap difference between the top versus bottom 5% and 6.3 years average ProtAgeGap between the top 5% versus those with 0 years of ProtAgeGap).Ethics approvalUKB data use (project application no. 61054) was approved by the UKB according to their established access procedures. UKB has approval from the North West Multi-centre Research Ethics Committee as a research tissue bank and as such researchers using UKB data do not require separate ethical clearance and can operate under the research tissue bank approval. The CKB complies with all the required ethical standards for medical research on human participants. Ethical approvals were granted and have been maintained by the relevant institutional ethical research committees in the United Kingdom and China. Study participants in FinnGen provided informed consent for biobank research, based on the Finnish Biobank Act. The FinnGen study is approved by the Finnish Institute for Health and Welfare (permit nos. THL/2031/6.02.00/2017, THL/1101/5.05.00/2017, THL/341/6.02.00/2018, THL/2222/6.02.00/2018, THL/283/6.02.00/2019, THL/1721/5.05.00/2019 and THL/1524/5.05.00/2020), Digital and Population Data Service Agency (permit nos. VRK43431/2017-3, VRK/6909/2018-3 and VRK/4415/2019-3), the Social Insurance Institution (permit nos. KELA 58/522/2017, KELA 131/522/2018, KELA 70/522/2019, KELA 98/522/2019, KELA 134/522/2019, KELA 138/522/2019, KELA 2/522/2020 and KELA 16/522/2020), Findata (permit nos. THL/2364/14.02/2020, THL/4055/14.06.00/2020, THL/3433/14.06.00/2020, THL/4432/14.06/2020, THL/5189/14.06/2020, THL/5894/14.06.00/2020, THL/6619/14.06.00/2020, THL/209/14.06.00/2021, THL/688/14.06.00/2021, THL/1284/14.06.00/2021, THL/1965/14.06.00/2021, THL/5546/14.02.00/2020, THL/2658/14.06.00/2021 and THL/4235/14.06.00/2021), Statistics Finland (permit nos. TK-53-1041-17 and TK/143/07.03.00/2020 (previously TK-53-90-20) TK/1735/07.03.00/2021 and TK/3112/07.03.00/2021) and Finnish Registry for Kidney Diseases permission/extract from the meeting minutes on 4 July 2019.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Proteomic aging clock predicts mortality and risk of common age-related diseases in diverse populations

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Hot Topics

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Popular Articles

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models