Unique genetic and risk-factor profiles in clusters of major depressive disorder-related multimorbidity trajectories

Description of the training cohortsUK BiobankUnder application number 1602, we extracted data from the UK Biobank (UKB) database, which includes medical and phenotypic data of participants recruited from NHS patient registers of people aged 40–69 years41. Ethical approval was given by the National Research Ethics Service Committee North West–Haydock42, and all participants provided written informed consent. All procedures were conducted in accordance with the Declaration of Helsinki.To identify MDD-related clusters based on disease trajectories, 502,504 participants who had available disease onset information for 1,127 ICD-10 categories were included (for details, see the “Cross-cohort disease categories and relevance scores” section).
Quality control of GWAS data in UKB data
Our genomic quality control (QC) methods were detailed previously43, but in the present analyses, we did not restrict participants according to the availability of complete dietary data.
Specifically, we selected participants with White British ancestry (UKB data field 22006, defined both by self-report and genetic ancestry) and without putative sex chromosome aneuploidy (data field 22019). We used v3 genetic data of UKB with genotyped variants and, when genotyped variants were not present, we used imputed variants as well and positioned them according to the GRCh37/hg19 genome assembly. Variant QC consisted of several steps. First, multiallelic variants as well as variants with a minor allele frequency (MAF) < 0.01 were excluded, retaining only single-nucleotide polymorphisms (SNPs). For imputed SNPs, both info and certainty parameters had to be at least 0.9. Furthermore, SNPs and participants were excluded according to the missingness rate (in an iterative manner, with cut-off points at 0.1, 0.05 and 0.01), and SNPs violating Hardy-Weinberg equilibrium (p < 1 × 10-5) were excluded. Before further calculations for participant filtering, linkage disequilibrium (LD) pruning, with an R2 of 0.2, was applied to SNPs. The maximal set of unrelated individuals was selected44 (data field 22020), and a kin-cut-off of 0.044 was applied. Finally, a sex check and heterozygosity outlier detection were performed, as described previously45.
For the GWAS analyses, we selected participants who did not withdraw their consent before February 2022 and did not have missing data on sex, age, or genotyping array. To control for population stratification, principal component analysis was performed with the final set of participants and with the SNP subset after the LD pruning (described above).
Catalan Health Surveillance SystemSince 2011, the Catalan Health Surveillance System (CHSS) has collected detailed information on healthcare utilization from the entire population of Catalonia (northeastern Spain; 7.5 million inhabitants). The CHSS integrates the information contained in the Minimum Basic Dataset for Healthcare Units registry provided by 63 hospitals, 49 mental health centres, 370 primary care teams and 72 long-term care centres every 6 months. The CHSS assembles information on the use of public healthcare resources, pharmacological treatments, socioeconomic and educational status, psychological health and other billable healthcare costs, such as nonurgent medical transportation, ambulatory rehabilitation, domiciliary oxygen therapy, and dialysis46.From the 7.5 million individuals documented in the CHSS, we considered only registry data from all citizens living in the integrated health district of Barcelona-Esquerra (“AISBE”) between 2011 and 2019 (N = 645,913), with a mainly White European ancestry, as input to identify MDD-related clusters, extracting over 42 million diagnostic codes recorded between 1913 and 2019. Notably, approximately 50% of the records were from the period after 2012, when the Catalan health system underwent digitization and implemented electronic medical records. Conversely, only 2 million records were available from before 2000, with approximately 200,000 records available from before 1950.Finnish Institute for Health and Welfare (THL)For cluster analysis, data from 41,092 participants in Finnish population surveys were included47; these surveys included Finrisk 1992 (N = 5019), Finrisk 1997 (N = 7130), Finrisk 2002 (N = 7207), Finrisk 2007 (N = 4635), Finrisk 2012 (N = 5396)48, Health 2000/2011 (N = 6004) and FinHealth49 2017 (N = 5074) (https://urn.fi/URN:ISBN:978-952-343-449-3). After excluding related individuals (IBD > 0.2), 30,961 participants were retained from the Finnish population surveys47 as follows: Finrisk 1997 (N = 6723), Finrisk 2002 (N = 5698), Finrisk 2007 (N = 4635), Finrisk 2012 (N = 3078)48, Health 2000/2011 (N = 5944) and FinHealth 2017 (N = 4883). These participants, aged 20–100 years, were chosen at random from the Finnish population and represented different parts of Finland. For GWAS data used from THL cohorts, see the “Quality control of GWAS data from the FinnGen and THL cohorts” section.Description of validation cohortsFinnGen projectWe used data from the FinnGen project19 (https://www.finngen.fi/en) from data freeze 10 (DF10; excluding THL cohorts) to generate the MDD-related clusters in an independent cohort from the Finnish population. In brief, FinnGen is a public‒private project aiming to collect genotype data from half a million Finnish people and combine these data with data from various health registries. The participants consist of legacy individuals recruited before the start of the FinnGen project and prospective individuals; these latter participants were recruited on a voluntary basis during hospital visits if the patient provided consent for their data to be entered in the biobank.The THL cohorts and FinnGen cohort contain disease information collected in the following registries: Causes of death (STAT, 1969), Register of Primary Health Care Visits, HILMO (2011), Care Register for Health Care inpatient visits, HILMO (THL, 1969), Care Register for Health Care, specialist outpatient visits, HILMO (THL, 1998), Finnish Cancer Registry (CANC, 1953), Cervical cancer screening (THL, 1991), Breast cancer screening (THL, 1992), and the Finnish Registry for Kidney Diseases (1964).In total, FinnGen DF10 consists of 430,897 participants with genotype data; after excluding the THL cohorts, 385,640 participants remained for cluster analysis.
Quality control of GWAS data from the FinnGen and THL cohorts
The genotyping of FinnGen participants was performed on a Thermo Fisher axiom custom array consisting of 736,145 probes for 655,793 genetic markers. Processing of samples included removing samples where the genetic sex did not match the participant-reported sex in the registries, samples with missing variant information >0.02, samples with excess heterozygosity in common variants (allele frequency >0.05) and samples with excess relatedness to other samples (IBD > 0.1). The processing of variants depended on whether the variant was used in imputation. Quality control included removing variants if allele frequency in the panel was <0.001 (for imputation QC only), removing variants where the allele frequency differed significantly among panels, removing a variant from all batches (FinnGen chip data and legacy data, processed separately) if HWE p < 10-10 across all batches; removing any batch missing >0.03 of data (for legacy samples, the missingness threshold was 0.05 due to the exclusion of too many variants for imputation purposes), and removing batches where more than 15% of the batches had missingness >0.04. Finally, variants within a batch were removed if the p-value for HWE was >10-6 or if the missingness rate was >0.02. The legacy participants were genotyped on various generations of Illumina GWAS arrays. The Sisuv4 reference panel was used to impute an additional 20,175,454 genetic markers. Information on the generation of the imputation panel and the QC steps used to produce the imputed genotypes is available elsewhere (https://finngen.gitbook.io/finngen-analyst-handbook/finngen-data-specifics/genotype-data/imputation-panel/sisu-v4-reference-panel).
Finrisk cohorts were genotyped at the Sanger Institute, Hinxton, UK; FIMM, Helsinki, Finland and Broad Institute, Cambridge, MA, USA using Illumina HumanCoreExome-12v1, Illumina HumanCoreExome-24v1, Illumina HumanOmniExpress-12v1, Illumina Human610-Quadv1, and Illumina GSAMD-24v1-0_20011747_A1 arrays. The Health 2000/2011 cohorts were genotyped at the Sanger Institute, Hinxton, UK; FIMM, Helsinki, Finland and Broad Institute, Cambridge, MA, USA using IlluminaHuman610K and Human610-Quadv1; Illumina HumanCoreExome-24-v1; and Broad_GWAS_supplemental_15061359_A1 genotyping arrays, respectively. The FinHealth 2017 cohort was genotyped at Thermo Fisher Scientific, San Diego, CA, USA, using the Affymetrix Axiom FinnGen1 array. Before imputation, variants with call rate <0.98, HWE p < 10-6, and minor allele count (MAC) < 3 were removed. For THL cohorts, the Sisuv3 reference panel was used to impute an additional 20,175,454 genetic markers. Information on the generation of the imputation panel and the QC steps to produce the imputed genotypes is available elsewhere (https://finngen.gitbook.io/finngen-analyst-handbook/finngen-data-specifics/genotype-data/imputation-panel/sisu-v3-reference-panel).
Study of Health in PomeraniaThe Study of Health in Pomerania (SHIP) is a general-population-based research project on adult residents in northeastern Germany50. In this study, we analysed data from the SHIP-START cohort; in this cohort, at baseline, 4308 White European participants were recruited between 1997 and 2001. To date, three regular follow‐ups have been carried out (SHIP-START-1/2/3) as well as a detailed assessment of life events and mental disorders (SHIP-LEGEND) from 2007 to 2010, including 2400 participants from the baseline SHIP-START-0 cohort51.For cluster analysis, 1449 participants who took part in SHIP-START-3 and SHIP-LEGEND and had available baseline information from SHIP-START-0 were included. Regarding the age of onset of chronic diseases (Supplementary Data 2), self-reported results were used from baseline and follow-up data. The age of onset for diseases in the F section of ICD-10 codes (F32, F33, F17, F41, F43, F40, and F45) was determined from a combination of self-reported diagnoses from SHIP-LEGEND data and data from the health insurance system that have been collected since the end of 2003. Finally, information for 37 diseases was available. Hereafter, data from the SHIP-START cohort are referred to as SHIP data.
Quality control of GWAS data in the SHIP cohort
The SHIP-START-0 participants were genotyped using the Affymetrix Genome-Wide Human SNP Array 6.0. Hybridization of genomic DNA was performed in accordance with the manufacturer’s standard recommendations. Genetic data were stored using the database Caché (InterSystems). Genotypes were determined using the Birdseed2 clustering algorithm. For QC purposes, several control samples were added. At the chip level, only participants with a genotyping rate on QC probe sets (QC call rate) of at least 86% were included. Finally, all arrays had a sample call rate >92%. The overall genotyping efficiency was 98.55%. Imputation of genotypes was performed using the HRCv1.1 reference panel and the Eagle and minimac3 software implemented in the Michigan Imputation Server for prephasing and imputation, respectively. SNPs with an HWE p < 0.0001 or a call rate <0.95 as well as monomorphic SNPs were removed before imputation.
Ethics statementsThis study, including both the data collection and the current analyses, has received ethical approval from appropriate institutional review boards for all involved cohorts. Specifically, the analysis involved data from the following cohorts: UK Biobank (UKB), Catalan Health Surveillance System (CHSS), Finnish Institute for Health and Welfare (THL), FinnGen, and Study of Health in Pomerania (SHIP). Comprehensive ethical approvals were obtained for each of these cohorts, ensuring that all procedures followed were in accordance with the ethical standards of the responsible committee and with the Helsinki Declaration.Furthermore, all participants in the study provided written informed consent. Detailed information regarding the ethical approvals, including the specific committees and approval numbers, is available in the Supplementary Information.Identification of MDD-related clusters based on disease trajectoriesAssessing diseases strongly relevant to MDDWe used a Bayesian network-based Markov Chain Monte Carlo (BN-MCMC) method to assess the strongly relevant variables with respect to our target variable (MDD). Bayesian networks (BNs) use directed acyclic graphs (DAGs) to represent multivariate dependencies and conditional independencies among the variables. The nodes in these graphs represent variables, and the edges represent direct relationships between the corresponding nodes. Assessments of the complex structure of the variables are called learning the structure of the BN based on the observed data. However, in most cases, there are many DAGs with nonnegligible a posteriori probabilities (i.e., the best network has many alternatives that are almost as probable as the best network). Even in these cases, there are usually certain structural features, such as the strong relevance of two variables, which can be extracted reliably.Strongly relevant variables statistically isolate the target variable from all other variables. Therefore, strong relevance is a different concept than a standard pairwise association. First, if the dependency of disease A on the target disease B is indirect (e.g., due to mediation through a third disease C), then A and B are associated but not strongly relevant to each other. Second, if A has no direct effect on B, but A and C interact with each other to affect B (e.g., disease A does not cause disease B, and vice versa, but the presence of diseases A and B together causes C), then A is not associated with B but is strongly relevant due to the interactional effect. Therefore, strong relevance indicates either a direct/nonmediated association or an interactional relevance. Below, we refer to a variable’s probability of strong relevance with respect to MDD as the variable’s relevance score.In the Bayesian learning framework, we can estimate the posterior probability that two variables are strongly relevant to each other (i.e., they have a direct influence on each other) as follows:$$P ({{{\rm{strongly}}}}\mbox{-}{{{\rm{relevant}}}}(X,Y)|D)={\sum}_{G}{{P}}(G|{{{\rm{D}}}})\cdot {{I}}({{{{\rm{Edge}}}}}_{G}({{X}}\to {{Y}})\, or \, \\ {{{{\rm{Edge}}}}}_{G}({{Y}}\to {{X}})\,or\,\exists Z:({{{{\rm{Edge}}}}}_{{{G}}}({{Y}} \to {{Z}})\,and\,{{{{\rm{Edge}}}}}_{G}({{X}}\to {{{\rm{Z}}}})))$$
(1)
where G represents a BN structure (a graph), D is the dataset; I(.) denotes the indicator function, which is 1 if the property holds and 0 otherwise; and \({{{{\rm{Edge}}}}}_{G}\left(X\to Y\right)\) means that an edge points from node X to Y in the G graph. Specifically, the indicator function yields 1 if there is a direct edge from X to Y or from Y to X or if there is a common child node Z of nodes X and Y.Note, that in our methodology, the directed arrows in the Bayesian network represent direct probabilistic relationships between diseases. This means that the presence of one disease (e.g., MDD) directly influences the probability distribution of another. To assess the strong relevance of each disease to MDD, we focused on the concept of the Markov Boundary, which is the smallest set containing all variables carrying information about a target variable that cannot be obtained from any other variable. In other words, we cannot drop any variable from this set without losing information. By examining the diseases that are in the Markov Boundary of the target variable, we calculate the strength of their probabilistic relationship with the target variable. More specifically, within the Bayesian statistical framework employed in our study, we compute the posterior probability of each variable being within the Markov Boundary of MDD (i.e., the probability of their strong relevance with respect to MDD).We also note that in our Bayesian network framework, we focus on capturing structural probabilistic relationships between variables rather than quantifying interaction terms that occur in regression models. Although these interactions are quantified at the parametric level in Bayesian networks—through the conditional probability distributions of variables given others—our analyses primarily aimed to elucidate the structural relationships by performing exact Bayesian averaging over the parametric level rather than quantifying these interaction effects directly.It should be also noted that while the relationships in our Bayesian network are direct and unmediated by other diseases, they do not necessarily imply causation. This directness refers to the absence of intermediate variables within the network’s model structure, distinguishing these relationships from mere correlations at the abstraction level defined by the entire set of variables in the analysis. However, direct probabilistic relationships in the Bayesian network are derived from observational data, not from interventional studies that manipulate one variable to directly observe its effect on another. Without the ability to control or manipulate the conditions, the relationships might still be influenced by unobserved confounding factors. The direct relationships in the network are based on the strongest statistical dependencies observed in the data, but these dependencies alone do not fulfil all criteria required to establish causality, such as eliminating all potential confounders and demonstrating that the relationship is not reversible.The posterior probabilities \(P\left({G|D}\right)\) are estimated using a DAG-based MCMC simulation. We applied the Metropolis-coupled Markov Chain sampler with a burn-in period of 2 × 106 steps and then collected 107 samples (i.e., network structures). We restricted the space of the possible structures by limiting the number of parents per node to 8. Convergence diagnostic testing using Geweke52 scores indicated that the MCMC chains had converged for 618 out of 621 (99.5%) of the posterior probabilities of the variables’ strong relevance, with their z-scores within the acceptable range of −2 to 2, suggesting overall convergence of the chains.We modelled the participant trajectories using an inhomogeneous dynamic BN to utilize the disease onset information. More specifically, we discretized the first onset time of the diseases to cumulative time intervals ([0–20], [0–40], [0–60], and [0–70]) and transformed the disease onsets into binary variables that had a value of 1 if the disease was diagnosed in the given time interval and 0 otherwise (Fig. 1B). The dynamic BN accounts for censoring of the participants by including only those participants in each cumulative time interval who have complete disease onset information up to the end of that interval. Then, we separately estimated the strong relevance of all variables with respect to either F32 or F33 (ICD-10 disease categories jointly defining MDD) for each time interval t. The variables from time interval t-1 were also included in the model, but variables in t − 1 could only act as parent nodes, i.e., no edge could point to a variable in t − 1. See Fig. 1C for a graphical illustration of this method.Cross-cohort disease categories and relevance scoresAs a preliminary step, we determined the set of cross-cohort disease variables as follows. (1) First, for each cohort (UKB, CHSS, and THL), we filtered diseases (according to three-character ICD-10 disease categories) with a prevalence >1% either in the whole cohort or in the subset of depressed participants (i.e., patients diagnosed with either F32 or F33). The primary objective of this pre-filtering step was to exclude rare disorders, as our goal was to identify general multimorbidity trajectory clusters that are broadly applicable. Consequently, this initial filtering led to differing numbers of diseases being considered across each cohort, namely 266 disorders for UKB, 356 for CHSS, and 339 for THL. (2) Next, we estimated the strong relevance of all such diseases with respect to MDD by learning the structural features of the formerly described inhomogeneous dynamic BN. We performed this analysis separately for each cohort. (3) Finally, we selected diseases that had a posterior probability of strong relevance with respect to MDD higher than 0.5 in at least one time interval for at least one cohort, selecting only those disease variables that were consistently available across all cohorts, allowing for uniform analysis across all datasets. This preliminary filtering procedure aimed to gather the broadest possible set of potentially relevant diseases, resulting in 86 cross-cohort disease categories. The prevalence rates and summary statistics of the first onsets of these cross-cohort disease categories are shown in Supplementary Data 2 for each cohort.Finally, we performed the same analyses using only the cross-cohort disease variables together with the sex and household income status variables of the samples. This resulted in cohort-specific relevance scores for each variable, from which we defined cross-cohort relevance scores by computing a linear combination of the cohort-specific relevance scores for each time interval by applying uniform weights on the cohorts. The cross-cohort relevance scores are shown in Supplementary Data 3. Computed in this way, the cross-cohort relevance score of a variable corresponds to the expected probability that the variable is strongly relevant with respect to MDD in a given time interval.Clustering of participantsBased on the cross-cohort relevance scores, we computed the weighted direct MDD-related multimorbidity scores for each participant in each cohort and for each time interval. The score for the ith participant in the tth time interval is computed as follows:$${{{\rm{multimorbidity}}}}\mbox{-}{{{{\rm{score}}}}}^{(t)}(i)=\! {\sum}_{d}I({{{\rm{onset}}}}(i,\, d){\epsilon }t)\times {{{\rm{relevance}}}}\mbox{-}{{{{\rm{score}}}}}^{(t)}(d)$$
(2)
where d represents the cross-cohort diseases, I(.) denotes an indicator function that yields 1 if the first onset of disease d for the i–th sample occurs in the t–th time interval and 0 otherwise; and relevance‑score(t)(d) denotes the cross-cohort relevance score of disease d in the tth time interval. These weighted direct MDD-related multimorbidity scores defined the 4-dimensional space of the samples that we used to cluster the participants.Finally, we clustered all participants &&using the k-means clustering algorithm in the 4-dimensional space defined by the weighted direct MDD-related multimorbidity scores. More specifically, the clusters were determined based on the participants with complete observed multimorbidity scores, i.e., participants older than 70 years. In younger participants, one or more multimorbidity scores were not available because there were no observations of their future disease onset. However, based on their partial scores, they were assigned to clusters by allocating them to the cluster with the nearest cluster centre. The number of clusters was determined by manual investigations based on expert knowledge with the help of various cluster metrics (such as the silhouette score of the resulting clusters). See Fig. 1 for a graphical overview of the method and Supplementary Methods for further details on the investigated cluster configurations.The likelihood of cluster membership for the i–th sample and for the j–th cluster is defined as:$${{{{\rm{likelihood}}}}}_{j}(i)=\exp \left(-{||}{p}_{i}-{c}_{j}{||}\right)$$
(3)
where \({p}_{i}\) and \({c}_{j}\) represent the point that correspond to the ith sample and the j–th cluster’s cluster centre, respectively, in the space defined by the multimorbidity scores, and \({{||p}}_{i}-{c}_{j}{||}\) is their Euclidean distance.The posterior probability of cluster membership for the ith sample and the jth cluster was the normalized likelihood shown below:$${P}_{j}\left(i\right)=\frac{\exp \left(-{{{\rm{||}}}}{p}_{i}-{c}_{j}{{{\rm{||}}}}\right)}{{\sum }_{k}\exp \left(-{{{\rm{||}}}}{p}_{i}-{c}_{k}{||}\right)}$$
(4)
To control for uncertain participant trajectories in the following analyses, we excluded participants for whom the clustering algorithm demonstrated low confidence across all clusters. Specifically, we excluded participants who were both under 60 years of age and whose maximum posterior membership probability did not exceed 0.25 for any cluster. This threshold was chosen to remove individuals for whom the algorithm could not confidently assign a predominant cluster, thereby focusing our analysis on participants with more definitive cluster memberships. This exclusion criterion resulted in a subset of N = 364,008 participants. This subset was used for comparing clusters and deriving age-specific differences and was also the base set for genetic analysis.In the case of GWAS and non-genetic risk-factor profiling analysis, the posterior log-odds of the cluster memberships were used as target variables as follows:$${{{{\rm{Posterior}}}} \, {{\mathrm{log\; odds}}}}_{j}\left(i\right)={{{\rm{ln}}}}\frac{{P}_{j}\left(i\right)}{1-{P}_{j}\left(i\right)}$$
(5)
The posterior probability of cluster membership was used to calculate the disease profile of the clusters.Our methodology employs a privacy-preserving federated approach to derive the MDD-related clusters across multiple cohorts without sharing individual-level data, making it suitable for collaborative studies where data sharing is restricted. Each participating site independently computes relevance scores for diseases, which are then aggregated to create cross-cohort relevance scores, ensuring that only non-identifiable, summarized information is exchanged between sites. Multimorbidity scores for each participant are calculated by aggregating cross-cohort relevance scores for the diseases they have experienced (see Eq. (2)). These scores are then compiled into counts of occurrences at each site. The final clustering is performed using these aggregated counts, thereby ensuring the confidentiality of individual data throughout the process.SoftwareInference over Bayesian network structures was performed with an in-house developed software called BN-BMLA53. All other computations were performed in R statistical software (version 4.1.1)54 or Python (version 3.8). Clusters can be computed with a command line R script that is available online: https://github.com/gezsi/mdd-clustering.Disease profile of MDD-related clustersWe used weighted Cox regression to determine the disease outcomes in the various clusters (i.e., the hazard ratio of cluster membership regarding disease occurrence) independently for each cohort. Specifically, for each cluster, we constructed Cox proportional hazard models, where the independent variable for a specific individual was a dummy variable created in the following way. We counted each participant twice, summing to a weight of 1. First, we set the value of the dummy variable to 1 and weighted this sample by the posterior probability of cluster membership. Next, we set the value of the dummy variable to 0 and used weight for this sample equal to 1 minus the probability of cluster membership. The covariates were sex, household income (if available), and the normalized birth year (in the case of the UKB cohort). The dependent variable was disease onset. Participants were right censored for a given target disease at their age if the disease was not diagnosed. We calculated separate models for each cross-cohort disease. P-values of the cluster membership variables were adjusted separately for each cohort using the Benjamini‒Hochberg method.In addition, we calculated weighted Kaplan‒Meier estimates of MDD-free survival in the various clusters in each cohort. We weighted each participant in each cluster with the corresponding posterior probability of cluster membership.Genetic analysesGenome-wide association studyFollowing site-specific QC measures (see cohort descriptions), Plink 2.055 (https://www.cog-genomics.org/plink/2.0/) was used to perform linear regression models to assess the direct effect of each remaining genetic variant on the seven MDD-related clusters that reflected the posterior log-odds of cluster membership. All analyses were adjusted for age, sex, the first ten genetic principal components and site-specific variables (genotyping array in the UKB cohort, geographical region in the THL cohorts). Age was included in the model as a nonlinear variable using cubic splines with knots at ages of 40 and 60 years (R package splines v4.1.1, function bs). In particular, because of the age span of participants, only the knot at 60 years could be applied in the UKB cohort. Continuous predictor and outcome variables were standardized in the analyses. We employed additive genetic models to assess the contribution of individual genotypes to the dependent variable. We excluded individuals for whom the clustering algorithm lacked sufficient confidence in assigning a predominant cluster, thus concentrating our analysis on those with more clearly defined cluster memberships. More specifically, participants under 60 years of age with a maximum posterior membership probability of no more than 0.25 for any cluster were excluded. Genetic data were available in the UKB (N = 249,167) and THL (N = 23,786) cohorts, which were both used to generate the MDD-related clusters. Additionally, genetic data were available in two completely independent cohorts (FinnGen, N = 277,252 and SHIP, N = 1126). We therefore treated the UKB sample as the discovery sample and the latter samples as replication samples. For replication of GWAS loci, a nominal significance level (p < 0.05) was assumed.The FinnGen GWAS analyses were performed with Regenie56 (version 2.2.4) instead of PLINK 2.0 due to computational constraints in the FinnGen Sandbox pipeline. The parameters for the Regenie analyses were as follows: step 1, bsize 1000; step 2, –bsize 200, –bt false, –apply-rint false, –firth, –approx, –pThresh 0.01, –test additive and –firth-se.Additional GWAS analysis of the UKB cohort using logistic regression analysis of the binarized presence/absence of disease onset was performed to compare the results of MDD-related multimorbidity clusters to the genetic results of all 86 cross-cohort disease categories used to inform the clusters. All filters and settings were the same as for the cluster membership analysis in the UKB cohort detailed above.Post-GWAS analysisTo assess the impact of SNP results on biological processes, several post-GWAS tools were applied that extract information regarding significant loci, genes, and pathways and report genetic correlations with other phenotypes of interest based on their GWAS summary statistics.The GWAS summary statistics for all seven MDD-related clusters for each cohort were first processed with FUMA57 to identify lead SNPs and significant loci. The maximum p-value of lead SNPs was set to 5 × 10-8, r2 ≥ 0.6 was set as the threshold for independent significant SNPs, and the maximum distance between LD blocks of independent significant SNPs was set to 250 kb. Furthermore, MAGMA (v1.10)58 gene-level analysis was performed to identify putative significant genes using a SNPwise-multi model. We defined the SNP set of each gene including ±10 kb downstream or upstream of the gene, respectively. We used the 1000 Genomes European panel data to evaluate the LD between SNPs. We employed Holm’s correction method to adjust the p-values of the genes. We assessed the significance of the over-enrichment of MDD-associated genes (identified by Howard et al. 5) within the genes associated with each cluster by conducting one-sided hypergeometric tests to evaluate whether the association between the cluster genes and MDD genes was stronger than expected by chance. Additionally, we employed Gene Set Enrichment Analysis (GSEA) using the fgsea R package (v1.18)59 to assess the significance of enrichment of these MDD-associated genes across the clusters. To account for multiple comparisons, we applied Holm’s correction method to adjust the p-values derived from these tests. In addition, we used the g:Profiler R package (v0.2.3, database version: e110_eg57_p18_4b54a898)60 for functional enrichment analysis of each cluster’s sets of significant genes. We used Gene Ontology (excluding IEA evidence codes) and KEGG biological pathway data sources. We applied the g:SCS method for p-value adjustment, and the p < 0.01 threshold was used to indicate statistical significance. Using the variety of analysis tools included in the Complex-Traits Genetics Virtual Lab61 (CTG-VL; https://genoma.io), we additionally assessed the genetic heritability of the clusters, genetic correlations among clusters and genetic correlations with other phenotypes using the LD score regression method (LDSC v1.01)62. Genetic correlation is a quantitative statistical parameter reflecting the genetic relationship between two traits. This measure can reflect the pleiotropic action of genes or the correlation between causal loci in two traits, which is especially important for polygenic traits.Polygenic risk scoresA polygenic risk score (PRS) is a genetic measurement that sums an individual’s risk-conferring alleles weighted by their estimated effect size for a specific phenotype or disease. The PRS employed in this study was calculated using PRS-CS (v1.0.0), a method that utilizes a high-dimensional Bayesian regression framework and places a continuous shrinkage (CS) prior on SNP effect sizes using GWAS summary statistics and an external linkage disequilibrium (LD) reference panel63. Here, the original effect sizes were taken from the UKB GWAS on cluster membership for all seven clusters. The LD reference panel was constructed using a European subsample of the UK Biobank44. For the remaining parameters, the default options implemented in PRS-CS were adopted. The PRSs for membership in Clusters 1–7 were calculated in the GWAS samples of the THL and SHIP cohorts. PRSs in the SHIP cohort were correlated with the cluster probabilities, whereas in the THL cohorts, due to the larger sample size, regression analyses between two factors could be performed adjusted for age, sex, batch, region and cohort (Supplementary Data 1).Network-based analysis of pleiotropyWe assessed pleiotropy among the clusters and MDD at the level of functional modules by the following procedure. First, we defined an initial evidence score for the top genes in each cluster as their negative log-transformed adjusted p-value. Next, we applied the Personalized PageRank (also known as Random Walk with Restart, RWR) network propagation algorithm using the interactome network based on STRING (https://string-db.org/; v11.5, filtered to high confidence edges with a combined score cut-off >0.7) to score all protein-coding genes initialized from the top genes of each cluster. Then, we selected the highest-scoring genes in the resulting score rankings by the Kneedle algorithm64 and identified their nonoverlapping modules using spectral clustering. Next, we similarly initiated network propagation from seed genes significantly associated with MDD based on the gene-level MAGMA analysis results of Howard et al. 5. Finally, we determined those cluster-specific modules where the propagated MDD scores’ sum was statistically significantly higher than according to a null model based on degree-aware permutations of the seed genes in the network. In all RWR experiments, we used a random restart probability of 0.5; however, in accordance with other studies65, the results were not sensitive to the parameter change.Non-genetic risk-factor profiles of MDD-related clustersLifestyle and physiological risk factors play a fundamental role in the probability of lifetime MDD. Therefore, it was essential to determine whether one or more of the clusters had a defining risk-factor profile that allowed it to be identified. Apart from basic descriptors such as age, sex, income, and qualifications (Table 1), lifestyle-related factors such as present and past smoking habits and alcohol intake and physiological descriptors such as body mass index, systolic and diastolic blood pressure, pulse rate, C-reactive protein level, and presence of insomnia were investigated. Additionally, neuroticism (as a common personality trait with depression), life stress (as a representation of major negative life events), and current depression score were used as psychological factors to characterize clusters (for availability and descriptive statistics of these descriptors in the individual cohorts, see Supplementary Data 1A).Linear regression models were constructed for each cluster with Python 3.8 using the statsmodels package (v0.13.1). The corresponding posterior probability log-odds for a given cluster was used as the dependent variable. Two types of regression models were generated: (1) “simple regression models” including one risk factor at a time with age and sex as covariates and (2) “complex regression models” including all available risk factors simultaneously in a single model. Simple regression models allow us to explore the individual effect of each factor, whereas the complex model enables the analysis of joint effects and other multivariate aspects. In the case of complex regression models, a k-nearest neighbour-based imputation method66 was applied to compute missing values, whereas in the case of simple regression models, only complete samples were used.Clustering with a subset of diseasesA natural question that arises is how accurate the clustering will be if not all cross-cohort diseases are available. To assess cluster analysis performance in various limited disease subsets, we recalculated the cluster membership based on various disease subsets of a given size, increasing from only one disease to all cross-cohort diseases: (1) randomly selected diseases; (2) greedily selected, increasingly expanded sets of diseases; (3) a null model based on random cluster membership probabilities; and (4) another null model defined by uniform cluster membership probabilities. We calculated four performance measures, namely, the accuracy and balanced accuracy of the hard clustering (i.e., assigning each individual to the cluster in which its membership probability is highest) and the mean absolute error and mean squared error of the posterior probabilities of cluster membership, averaged over 10,000 random individuals from the UKB cohort selected a hundred times. The greedy variable selection method was performed as follows. First, we selected a single disease for which the accuracy of cluster analysis (compared to the original clustering of the samples using all cross-cohort diseases) was the highest. Next, we selected the disease from the remaining set of diseases that had the highest accuracy along with the first disease. We began to expand this set, always adding the disease for which the increase in accuracy was the highest. Note that this procedure may result in a suboptimal choice of the best-performing disease subset of a given size. The results for this clustering procedure are provided in Supplementary Fig. S22.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles