Genome-wide large-scale multi-trait analysis characterizes global patterns of pleiotropy and unique trait-specific variants

Data and initial filteringWe collect 338 full summary level datasets published between 2007 and 2019 from the NIH Genome-Wide Repository of Associations Between SNPs and Phenotypes (GRASP). GRASP includes a wide range of phenotypes including anthropometric traits, biomarkers, blood cell levels, adipose volume, early growth traits, social science Indices, cardiometabolic diseases, psychiatric and neurological diseases, autoimmune diseases, etc. We further collect 20 summary-level datasets from GWAS Consortia for the traits included in GRASP but when the Consortia offer a study with larger sample sizes. See Supplementary Data 1 and 2 for a full list of datasets. Thus restricting the analysis to the GRASP traits allowed us to carry out the pleiotropic analysis of across a diverse set of traits with large available GWAS, but avoid including many overlapping traits such as those in UK Biobank.We apply several filtering steps sequentially (see Supplementary Fig. 1 for a flow chart): 1) remove datasets in which the genetics variants do not have genome-wide coverage (genotyped by exome array, metabochip and immunochip); 2) remove datasets that only report p-values without direction of effect; 3) remove studies with small sample size: for continuous traits, we remove studies with sample size <5,000; for binary traits, we remove studies with <5000 cases or <5000 controls; 4) remove studies where >25% individuals are of non-European ancestry; 5) remove duplicated traits: among the studies for the same trait, we keep the study with the largest sample size and discard the rest of them; 6) remove traits that are deterministic functions of other traits in our study, or other traits adjusted of covariates. After the filtering, 150 traits entered downstream analysis (see Supplementary Data 2 for all traits removed from analysis).LD score regression and further filteringWe applied LD score regression3,55 to estimate the heritability and genetic correlation across 150 traits. LD scores based on 1000 Genomes European data and the list of HapMap3 SNPs were downloaded from the LDSC GitHub repository (https://github.com/bulik/ldsc). Only SNPs in HapMap3 were used to calculate the LD score and perform the regression.To ensure the traits have a substantial genetic component, we remove 21 traits that do not have a heritability estimate that is significantly different from 0 (z statistic \(\, > \, 1.96\)). To further reduce the genetic overlaps of the traits, we remove 13 traits that have high genetic correlation (\({r}_{g}\)) with others. In short, if two traits have genetic correlation \(|{r}_{g}| \, > \, 0.95\), we remove the trait less enriched of genetic associations, quantified by the product of the sample size of the study (\(N\)) and the heritability of the trait (\({h}^{2}\)). The algorithm is as follows:

1.

Sort the traits in decreasing order of \(N{h}^{2}\).

2.

Start from the trait with the highest \(N{h}^{2}\), remove the traits that have \(|{r}_{g}| \, > \, 0.95\) with this trait.

3.

Proceed to the trait with the next highest \(N{h}^{2}\), repeat until no pairs of traits have \(|{r}_{g}| \, > \, 0.95\).

After the above filtering steps, we retain a final list of 116 traits for statistical analysis (Supplementary Data 1). We further restricted to the variants that are available for \(\ge 50\) traits, and in 1000 Genomes Phase 3 European sample with minor allele frequency (MAF) \(\, > \, 0.01\). This leads to a total of 7,462,466 variants. Since our goal is not to identify causal variants, the discrepancy of reference panels across studies should not have a major impact.Statistical Analysis Using fastASSETThe first step to study the patterns of pleiotropy is to quantify the level of pleiotropy across the genome. Previous studies often have quantified the pleiotropy of a SNP by counting the number of traits that reach \(p \, < \, 5\times {10}^{-8}\) in individual trait analysis. This approach, however, is likely to miss many weaker associations. Here we described fastASSET, an accelerated version of the ASSET method which 1) detects SNPs associated with any trait in our collection (“global association”) and 2) reports a subset of traits associated with each SNP.The original ASSET is developed based on meta-analysis across different phenotypes and searches for the subset of traits with maximum meta-analysis z-statistic7. However, since we are analyzing a large number of traits, the orignal ASSET which searches through all subsets is computationally intractable. To reduce the computational burden, we pre-select the traits that show suggestive evidence of association with the given SNP using a liberal p-value threshold, denoted by \({p}_{{thr}}\) (e.g. 0.1 or 0.05). We only conduct subset search among the traits with p-value \( < \, {p}_{{thr}}\). Denote the corresponding threshold for z-statistic by \({z}_{{thr}}={\Phi }^{-1}(1-\frac{{p}_{{thr}}}{2})\).Before the subset search, the z-statistics \({z}_{k}\) (\(k=1,\ldots,K\) traits) need to be adjusted for pre-selection to prevent type I error inflation. If the SNP is not associated with any of the given traits and none of the studies have overlapping samples, \({z}_{k}\)’s are independent across studies and can be adjusted independently. The adjusted p-value is$${\widetilde{p}}_{k}=P \left(|Z| \, > \, |{z}_{k}|\big||Z| \, > \, {z}_{{thr}}\right)=\frac{P\left(|Z| \, > \, |{z}_{k}|,\,|Z| \, > \, {z}_{{thr}}\right)}{P\left(|Z| \, > \, {z}_{{thr}}\right)}\\=\frac{P\left(|Z| \, > \, |{z}_{k}|,\,|Z| \, > \, {z}_{{thr}}\right)}{{p}_{{thr}}}$$
(1)
which is equal to \(\frac{P\left(\left|Z\right| \, > \, \left|{z}_{k}\right|\right)}{{p}_{{thr}}}\) if \(|{z}_{k}| \, > \, {z}_{{thr}}\), and equal to 1 if \(\left|{z}_{k}\right|\le {z}_{{thr}}\). Hence the adjusted z-statistic is \({\widetilde{z}}_{k}={{\rm{sign}}}\left({z}_{k}\right)\,{\Phi }^{-1}(1-\frac{{\widetilde{p}}_{k}}{2})\).However, the data used for our study have complex sample overlaps, hence \({\widetilde{z}}_{k}\)’s of overlapping studies are no longer independent. To apply the above adjustment, we first de-correlate the z-statistics. The correlation of z-statistics due to sample overlap can be estimated using bivariate LD score regression3:$$E\left[{z}_{1j}{z}_{2j}\, \big|\,{l}_{j}\right]=\frac{\sqrt{{N}_{1}{N}_{2}}{\rho }_{g}}{M}{l}_{j}+{\rho }_{12}^{(z)}$$
(2)
where \({\rho }_{12}^{(z)}\) is the correlation of \({z}_{1j}\) and \({z}_{2j}\) under the null hypothesis and \({l}_{j}\) is the LD score. Since sample overlap usually occurs between traits in the same study or consortium, the correlation matrix of z-statistics \({{{\boldsymbol{\rho }}}}^{(z)}={\left\{{\rho }_{{kl}}^{\left(z\right)}\right\}}_{k,l=1,\ldots,K}\) roughly follows a block diagonal structure. We first partition the traits into blocks using hierarchical clustering based on distance matrix \(1-|{{{\boldsymbol{\rho }}}}^{\left(z\right)}|\). We cut the hierarchical clustering tree at 0.8 such that \(\left|{\rho }_{{kl}}^{\left(z\right)}\right| \, < \, 0.2\) if studies \(k\) and \(l\) are in different clusters. We de-correlate the z-statistics within each cluster and ignore between-cluster correlation.Within each cluster, we first order the studies by effective sample size from smallest to largest. For continuous traits, the effective sample size is the total sample size; for binary traits, the effective sample size is defined as \(\frac{{N}_{{case}}{N}_{{control}}}{{N}_{{case}}+{N}_{{control}}}\). Denote by \({{{\boldsymbol{\rho }}}}^{(z,t)}\) the LD score regression intercept matrix of the traits in cluster \(t\). Denote by vector \({{{\bf{Z}}}}^{(t)}\) the z-statistics of traits in cluster \(t\), and by \({{{\bf{N}}}}^{(t)}\) the effective sample size. If trait \(k\) is continuous, \({N}_{k}^{(t)}\) is the total sample size; if trait \(k\) is case-control, \({N}_{k}^{(t)}=\frac{{n}_{{case}}{n}_{{control}}}{{n}_{{case}}+{n}_{{control}}}\). Define \({{{\bf{S}}}}^{(t)}=\frac{1}{\,\sqrt{{{{\bf{N}}}}^{(t)}\,}}\) as the standard error when genotype and traits are standardized to have unit variance. The de-correlation and adjustment algorithm proceeds as follows:

1.

Apply Cholesky decomposition to \({{{\boldsymbol{\rho }}}}^{(z,t)}={U}^{T}U\).

2.

De-correlated the z-statistics by \({\widetilde{{{\bf{Z}}}}}^{(t)}={({U}^{T})}^{-1}{{{\bf{Z}}}}^{(t)}\). Select the traits with \(\left|{\widetilde{z}}_{k}^{\left(t\right)}\right| \, > \, {z}_{{thr}}\) and adjust the z-statistics independently for each trait using the conditional p-value approach described by Eq. (1). Denote the adjusted z-statistics by \({{\widetilde{{{\bf{Z}}}}}_{{adj}}}^{(t)}\).

3.

Convert the z-statistics back to the original scale by \({{{{\bf{Z}}}}_{{adj}}}^{(t)}={U}^{T}{{\widetilde{{{\bf{Z}}}}}_{{adj}}}^{(t)}\). Leave the standard errors \({{{\bf{S}}}}^{(t)}\) unchanged. Retain the elements of \({{{{\bf{Z}}}}_{{adj}}}^{(t)}\) and \({{{\bf{S}}}}^{(t)}\) corresponding to the selected traits and denote them as \({{{\bf{Z}}}}_{{scr}}^{(t)}\) and \({{{\bf{S}}}}_{{scr}}^{(t)}\).

Finally, we combine the adjusted z-statistics and standard errors (\({{{\bf{Z}}}}_{{scr}}^{(t)}\) and \({{{\bf{S}}}}_{{scr}}^{(t)},{t}=1,\ldots,T\)) for the selected traits from all clusters. The combined summary statistics are used as input for ASSET analysis (see Supplementary Notes for a brief description of ASSET)7.Analysis of data for 116 traitsSince different studies are relatively independent, LD score regression3 analysis did not reflect substantial sample overlap across different studies. The clusters, based on sample overlap and phenotypic correlation are generally small and restricted to traits within the same study (Supplementary Fig. 2). For a small proportion of the SNPs, a large number of traits pass the pre-selection threshold (single-trait p-value < 0.05) which makes the subsequent subset search computationally intractable. We removed a total of 16,686 SNPs that were associated with more than 16 traits in one direction (either positive or negative). Under the global null hypothesis of no association, the probability of observing such pattern is small (\(1.13\times {10}^{-8}\)) and thus we consider these SNPs to be pleiotropic though they are not further analyzed by ASSET. In fact, the majority of these SNPs can be tagged by SNPs that are analyzed by ASSET through LD (r2 > 0.5) and hence removing them does not lead to significant loss of information, except 309 SNPs from 74 independent regions (r2 < 0.1, > 500 kb apart). We choose the SNP with the largest number of traits that pass p < 0.05 threshold as the index SNP for each locus. As a sensitivity analysis, we allocate additional computational resources to run fastASSET analysis for the 74 lead SNPs. These SNPs are enriched in pleiotropic SNPs, but some of the SNPs are associated with only a few traits (Supplementary Fig. 10). Due to their small number, they should not have a major impact on the characterization of pleiotropy.Among the SNPs we analyzed by ASSET, we further removed 2620 SNPs for which one of the one-sided ASSET p-values is smaller than the p-values from standard meta-analysis of the selected traits, indicating convergence issues in the underlying p-value approximation method.For each of the remaining 7,443,160 SNPs, fastASSET reports a p-value for global association and a set of traits associated with the SNP. For each SNP, we use the number of its associated traits reported by fastASSET as the metric of pleiotropy. We consider SNPs with fastASSET p-value \(p \, < \, 5\times {10}^{-8}\) as genome-wide significant. LD clumping leads to 10,628 independently associated SNPs with \({r}^{2} \, < \, 0.1\). We then group these SNPs into 2293 independent loci of which the lead SNP (lowest p-value within the locus) are at least 500 kb apart. As a metric of signal density within each locus, we count the number of independently associated SNPs within 100 kb of the lead SNP.We are especially interested in trait-specific SNPs (associated with only 1 trait as identified by fastASSET) and secondarily highly pleiotropic variants with associated with >15 traits.Validation in UK BiobankWe download the UK Biobank (UKB) GWAS summary statistics from Neale lab34. Among the 2293 index SNPs of the independent loci identified by fastASSET, 6 SNPs are not included in the UK Biobank summary statistics. For each of those SNPs, we identify a proxy SNP in the UKB summary statistics as the one with the highest \({r}^{2}\) within 100 kb of the lead SNP. If \({r}^{2} \, < \, 0.8\) between the lead SNP and the proxy SNP, we exclude the locus from the validation study in UKB. We successfully found proxies for 5 index SNPs and one other (rs12203592) does not have a proxy SNP with \({r}^{2}\ge 0.8\).Among the 11,934 summary level datasets published by Neale lab (round 2 GWAS, August 1st, 2018), we removed the GWAS results for age, sex and 22 datasets that do not have a phenotype code. For continuous traits, we use the summary dataset for the inverse rank normalized trait (variable type continuous_irnt) and discard the dataset for the raw trait (variable type continuous_raw). We only keep the datasets for joint analyzes across both sexes and discard the sex-stratified results. In addition, there are two versions of GWASs for a subset of 166 traits. We keep the most recent version (v2) and remove the first version. After filtering, we retain 4114 summary level datasets for our validation study. For each the 2292 index SNPs, we select the associated traits among the remaining 4114 UKB traits by a per-SNP FDR threshold of 0.05. Note that we are not able to conduct an independent validation study due to substantial sample overlap between our primary discovery data and UKB. Instead, we use UKB data to test the generalizability of the estimated degree of pleiotropy from 116 traits to a larger number of traits.Relationship between pleiotropy and LDTo study the relationship between pleiotropy and LD, we estimate LD scores from 1000 Genomes Phase 3 EUR sample using SNPs with MAF > 0.01 and 1 centiMorgan (cM) window. The calculation is performed using ldsc software55. Note that here we choose to estimate LD scores using all the SNPs in our analysis instead of using those downloaded from LDSC, which include only HapMap3 SNPs.eQTL status lookup and colocalizationTo explore potential cis-regulatory mechanisms that may drive pleiotropy, we look up the lead SNPs of the 2293 loci identified by fastASSET to explore the relationship between the pleiotropy and eQTL tissue/gene specificity. We accessed eQTL summary statistics for 49 tissues in GTEx v839. We lift over the base pair coordinates from hg19 to hg38 to match the genome build of GTEx v8. For each of the lead variant, we count the number of tissues in which it is a significant eQTL for at least one gene with q-value < 0.05, and the total number of unique genes for which it is a significant eQTL regardless of tissues.For each trait-specific locus, we conduct colocalization analysis between the GWAS signal for the corresponding trait and eQTL signals in GTEx v8. We restrict to protein-coding genes and gene-tissue pairs for which the lead SNP of the locus is a significant eQTL at q-value < 0.05. We use the SNPs within 50 kb from the lead SNP as input to COLOC47. We consider the GWAS and eQTL signals to be colocalized if the posterior probability of colocalization (PP4) > 0.8.Chromatin stateTo explore potential relationship between pleiotropy and chromatin states, we queried their chromatin state using 15-state chromHMM model56 in the Haplogreg v4.1 database57. The chromatin state was learned using a core set of 5 histone marks (H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3) for 111 epigenomes from the Roadmap Epigenomics Project48. We defined chromatin states with state number <=7 as an active and open chromatin state, including active transcription start site (1_TssA), flanking active TSS (2_ TssAFlnk), transcription at gene 5′ and 3′ (3_TxFlnk), strong transcription (4_Tx), weak transcription (5_TxWk), genic enhancers (6_EnhG) and enhancers (7_Enh). For each variant, we count the number of datasets in Roadmap whether it falls in active chromatin state as well as the number of broad categories provided by Roadmap (https://egg2.wustl.edu/roadmap/web_portal/meta.html). We also study and evaluate the association of pleiotropy with each active chromatin state separately. For example, we select the lead SNPs identified by fastASSET that are 1_TssA in at least one Roadmap dataset and investigate the relationship between the number of traits selected by fastASSET and the number of tissue/cell type groups in which this variant is in 1_TssA. We perform same analysis for other active chromatin states to learn the association. The tissue and cell type groups are defined by the Roadmap Epigenomics Consortium as provided in the link: (https://docs.google.com/spreadsheets/d/1yikGx4MsO9Ei36b64yOy9Vb6oPC5IBGlFbYEt-N6gOM/edit#gid=15).In addition to chromHMM based model, we conduct similar analysis using the chromatin states learned by the IDEAS, an integrative and discriminative epigenome annotation system employing 2D genome segmentation method35,36 to jointly characterize the chromatin states across many different cell types. Since the IDEAS classification of chromatin states is marginally different from ChromHMM, we consider 10_TssA (Active Transcription Start Site), 8_TssAFlnk (Flanking Active TSS), 14_TssWk (Weak TSS), 5_Tx (Strong Transcription), 2_TxWk (Weak Transcription), 4_Enh (Enhancers), 6_EnhG (Genic Enhancers), 17_EnhGA (Active Genic Enhancers) as an active and open chromatin state. IDEAS algorithm here integrates epigenomes by preserving the position-dependency and cell type specific epigenetic events at fine scales36.The IDEAS bigBed files were downloaded from http://bx.psu.edu/~yuzhang/Roadmap_ideas/.The bigBed files were converted to bed format using UCSC program bigBedToBed program fetched from the directory of binary utilities in http://hgdownload.cse.ucsc.edu/admin/exe/.The bed files thus processed were used for further downstream analysis, including the association of pleiotropy and epigenetic states. Blacklisted genomic region was filtered out where applicable as provided by ENCODE.Transcription factor binding sitesTo understand potential relationship between pleiotropy transcription factor (TF) binding, we downloaded and referenced the JASPAR 201837 and HOCOMOCO V1138, homosapiens comprehensive model collection of motif database. Intersection of variants with TF binding sites was performed by Bedtools v2.29.258 to observe the association between SNPs and transcription factor binding sites.To associate level of pleiotropy with TF binding profiles, we divide the 2,293 lead SNPs into two pleiotropy bins: 1) associated with 1–10 traits 2) associated with >10 traits. We calculate the proportion of SNPs in each bin that overlaps with transcription binding sites (TFBS). To explore whether this relationship is independent of the effects of other annotations, we fit the following logistic regression: (associated with >10 traits) ~(overlapping with TFBS) + (LD score) + (number of tissues for which the variant is an eQTL) + (number of eGenes) + (number of cell types for which the variant is in active chromatin state) and check the significance of the first predictor.Enhancer-gene connectionThe activity-by-contact (ABC) model combines chromatin states with 3-dimensional contacts to map enhancers to their target genes59,60. We download the enhancer-gene map for 131 human cell types and tissues constructed by the ABC model from the Engreitz lab website (https://www.engreitzlab.org/resources/). This dataset includes all enhancer-gene connections with ABC scores >= 0.015.Among the lead SNPs that overlap with enhancers, we compute the correlation between the level of pleiotropy and the number of cell types for which the overlapping enhancer affects at least one target gene by the ABC model. We further compute the partial correlation adjusting for other annotations: LD score, number of tissues for which the variant is an eQTL, number of eGenes, number of cell types for which the variant is in active chromatin state.Matching trait-specific SNPs to highly pleiotropic SNPsIn the functional follow-up studies of 21 trait-specific SNPs, we match each of them to one SNP that is also associated with the given trait but show high-degree of pleiotropy (associated with >15 traits). For each trait-specific SNP (e.g. for Crohn’s disease), we examine the individual-trait p-value between the trait (e.g. Crohn’s disease) and all the highly pleiotropic SNPs and choose the one that shows the strongest association (lowest p-values) with the given trait (e.g Crohn’s disease). In this procedure, multiple trait-specific SNPs for one trait could be matched to the same highly pleiotropic SNP. This matching procedure applies to the examination of top 10 associated traits for lead SNPs (Fig. 5), eQTL analysis (Fig. 6) and chromatin state analysis (Fig. 7).Analysis of Biobank Japan dataWe download the summary statistics for 220 phenotypes in Biobank Japan (BBJ)22. We run LD score regression3 to estimate the genetic correlation (slope) and phenotypic correlation (intercept) of these traits. LD scores of East Asian population are obtained from GitHub repository ldsc (https://github.com/bulik/ldsc). We apply similar quality control pipeline as in the analysis of European data: 1) remove continuous traits with total sample size <5,000 or case-control disease traits with <2,000 cases or <2,000 controls; 2) remove traits with heritability z-score <1.96; 3) remove highly correlated traits with genetic correlation > 0.95 or < − 0.95; 4) remove SNPs with MAF < 0.01 in BBJ. After filtering, we retain 75 traits and 4,965,789 SNPs for fastASSET analysis. Genome-wide significant SNPs are defined as those with fastASSET global association p-value \(\, < \, 5\times {10}^{-8}\). We conduct LD clumping based on fastASSET p-values for global association using genotype data of 1000 Genomes Phase 3 East Asian population (N = 504) as the reference panel. We require the lead SNPs of different loci to be nearly independent (\({r}^{2} \, < \, 0.1\)) and at least 500 kb apart. To compare patterns of pleiotropy between East Asian and European ancestry, we obtain the summary statistics for European ancestry GWAS used in Sakaue et al.22 (UK Biobank and FinnGen). We successfully match 72 (out of 75) traits from BBJ in our analysis to traits in UK Biobank and FinnGen.Inclusion & ethics statementThis study uses publicly available data. Ethics approvals were obtained by the original studies that generated the data.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles