CoPheScan: phenome-wide association studies accounting for linkage disequilibrium

Overview of CoPheScanCoPheScan is an adaptation of the coloc9,10,11 approach, for the case where a variant known to be causal either through fine-mapping or functional studies, is subjected to a phenome-wide scan to test for causal associations with other phenotypes/traits. Coloc considers the genetic association patterns for two traits in a genomic region and assesses whether it is likely they share a causal variant in that region. It is a Bayesian approach and assumes prior probabilities for each of the five possible hypotheses (no association with either trait, association with just one trait or the other, association with both traits and different causal SNPs, or association with both traits at the same causal SNP) are fixed and known.We consider the case where a SNP of interest is known to be causal for a phenotype which is often the case in PheWAS, and we are interested in determining if it is also causally associated with another phenotype (Fig. 1a). We will hereafter refer to the variant of interest as the query variant, the phenotype for which the query variant is known to be causally associated as the primary trait and the phenotype to be tested as the query trait. In a genomic region with \(Q\) SNPs, and under the initial assumption of a single causal variant (which we will relax later), there are \(Q+1\) possible ways or “configurations”, (Supplementary Fig. 1) to describe where the single causal variant may lie, each corresponding to exactly one of three hypotheses:Fig. 1: Introduction and evaluation of the CoPheScan method.a CoPheScan methodology: Hypotheses with illustrations of the configurations of genetic variants within the genomic region and corresponding priors. b Schematic of the CoPheScan workflow. The inputs are GWAS summary statistics from multiple traits and the position of the query variant. Computation of the posterior probabilities of the three hypotheses is performed with priors and Bayes factors computed using different CoPheScan approaches. c Study design for evaluation: Simulated data — Generated using SimGWAS and all CoPheScan approaches were run on this set. Real data — Phenotypes tested were obtained from UK Biobank and variants from fine-mapping FinnGen and a proteome dataset25. Hierarchical priors and SuSIE BF were used on the real data to identify SNP-disease associations. (QV - query variant, QT - query trait).\({H}_{n}\): No association of any variant with the query trait (one configuration)\({H}_{a}\): Causal association of a variant other than the query variant with the query trait (\(Q-1\) configurations)\({H}_{c}\): Causal association of the query variant with the query trait (one configuration)The posterior odds for each hypothesis \(\left(H\right)\) given the data \(\left(D\right)\) for the query trait with respect to the null hypothesis (\({H}_{n}\)) is given by,$$\frac{P\left(H | D\right)}{P\left({H}_{n} | D\right)}=\frac{P\left(H\right)}{P\left({H}_{n}\right)}\times \frac{P\left(D | H\right)}{P\left(D | {H}_{n}\right)}$$
(1)
In Eq. (1), the first ratio in the right-hand side is the prior odds and the second ratio is the Bayes Factor (BF). Thus, the three prior probabilities that have to be specified are: \({p}_{n}=p({H}_{n})\), \({p}_{a}=p({H}_{a})/(Q-1)\), and \({p}_{c}=p({H}_{c})\), subject to the constraint that \({p}_{n}+\left(Q-1\right){p}_{a}+{p}_{c}=1\).Beyond the difference in the hypothesis space described above, CoPheScan differs from coloc in two further ways. First, because we have reduced the hypothesis space, we can examine many variants simultaneously, allowing us to learn the priors from the data in a hierarchical Bayesian manner with Markov Chain Monte Carlo (MCMC) sampling (Supplementary methods). In contrast, coloc assumes priors are fixed and known, which is a weakness because inference must rely on the investigators’ judgement on prior probabilities of colocalisation. Second, because we are using this hierarchical approach, we can exploit additional external information about the variants and/or the traits in the form of covariates which can be included when learning the priors. This allows the priors to vary depending on the query trait/query variant pairs being considered. Here, we include the genetic correlation (\({r}_{g}\)) between the primary trait and each query trait tested (see Supplementary Methods).The restriction to a single causal variant allows us to count the possible configurations (\(Q+1\)), and if the assumption is deemed valid, CoPheScan can be run directly on summary GWAS data using Wakefield’s method12, to compute approximate Bayes factors summarising the relative support for a model where the SNP is associated with a trait compared to the null model of no association. However, this assumption is not broadly valid, and an alternative is to use the Sum of Single Effects (SuSiE) Bayesian fine mapping regression framework13,14 to partition the evidence into configurations corresponding to each of multiple possible causal variants and use these in a similar manner to allowing for multiple causal variants in coloc10. The SuSiE approach works best with either raw genotype data or summary GWAS data when in-sample LD information is available15.Hence, CoPheScan has the flexibility to be run in several ways (Fig. 1b) depending on: (i) the assumption about the number of causal variants, (ii) the specification of either fixed or hierarchical priors, and (iii) the inclusion/exclusion of covariates if the hierarchical model is used to infer priors. A detailed description of the CoPheScan method is available in the Supplementary methods. A summary of the simulated data, variant and phenotype sources used for the analysis with the real data can be found in (Fig. 1c), while a detailed description is provided in the Methods.Simulations show CoPheScan is more accurate than a standard method which does not account for LD confoundingWe simulated regional GWAS summary data for traits with either zero, one or two causal variants (Methods) such that they corresponded to the three CoPheScan hypotheses. We also allowed the probability of Hc to vary according to a simulated genetic covariance between primary and query traits and considered whether including this information in the analysis increased inferential accuracy. We analysed the same data in parallel using a conventional PheWAS approach of testing each of the set of query SNPs for association, controlling either the FDR or the family-wise error rate via Bonferroni correction. We compared these to the results from CoPheScan, using a hierarchical model (with and without the covariate data) or fixed priors chosen as described in the Supplementary methods which broadly matched the proportion of Hn, Ha, and Hc in the sample.First, we considered the appropriate threshold on the posterior probability of Hc, ppHc, to call an association. We estimated the FDR internally, as 1-mean(ppHc) | ppHc > t for different values of threshold t (Supplementary Fig. 4). We found that ppHc > 0.6 maintained an FDR < 0.05 across all analyses of simulated data. Using this threshold, CoPheScan appeared less sensitive to the presence of a single causal variant (true Hc) than the conventional BH approach but more sensitive than the Bonferroni approach (Fig. 2). CoPheScan demonstrated control of the FDR (0.026–0.039) estimated as the proportion of significant calls that were truly Hn or Ha, traits where the query variant was not causal, for the different CoPheScan approaches compared to 0.219 and 0.308 for the conventional BH and Bonferroni approaches respectively, (Supplementary Data 1). The majority of the false positives obtained from these conventional approaches were true Ha but called as associated due to LD confounding. All CoPheScan approaches performed well in the case of a single causal variant, but when there were two causal variants (True Hc2), using SuSIE resulted in ~30% higher sensitivity to correct Hc predictions than the ABF approach (Fig. 2). This was balanced against marginally lower (<0.5%) sensitivity to Hc with SuSiE when traits truly had only a single causal variant (True Hc) when compared to the CoPheScan approaches that assumed a single causal variant.Fig. 2: Results for hypotheses discrimination in simulated data.We called a single result for each simulated trait as described in Methods. The x-axis shows the percentage of hypothesis calls using the different approaches shown on the y-axis. For CoPheScan (top 6 rows), the three labelled columns on the y-axis, from right to left, indicate the type of priors used, the method used to calculate Bayes factors, and whether or not genetic correlation (\({r}_{g}\)) was used. The last two rows show conventional approaches controlling the FDR (BH Benjamini-Hochberg) or the FWER (Bonf Bonferroni) at 0.05. The top bar shows an illustration of the configuration of SNPs in the genomic region corresponding to the different simulated traits (Methods), with the queried variant at position 1 and causally associated (non-associated) variants indicated by filled (open) circles. [True Hn: no causal variant, True Ha/Ha2: one/two causal non-query variants, True Hc: causal query variant, True Hc2: causal query variant and one causal non-query variant].Although the effect of including covariate information was minor overall, Fig. 3a shows that it had a substantial effect in a minority of cases, bringing ppHc from below to above 0.6 in 2.79% of true Hc and Hc2 cases (80/2867), although also in 0.088% of true Hn and 0.011% of true Ha and Ha2.Fig. 3: Effects of covariate inclusion and varying proportions of simulated hypotheses.a Comparison of the posterior probability of Hc (ppHc) obtained with (y-axis) and without (x-axis) the inclusion of genetic correlation (\({r}_{g}\)) in the hierarchical model (using ABF). The panels represent the traits of different simulated hypotheses (Methods). b The proportion of Hc traits was varied as shown in the x-axis (dotted vertical lines), to compare Hc predictions using the fixed and hierarchical priors with different BF, both with and without the inclusion of the genetic correlation (\({r}_{g}\)) covariate. The number of true Hn and Ha traits were maintained the same at 88048 and 4700 respectively. The y-axis represents the estimated FDR—the proportion of traits assigned as Hc in each dataset which were simulated as Hn or Ha with 95% confidence intervals (dashed line — 0.05 FDR).Finally, these initial simulations showed that the hierarchical model recovered very similar results to the fixed prior model, where we chose our fixed prior values to broadly match the simulation scenarios, i.e., an optimal scenario. This offers reassurance that the hierarchical model can perform just as well as a method that “knows” the correct prior values. However, in real data, we will not know the true proportion of Hn, Hc, or Ha in our data, so we explored the robustness of both approaches to variations in these proportions. We found that using over-optimistic fixed priors, i.e. when the prior probability for Hc (P(Hc) = 0.091) exceeded the proportion of Hc in our data, led to dramatically high FDR, whilst the hierarchical model correctly adapted to the different datasets so that the FDR was controlled except at the very lowest true proportions of Hc (Fig. 3b).Using genetic correlation as a covariate increases detection of associations with disease-causal variantsWe explored the performance of CoPheScan (Supplementary Fig. 5) using a variety of causal variants sets to perform PheWAS in three sets of query variants in up to 2275 query traits (Supplementary Figs. 6 and 7) from the UK Biobank summary data provided by the Neale Lab (http://www.nealelab.is/uk-biobank/). First, 136 disease-causal variants were identified as single variant credible sets in fine mapping data from FinnGen disease endpoints (primary traits, https://www.finngen.fi/en/access_results). We identified causal associations in UKBB at 43 (31.62%) of these, predominantly amongst query traits identical or related to the primary trait. Out of 101 unique query-variant-primary trait pairs with exact query variant-query trait matched pairs in UKBB, 32 were found to be Hc (Supplementary Fig. 8), and 65 Hn due to a lack of power in UKBB (p-value > 10−5). Four cases were called Ha, and in these the UKBB p-value was small, but the fine mapping produced different results in UKBB and FinnGen (Supplementary Fig. 8).Genetic correlation (\({r}_{g}\)) information for only 1582 out of the 2275 traits used in analysis without \({r}_{g}\) was available. \({r}_{g}\) values between the 1582 query traits and 69 UKBB traits which were matched with the FinnGen primary traits were used as a covariate (130697 query trait-query variant pairs tested). Including \({r}_{g}\) in the hierarchical model made a larger difference here than in the simulated data, perhaps reflecting a stronger effect than we anticipated in our simulations. Overall, ppHc values for traits with higher \({r}_{g}\) with the primary traits increased and, conversely, decreased for traits with lower (negative) \({r}_{g}\) (Fig. 4). Incorporating the \({r}_{g}\) resulted in the identification of 19 additional associations (Supplementary Data 8). For example, the variants rs3217893_C > T and rs2476601_A > G, fine-mapped for type 2 diabetes and rheumatoid arthritis (RA) in FinnGen respectively, were found to have associations with medications gliclazide, which is a sulfonylurea used in the treatment of Type 2 diabetes, and steroid prednisolone which can be used to treat RA, only when the genetic correlation information was included.Fig. 4: Genetic correlation detects additional phenotypes.Hierarchical models of the FinnGen/UKBB dataset with/without genetic correlation (\({r}_{g}\)). The posterior probability of Hc (ppHc) of traits with and without the inclusion of genetic correlation (\({r}_{g}\)) are shown on the y and x axes respectively. The arrows represent the traits which show a difference of >0.1 ppHc after inclusion of \({r}_{g}\) (compared to the model without) and also have a ppHc > 0.6. The traits are coloured to represent their \({r}_{g}\) with the primary trait.Query variants were often associated with multiple UKBB traits (median 5) that reflected related diseases and medications (Supplementary Data 8). For instance, rs11591147_G > T, a missense variant of PCSK9, identified as a disease-causal variant in FinnGen for statin medication was found associated with the UKBB traits related to different statin medications along with several cardiovascular traits. Less commonly, we found evidence for causal association of variants to seemingly unrelated traits. For example, rs9349379_A > G, an intron variant and eQTL for PHACTR1, identified by fine-mapping the FinnGen primary trait—triptan, which is a medication used to manage migraine, was found to be associated with several UKBB traits related to migraine such as the phenotype itself, migraine medications such as sumatriptan, ibuprofen and paracetamol and also the presence of family history. However, we also found associations with angina, myocardial infarction and ischaemic heart disease, with the migraine-protective allele acting as a risk factor for cardiovascular traits. This matches results from a Mendelian randomisation study of migraine and cardiovascular disease16 but is in contrast to observational studies where migraine is considered positively associated with cardiovascular traits17. Such discrepancies between genetic and observational studies in other traits have often been resolved in favour of the genetic result, through the identification of some confounding factor which led the observational studies to report inverse relationships, and it has been suggested that certain non-triptan migraine therapies might act to increase cardiovascular risk16. However, this pleiotropy did not appear at another migraine-identified variant, rs11172113_T > C, an intronic variant of LRP1, which was fine-mapped for the same FinnGen primary trait of migraine and found to be independently associated with several migraine-related phenotypes in UKBB but not with any of the cardiovascular traits (Fig. 5).Fig. 5: Causal associations of Migraine related variants.Hc associations (ppHc > 0.6) of the PHACTR1 variant rs9349379_A > G and rs11172113_T > C, a LRP1 variant. The direction of effect (beta) is shown with respect to the G and C allele respectively.Other examples of pleiotropic variants include rs2476601, a non-synonymous variant in PTPN22 which we found to be causally associated with multiple autoimmune diseases and their treatments as well as skin cancer, with the autoimmune-protective allele increasing risk of cancer (Supplementary Fig. 9). We also found a complex set of associations with two variants in APOE, rs429358 and rs7412 that jointly define the three major structural isoforms of APOE18, ε4, ε3 and ε2 (Supplementary Fig. 10). ε2 represents the TT haplotype corresponding to the rs429358 and rs7412 variants, ε3 is represented by TC and ε4 by the CC haplotype19. We found associations with increased risk of Alzheimer’s disease, statin medication, angina and ischaemic heart disease with the ε4 allele with reference to the ε2/ε3 genotype. We also found a protective effect of ε4 compared to ε2/ε3 on traits related to a family history of diabetes and blood pressure which correspond to similar traits found in FinnGen as well as a protective effect of ε3/ε4 compared to ε2 for deep venous thrombosis might be related to the ε3/ε4 genotype with reference to ε2 and might indicate the ε2 allele. These findings align with previous studies on disease associations with different APOE genotypes20 and highlight the ability of SuSiE to map traits to distinct alleles in LD.Individual variant analysesCoPheScan can also be used to study single variants if sensible prior values can be supplied. We considered exemplar non-synonymous variants in two genes, TYK2 with established allelic heterogeneity and associations to multiple immune-mediated diseases, and SLC39A8, with established pleiotropic function. We ran CoPheScan with SuSiE BF and priors inferred from the disease-causal variant analysis above (\({p}_{a}\approx\) 3.82e-5 and \({p}_{c}\approx\) 1.82e-3), considering as query traits 2275 UKBB and 56 additional traits potentially related to either gene from the GWAS Catalog (Supplementary Data 3).TYK2 which encodes the tyrosine kinase 2 enzyme has multiple missense variants that have been associated with a range of immune-mediated diseases (Supplementary Data 11). We considered four: rs35018800_G > A (MAF: 0.0082), rs34536443_G > C (MAF: 0.0465), rs12720356_A > C (MAF: 0.0979), and rs55882956_G > A (MAF: 0.0017). rs35018800_G > A and rs55882956_G > A with the lowest MAF showed no association with any trait. rs34536443_G > C was associated with 3 UKBB and 5 GWAS Catalog traits, all immune-related and previously established associations, including psoriasis, RA, JIA (Juvenile Idiopathic Arthritis), Type 1 DM, and hypothyroidism. The variant rs12720356_A > C was associated with ulcerative colitis, psoriasis, Crohn’s disease, SLE (Systemic Lupus Erythematosus) and RA traits from the GWAS Catalog, but not with any of the UKBB traits (Fig. 6).Fig. 6: CoPheScan analysis of a gene with allelic heterogeneity: TYK2.Plots showing Hc associations (ppHc > 0.6) of TYK2 variants rs34536443_G > C and rs12720356_A > C. The direction of beta is shown with respect to the ALT allele (C in both cases). T1D Type 1 Diabetes Mellitus, JIA Juvenile Idiopathic Arthritis, PSO Psoriasis, RA Rheumatoid Arthritis, SLE Systemic Lupus Erythematosus, CD Crohn’s Disease, UC Ulcerative Colitis.The highly pleiotropic variant, rs13107325_C > T, of SLC39A8 (solute-carrier family gene which encodes the ZIP8 protein), was associated with 14 UKBB and 3 GWAS Catalog phenotypes, replicating several known associations21 with hypertension, schizophrenia, Crohn’s disease, urinary incontinence, musculoskeletal system-related traits such as osteoarthritis and traits related to alcohol dependence.We used this region to perform a sensitivity analysis, selecting four variants—rs6855246, rs35225200, rs35518360, rs13135092, in LD with rs13107325_C > T (r2 = 0.816–0.943) and running CoPheScan as if each had been selected as the causal variant. This allows us to explore two related questions: either, to what extent can two causal variants in LD cause false positive findings, or, to what extent CoPheScan might still detect an association if the “causal” variants supplied to CoPheScan are not really causal, but in LD with the causal variant. We found that CoPheScan was indeed sensitive to this misspecification, where out of the 17 traits identified as causally associated with rs13107325, 4 had ppHc < 0.6 with rs13135092 (r2 = 0.943) and 11 with rs6855246 (r2 = 0.816). The results were increasingly discrepant as the r2 with rs13107325_C > T decreased (Fig. 7 and Supplementary Fig. 11). The group of traits with high ppHc across multiple variants tended to have larger minimum p-values in the region compared to those for which ppHc was low across multiple variants, suggesting that CoPheScan will be best at discriminating between potential causal variants in LD when the association signal in the query data is strong.Fig. 7: Sensitivity analysis with SLC39A8 variants.Heatmap of ppHc of SLC39A8 variants with LD r2 > 0.8 with the rs13107325_C > T (bottom row) variant. The points are scaled based on their unadjusted -log10(pval) as given in the input summary GWAS data and a red border indicates a ppHc of >0.6. CD Crohn’s disease, SCZ schizophrenia.Finally, we sought to verify previously proposed causal associations between the HMGCR variant rs12916_T > C and metabolic traits. HMGCR encodes HMG-CoA reductase which is targeted by statins to lower LDL cholesterol. Previously, HMGCR variants have been used as a proxy for statin effect to show a higher risk of type 2 diabetes and body mass index (BMI) in MR studies22. But the validity of this has been challenged with evidence that there may be distinct causal variants underlying type 2 diabetes, BMI and HMGCR levels23. We performed CoPheScan analysis on the UKBB traits: LDL, BMI, type 2 diabetes, waist circumference and weight. We identified a known causal association with LDL (ppHc = 1). Despite significant observed p-values at rs12916 at BMI, weight and waist circumference, CoPheScan consistently concluded that while the region contained a causal variant for each trait, it was not rs12916 (ppHa > 0.99). In fact, no credible sets were identified in the HMGCR gene region and the SuSIE signals from these traits indicate the presence of an alternative causal POC5 variant (Supplementary Fig. 12). This implies that genetic studies that demonstrated a relationship between statin therapy and BMI/T2DM through HMGCR variants as a proxy might be incorrect23 as they studied the SNPs in isolation while ignoring their regional context24. CoPheScan is thus valuable in verifying assumptions in instrumental variable analyses.PheWAS of protein-associated variantsOne challenge of GWAS has been to link disease associations to their causal genes. PheWAS allows us to start with variants with known causal function on a protein and ask which diseases are also causally associated, exploiting the low false positive rate of CoPheScan. We began with 505 plasma protein QTLs25 identified as single variant credible sets in fine-mapping of 527 plasma proteins. Nine variants were identified to be associated with UKBB traits (Table 1 and Supplementary Data 9). Among the established associations, we found an association between a pQTL for APOC1 and high cholesterol, as well as reported treatment with the cholesterol-lowering simvastatin. Both associations make sense given the known biology of APOC1, but only the first would have been detected in scanning for significant p-values, as the p-value for high cholesterol at this SNP (p = 6.19 × 10−19) is much lower than for simvastatin (p = 9.59 × 10−4), emphasising the value of exploiting the additional information that we believe the variant to have a causal effect on a measurable phenotype (Supplementary Fig. 13).Table 1 Hc associations detected with pQTL variantsWe also found a novel association for rs214830_G > C, a pQTL for TGM3 which was associated with skin cancer (ppHc = 0.75). TGM3 is required for skin development and is normally expressed in the spinous/granular layers of the epidermis. Its expression was found to be absent in melanoma and squamous cell carcinoma of the skin but strongly expressed in basal cell carcinoma (BCC), suggesting it could be a specific marker for BCC diagnosis26. Association of variants in TGM3 with BCC have also been reported27,28,29 but rs214830_G > C was not always the top variant and GWAS associations can mark causal effects in neighbouring genes30. Our analysis suggests this association could be directly causal, with TGM3 involved in the development of BCC as well as acting as a biomarker.Finally, we considered 3586 variants labelled as protein-truncating (PTV) in the UKBB summary data with MAF > 0.001, consisting of those predicted by VEP to be stop_gained, frameshift, splice_acceptor and splice_donor. The fraction of query variants that were found to be causally associated with at least one trait in UKBB was much lower for PTV (~0.31%) than for disease-causal variants identified in FinnGen (~40%) and pQTL (~1.8%) (Table 2, Supplementary Figs. 5 and 6).Table 2 Summary of tested variants and phenotypes from real dataExamination of the Markov chain Monte Carlo (MCMC) chains showed issues with mixing for the PTV example which were not seen with the other datasets (Supplementary Fig. 5). When we examined the inferred priors (Supplementary Data 12) obtained from this model, we observed that the pc/pa ratio was ~1.02, indicating that the inferred pa and pc priors were almost the same. Our PTV consisted of four VEP classes, but while the MAF distribution of the stop-gained PTV was similar to missense variants, those of the other PTV (frameshift, splice donor and splice acceptor) were similar to synonymous variants (Supplementary Fig. 14a). As selection can constrain MAF, we hypothesised that the VEP stop_gained class might be more enriched for functional variation than the set of four classes we had used. We considered two ways to enrich the PTV set for functional variation: either using just this subset of the stop-gained PTVs or using the PTVs which were also defined as high confidence homozygous predicted loss-of-function (pLoF) variants in gnomAD31. pLoF were predominantly rare, such that the pLoF subset of PTV variants had a higher number of rare variants compared to the stop-gained subset (Supplementary Fig. 14b).We ran the hierarchical models for these two subsets of PTVs (Supplementary Fig. 15). Comparing the priors (Supplementary Data 12) across the different datasets tested we observed that the ratio of prior probabilities for the query variant or a non-query variant to be causal, pc/pa (Table 2) obtained using the pLoF variants (2.59) was second only to the ones obtained using the FinnGen disease-related variants. The ratio from the stop-gained variant model (1.39) was similar to the pQTL variant model (1.28). This shows that sets of query variants which have a higher functional enrichment are expected to have a high pc/pa ratio.26 associations were identified using all the PTV variants. All 15 associations detected with the stop-gained PTVs and 7 from the pLOF overlapped with those from the whole set. Of the combined 26 PTV-trait associations (Supplementary Data 10), many corresponded to known effects. One of them is, rs2066847_G > GC, a NOD2 frameshift mutation, which is reported as a pathogenic variant for inflammatory bowel disease in ClinVar and was associated with several phenotypes related to Crohn’s disease and mouth ulcers in our analysis (Supplementary Fig. 16). However, as seen with migraine and cardiovascular disease above, the association with mouth ulcers occurs in the opposite direction to the established comorbidity of Crohn’s disease and mouth ulcers in the population, with the Crohn’s disease risk allele appearing protective for mouth ulcers. Note that in the mouth ulcer trait, the effect sizes were opposite in two other SNPs identified as a credible set in SuSiE analyses of both traits (Supplementary Fig. 16).

Hot Topics

Related Articles