Systematic discovery of gene-environment interactions underlying the human plasma proteome in UK Biobank

Discovery of variance QTLs underlying the plasma proteomeIn the first stage of the study, Levene’s test with median was used to perform genome-wide vQTL analyses on blood levels of 1463 unique proteins (N ≤ 34,557). The 1463 proteins were measured by 1472 analytes using Olink technology. A Bonferroni-corrected significance threshold was set at p < 3.4 × 10−11, which reflected the adjustment of p < 5 × 10−8 (a commonly used threshold in GWAS)20 for 1463 proteins. There were 269,225 significant vQTL associations across 575 analytes at p < 3.4 × 10−11. The associations implicated 568 unique proteins. There was limited evidence for genomic inflation (range of λ = [0.9, 1.1], Supplementary Data 1). Six hundred and seventy-seven independent vQTLs were identified through linkage disequilibrium (LD) clumping (see “Methods”, Supplementary Data 2). Supplementary Data 2 and 3 show genomic annotations of the variants using the Open Targets platform21,22.Four hundred and seventy-three (69.9%) of the 677 independent vQTLs were cis effects (within 1 Mb from the gene encoding the protein) and the remaining 204 represented trans effects (Fig. 2a). The majority of proteins had one independent vQTL (488, 85.9%) and the maximum number of vQTLs per protein was 5 (for FOLR3 and PNLIPRP2, Fig. 2b). There was an inverse relationship between the logarithms of effect sizes and minor allele frequency (MAF) for both cis and trans loci (r = −0.45 and −0.46, respectively, MAF ≥ 5%, Fig. 2c). One vQTL was in strong LD (r2 = 0.87) with an expression vQTL at the same locus (rs858502 and rs66809776 for PILRA, respectively)23. However, no protein vQTL overlapped with known DNA methylation vQTLs24.Fig. 2: Genome-wide association studies to identify variance QTLs for 1472 blood protein measures in UK Biobank.a The 1472 analytes or measures represented 1463 unique proteins. Genome-wide variance QTL tests were performed using Levene’s test (two-sided). A Bonferroni-corrected significance threshold of p < 3.4 × 10−11 was set. The x-axis represents the chromosomal location of independent cis and trans vQTLs. The y-axis represents the position of the gene encoding the associated protein. Cis (red circles); trans (blue circles). b The number of independent vQTLs per protein. c Association between the common logarithm of minor allele frequencies and the natural logarithm of absolute effect sizes for cis and trans vQTLs. Cis (red circles and line); trans (blue circles and line). d The discovery and replication sets consisted of 34,557 and 17,806 participants, respectively. Pearson’s correlations between effect sizes in the discovery and replication sets are shown for vQTLs that were significant in the discovery set. An outlier is highlighted and resulted from differences in allele frequencies across distinct ethnic groups in the study sample. CI confidence interval, MAF minor allele frequency, vQTL variance quantitative trait locus.Six variants were not available for replication analyses as they had MAF < 5% in the replication set (N = 17,806). This left 671 variants for replication testing. Effect sizes for the variants were highly correlated between the discovery and replication sets (r = 0.96, 95% CI = [0.95, 0.97], Fig. 2d, Supplementary Data 4). Three hundred and ninety-six variants (59.0% of 671) survived Bonferroni correction in the replication set (p < 3.4 × 10−11), and had effect sizes that were directionally concordant with those from the discovery set. Of note, the discovery set had twice the sample size of the replication set. The discovery sample also contained white Europeans only whereas the replication set comprised 80% white Europeans and several other ethnic backgrounds (see Methods). It is likely that differences in ancestries underpinned variants whose effect sizes showed opposing directions across sets. A notable example was the cis vQTL rs35489971-A for CD300LF, which had an effect size of −0.35 in the discovery set and 0.13 in the replication set (highlighted in Fig. 2d). There was evidence for ancestry-specific effects for such outliers. The A allele increased the variance of CD300LF levels in those of Black/Black British ancestry but not in those of South Asian or European ancestries. The A allele is also the major allele in those of Black/Black British ancestry (frequency = 70%) but is the minor allele (<25%) in other ancestries. The ancestry-specific effects likely underscored the positive association between the A allele and CD300LF variance in the replication sample but not the discovery sample, which comprised individuals of European ancestry only. However, sensitivity analyses showed that vQTL effect sizes were largely comparable (r = 0.82) between individuals of non-European and European descent when matched for sample size (N ≤ 2518, Supplementary Data 5). The sample structure was retained, which enabled direct comparisons to a recent main effect QTL study by Sun et al.6, who implemented the same sampling strategy.Variance QTLs largely overlap with main effect QTLs for the blood proteomeThe majority of vQTLs (610, 90.1%) had main effect p values < 3.4 × 10−11 (Supplementary Data 6). Only fifteen vQTLs lacked a main effect finding at p < 0.05. Their p-values ranged from 0.08 to 0.98 and three loci were located within the complex MHC region. Figure 3a, b demonstrate examples of vQTL loci with and without marginal main effects, respectively. A reverse look-up strategy showed that only 3.4% of main effect QTLs (301 of 8856 variants at MAF ≥ 5%) had vQTL p values < 3.4 × 10−11. Over 23% of main effect QTLs (2053 of 8856) had vQTL p values below a more relaxed threshold of p < 0.05 (Supplementary Data 6). Therefore, most vQTLs were main effect QTLs but not vice versa.Fig. 3: Genetic architectures of main effect and variance QTLs for plasma protein levels.a Example of variance QTL (rs147072313) without corresponding main effect on protein levels (FLT3LG). Levene’s test was used to perform variance QTL tests (two-sided). Centre line of boxplot: median, bounds of box: first and third quartiles and tips of whiskers: minimum and maximum. b Example of variance QTL (rs12037951) with significant main effect on protein levels (ACP6). Levene’s test was used to perform variance QTL tests (two-sided). Centre line of boxplot: median, bounds of box: first and third quartiles and tips of whiskers: minimum and maximum. c Distributions of predicted functional annotation classes for all variance QTLs. Bar height represents the mean proportion of variants within each class. d Distributions of predicted functional annotation classes for main effect QTLs. TF transcription factor, UTR untranslated region, vQTL variance quantitative trait locus.Supplementary Fig. 1 shows the relationship between the significance of cis vQTLs or main effect QTLs and their distances from TSS. Figure 3c, d show the predicted functional classes of vQTLs and main effect QTLs. Most predicted classes exhibited little variation between these QTL types given their substantial overlap. However, a higher proportion of main effect QTLs were annotated to exons than vQTLs (17.3% vs. 10.3%). In balance, a lower proportion of main effect QTLs were annotated to intergenic sites when compared to vQTLs (18.9% vs. 28.2%)6.Olink assays also rely on limits of detection (LOD) for each protein and sample plate. The percentage of individuals falling below their LOD for each protein is shown in Supplementary Data 1. Individuals were not removed based on LOD in the consortium. It is challenging to select a threshold to filter proteins based on LOD (e.g. removing proteins that had >50% of individuals falling below the LOD). Such thresholds may be arbitrary. We characterised how many vQTL sites were associated with proteins that had an appreciable number of individuals falling below the LOD. Sixty vQTLs were associated with proteins that had >10% of individuals whose relative measurements were below the LOD for that protein. In addition, 39, 13 and 8 vQTLs associated with proteins that had >25%, >50% of >75% of individuals whose measurements did not surpass the LOD. If a variant influenced whether individuals surpassed the LOD, it could introduce disparities in trait variance across carriers and non-carriers. We conducted a sensitivity analysis to assess if vQTL variants were associated with whether individuals were above or below the LOD for their respective proteins (binarised outcome). Forty-one vQTLs (6.1%) were associated with assay detectability p < 7.4 × 10−5 (p < 0.05 adjusted for 677 tests, Supplementary Data 7). However, we did not observe a significant difference in LOD distributions between the 575 proteins with vQTL associations when compared to the 895 without (χ = 338, p = 0.90). vQTL variants were not associated with missingness due to other technical factors at p < 7.4 × 10−5 (Supplementary Data 8). Therefore, assay detectability, but not other technical considerations, may have exerted an effect on the discovery of a small subset of vQTL associations.Variance QTLs unveil gene-environment interactions dispersed across the plasma proteomeVariance QTLs may be explained by a number of potential influences, including gene-gene interactions (epistasis), statistical artefacts and gene-environment interactions. A particular strength of vQTL discovery has rested in nominating a priority set of genetic variants for gene-environment association tests15,17. In the second stage of the study, we therefore tested whether protein levels were associated with an interaction between their vQTL(s) and 518 disparate environmental factors (see “Methods” for phenotype selection and quality control criteria). Summary data for these phenotypes are shown in Supplementary Data 9. Briefly, phenotypes were selected through a combination of systematic selection criteria (using the GWAS Catalog25) and further manual curation to ensure that a broad range of relevant phenotypes were considered. Variance QTLs were first queried against the GWAS Catalog. Phenotypes that associated with these variants at p < 5 × 10−8 in the GWAS Catalog were extracted and included in GEI tests if they were also present in the UK Biobank database. Additional continuous phenotypes that were previously examined by Westerman et al.17 were included, as well as possibly relevant lifestyle or metabolic traits. Possible sources of variance heterogeneity other than GEIs, such as epistasis, phenotype preparation and phantom vQTLs are explored in later sections.We observed 1168 GEIs at a Bonferroni-adjusted threshold of p < 1.4 × 10−7 when vQTLs were used as index variants (p < 0.05 adjusted for 677 variants and 518 exposures). Significant GEIs comprised 15.4% of the vQTLs tested (104 variants) and reflected 101 unique proteins (Supplementary Data 10). The main effect QTL strategy returned 3061 GEIs at the same threshold. However, these associations encompassed only 2.8% of main effect QTLs (244 of 8856 variants, Supplementary Data 11). The proportion of vQTLs that participated in a GEI effect was 5.5-fold higher than the proportion of pQTLs, and this estimate is in line with previous findings15,17.Many of the exposures tested were from highly correlated categories (e.g. physical and lipid traits). Therefore, to identify ‘independent’ GEI associations, we undertook stepwise conditional analyses for each protein. Here, GEI associations for each protein were iteratively conditioned on the most significant association for that protein. GEIs that failed to survive Bonferroni correction once conditioned on the most significant association were removed. The process was repeated using the next most significant association for that protein until no further phenotypes could be considered (see Methods). These analyses do not point towards biological pathways or potentially underlying biological effects. They were carried out to help account for the correlation structure between GEIs at a given locus. Conditional analyses suggested that 130 of the 1168 GEIs with vQTLs were ‘independent’ when further accounting for correlations between phenotypes (Supplementary Data 12). Twenty proteins exhibited two or more conditionally significant associations with a maximum of 4 conditional GEIs for FCGR2A, FUCA1 and LDLR (Fig. 4a). Conditional GEIs included 53 unique phenotypes. In total, 115/130 GEIs were nominally significant (p < 0.05) and directionally concordant with estimates from the replication set (Supplementary Data 13).Fig. 4: Gene-environment interactions underlying the plasma proteome are pervasive.a Interaction Z-scores are highlighted only for proteins with two or more conditionally significant GEI effects. Positive Z-scores are shown in green and negative Z-scores are shown in purple. Supplementary Data 14 shows direct associations between protein levels and their respective phenotypes displayed in (b) (i.e. not stratified by genotype). b Illustrative example: mean transformed LDL receptor or LDLR levels (closed circles) with 95% confidence intervals (vertical bars) when stratified by vQTL genotype (rs75627662) and tertiles of measured blood LDL cholesterol. Data are presented as mean values ±95% confidence intervals. GEI gene-environment interaction, LDL low-density lipoprotein, vQTL variance quantitative trait locus.GEIs implicate known biological pathways and are influenced by age and sexGEIs revealed a number of interactions consistent with known biology. Thirty-eight conditional GEIs involved phenotypes that were selected via look-up analyses in the GWAS Catalog25 (Supplementary Data 16). These associations entailed variants with known main effects on the exposure (i.e. phenotype) and also on the mean and/or variance of the protein’s measured abundance in blood. There were also known correlations between the protein level and phenotype, including an association between LDL receptor (low-density lipoprotein, LDLR) and LDL cholesterol26 (Fig. 4b). Our GEI analyses supplements existing knowledge by showing genotype-dependent effects for these associations. For instance, the trans variant rs75627662 located near APOE had an additive effect on transformed LDLR levels and served as a vQTL. There was a strong positive correlation between LDLR and cholesterol concentrations in major (C)-allele carriers (rhet = 0.32, 95% CI = [0.31, 0.34], p = 1.2 × 10−259; rmajor = 0.42, 95% CI = [0.41, 0.43], p = 1.0 × 10−300; Pearson’s correlation). The association was ablated in those homozygous for the minor T-allele (rminor = 0.01, 95% CI = [−0.05, 0.06], p = 0.85). This underscored a clear GEI that was uncovered by the two-stage strategy. Other GEIs included associations between interleukin-6 and C-reactive protein levels (Supplementary Data 16). Interleukin-6 stimulates the synthesis of C-reactive protein27. We also detected a GEI whereby the effect of a cis variant on oxytocin levels differed according to sex. Oxytocin shows sex-dependent effects on a range of physical and behavioural traits28.Thirty-four vQTL associations were captured by GEIs involving age or sex. Furthermore, 391 of the 1038 GEIs (37.7%) that did not surpass conditional significance thresholds were removed after being conditioned on age or sex. Most of these associations involved phenotypes with clear sex differences. For example, circulating levels of PAEP (progestogen-associated endometrial protein, or glycodelin) showed an initial GEI with body weight. A GEI effect was not present in males or females when considered separately, but was detected when both were considered together. It was observed only because males had both higher PAEP levels and higher body weights at the minor T-allele of the cis variant rs697449, giving the appearance of a GEI (Supplementary Fig. 2). By contrast, we observed that age could mask rather than induce associations. PAEP was the most susceptible protein to this effect, and the effect was restricted to female participants. Negative correlations between PAEP levels and body weight were observed across genotypes in the youngest age group (40-50 years). Positive correlations were observed in the oldest age group (60–70 years, Fig. 5a). These opposing associations masked one another in the main analyses. In terms of clinical relevance, we also detected negative associations between PAEP levels and body weight in those who self-reported not having experienced menopause (mean age = 46.3, sd = 4.4 years, Fig. 5b). There were positive correlations in those who had experienced menopause (mean age = 60.5, sd = 5.4 years, Fig. 5b), and in those who had undergone a hysterectomy (mean age = 58.7, sd = 6.8 years, Supplementary Fig. 3). In all instances, the associations were strongest in carriers of the T-allele. Therefore, the allele underscored a stronger negative relationship between its protein and body composition before menopause, and a stronger positive relationship following menopause, when compared to the opposite allele. We also observed that PAEP levels were, on average, lower in older participants and by extension, those who had experienced menopause, mammograms and hormonal replacement therapy (Supplementary Figs. 4–6). Body weight was not altered by age (Supplementary Fig. 7). Together, these lines of evidence suggest that the role of glycodelin in the homoeostatic landscape may become altered in the context of menopause.Fig. 5: Relationships between glycodelin levels and body weight are influenced by genotype and menopause history.a Mean transformed glycodelin levels (PAEP, closed circles) with 95% confidence intervals (vertical bars) are shown according to tertiles of body weight (in kilograms, kg) and rs697449 genotype. Data are presented as mean values ±95% confidence intervals. The associations are stratified according to three decades of life in females. b Correlations between PAEP levels and body weight are shown in green (left-side) for participants who reported ‘No’ to the question ‘Had menopause?’ at the study baseline (Field: 2724). Correlations are shown in purple for participants who reported ‘Yes’ to the same question. Correlations are stratified further by genotype and evidence that the T-allele underscores a stronger correlation in participants who had and also who had not experienced menopause. The statistical test used was Pearson’s correlation (two-tailed).Gene-environment interactions capture why some vQTLs lack genetic main effects on protein levelsvQTLs may lack genetic main effects if the variant positively correlates with protein levels in one stratum of an environmental exposure and negatively in other strata. The opposing effect sizes preclude a marginal main effect and produce a ‘directionally discordant’ GEI17. A slightly higher proportion of vQTLs without main effects at p < 3.4 × 10−11 participated in conditional GEIs compared to those with main effects on protein levels (14/67 or 20.9% versus 90/610 or 14.8%, respectively). However, 13 of the 14 vQTLs (92.9%) without additive main effects were involved in directionally discordant interactions, compared to only 2/90 (2.2%) of those with additive main effects (Supplementary Data 16). Indeed, directionally discordant interactions likely explained why 13 of the wider 67 vQTL-only sites lacked additive main effects. Associations between a given variant and protein level may not have been significant in each strata of an exposure. However, the requirement for a discordant interaction was that the effect size must have been opposite in direction in at least two strata. All 13 vQTL-only sites that are captured by such discordant GEIs are described in full in Supplementary Information. An illustrative example is highlighted below.The trans indel in FLT3 (rs147072313, fms-related tyrosine kinase 3) associated with the variance of its ligand (FLT3LG). The binding of FLT3LG to cell-surface FLT3 promotes monocyte proliferation (Fig. 6a)29. FLT3LG levels associated with a genotype-by-monocyte count interaction in this sample (p = 9.3 × 10−11). rs147072313 positively correlated with FLT3LG levels within those assigned to the highest tertile of monocyte counts (β = 0.09, se = 0.02, p = 4.1 × 10−6). However, there was a negative correlation in the lowest tertile (β = −0.04, se = 0.02, p = 0.01), precluding a main effect (Fig. 6b). The additive main effect was almost null (p = 0.96, shown previously in Fig. 2a). This illustrates how stratification across levels of an environmental exposure can prevent the observation of a genetic main effect. The exposure that was likely responsible for this stratification was delineated via our GEI analyses.Fig. 6: Variance QTL in FLT3 receptor gene modifies the relationship between FLT3 ligand levels and monocyte count.a Schematic diagram showing the role of FLT3 receptor and FLT3 ligand (FLT3LG) binding in monocyte proliferation. b Linear regression was used to examine the relationship between rs147072313 genotype and FLT3LG levels across tertiles of monocyte count (two-sided). Mean transformed FLT3LG levels (closed circles) with 95% confidence intervals (vertical bars) are displayed according to tertiles of monocyte count and rs147072313 genotype. Of note, the wide confidence intervals in those homozygous for the indel reflect the small sample size of this group. They represent confidence intervals for the mean of the protein levels across strata. They do not reflect the variance of the protein level, which is instead visualised in (c) and is most clearly illustrated via boxplots in Fig. 2a. c displays full distributions of FLT3LG levels and their correlation with monocyte counts. c The statistical test used was Pearson’s correlation (two-tailed). The correlation between transformed FLT3LG levels and monocyte count differs according to genotype at rs147072313 giving rise to a gene-environment interaction. Here, it is also apparent that the vQTL confers reduced variance in FLT3LG in carriers of the indel. d The association between circulating FLT3LG levels and FLT3 receptor levels is consistent across genotypes. The statistical test used was Pearson’s correlation (two-tailed). (e) Violin plots show that rs147072313 genotype does not have an additive main effect on monocyte count. Centre line of boxplot: median, bounds of box: first and third quartiles and tips of whiskers: minimum and maximum. QTL, quantitative trait locus. Figure 6a Created with BioRender.com released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.Figure 6C shows that the indel was associated with decreased variance of transformed FLT3LG levels (p = 1.1 × 10−15, also visualised previously in Fig. 2a). In relation to the GEI effect, it revealed a weak negative correlation between measured FLT3LG levels and monocyte count in major allele carriers (T-allele, rmajor = −0.03, 95% CI = [−0.04, −0.02], p = 3.2 × 10−6). A weak positive correlation was observed in heterozygotes (rhet = 0.04, 95% CI = [0.02, 0.06], p = 1.1 × 10−4) and a stronger positive correlation was observed in indel homozygotes (rminor = 0.12, 95% CI = [0.06, 0.18], p = 2.6 × 10−4, Fig. 6c). This suggested that the effect of FLT3LG on monocyte count could be linked to rs147072313 genotype. However, the relationship between FLT3LG and FLT3 levels was consistent across genotypes, which suggested that ligand-receptor binding was preserved (Fig. 6d). Genotype also did not associate with mean differences in monocyte count (β = −0.003 per T-allele, se=0.002, p = 0.13, Fig. 6e). Therefore, the indel was not associated with an overall detrimental effect on monocyte count. The relationship was specific to monocytes when considering eight other blood cell types (Supplementary Fig. 8).Alternative explanations for variance QTLs underlying the plasma proteomeGEIs are sufficient but not necessary to generate a vQTL15. It is also not possible to draw a causal interpretation between tagged genetic variants, environmental exposures and variance heterogeneity. Furthermore, vQTL associations may reflect epistatic (or gene-gene) interactions, phantom vQTLs and artefacts of statistical transformations applied to phenotypes. To explore these alternative explanations, we first examined whether protein levels associated with an interaction between a vQTL and any other SNP located more than 10 Mb away on the same chromosome (see Methods, Supplementary Data 17). Of the 677 vQTL associations, 5 showed an epistatic interaction effect on their respective protein level at p < 5.0 × 10−8. This significance threshold was selected as we estimated that there were approximately 1 million independent epistasis tests across our analyses (p < 0.05 adjusted for 1 million tests).Another important consideration in vQTL analyses is the relationship between the mean and the variance of phenotypes16. Differences in the variance of protein levels could be explained by additional linked main effect SNPs or QTLs24,30. This is of particular concern for traits with QTLs of large mean-effect sizes, such as molecular phenotypes (e.g. protein levels and DNA methylation24), but less so for complex traits15. Therefore, where possible, we additionally adjusted each protein that had a vQTL association for the most significant mean-effect SNP at p < 3.4 × 10−11 from Sun et al.6. We then repeated each vQTL association test and observed that 451 (66.6% of 677) associations survived Bonferroni correction at p < 3.4 × 10−11 (Supplementary Data 18). Therefore, 226 associations could be captured by linkage to main effect QTLs. Of note, epistasis effects may also arise from linkage to main effect QTLs. Furthermore, these associations are indicative of phantom vQTLs but this cannot be ascertained without the use of whole genome sequencing data. Therefore, the 226 associations highlighted in these sensitivity analyses can more accurately be described as statistical artefacts. Together, 301 of all 677 vQTL associations were captured by GEIs, epistasis or statistical artefacts.Rank-based inverse normal transformation (RINT), implemented in this study, may reduce the correlation between mean and variance effects31 but such non-linear transformations can inflate the type I error in Levene’s test in the presence of QTL effects. Therefore, as further sensitivity analyses, we re-estimated vQTL associations when preparing phenotypes in line with guidelines from Wang et al. to avoid these potential biases (see Supplementary Information)15. 671 variants remained at p < 3.4 × 10−11 with the remainder at p < 8 × 10−11 (Supplementary Data 19). Supplementary Information details further sensitivity analyses in which protein levels were adjusted for the exposures they associated with in stage two of our study design. vQTL association statistics were re-estimated when adjusting for participating exposures. Relative effect sizes were correlated 99% with those from the main analyses (Supplementary Data 20). Therefore, vQTL associations were largely robust to statistical transformations of phenotypes.

Hot Topics

Related Articles