Genome-wide discovery for biomarkers using quantile regression at biobank scale

Overview of quantile regression techniquesWe describe here briefly the quantile regression model as a complementary strategy to the commonly used linear regression model in GWAS. More details are in the Methods section. We assume that we have n independent samples from a population. We denote by \({\bf{Y}}=({Y}_{1},\cdots \,,{Y}_{n})^{\prime}\) the n × 1 sample phenotype vector, by X the n × p genotype matrix, and by C the n × q matrix for covariates. We denote the τth conditional quantile function of Y as QY(τ∣X, C). Then we can write the conditional quantile regression model for jth variant and for a specific quantile level τ ∈ (0, 1) as$${Q}_{{\rm {Y}}}(\tau | {{\bf{X}}}_{j},\,{\bf{C}})={{\bf{X}}}_{j}{{\boldsymbol{\beta }}}_{j}(\tau )+{\bf{C}}{\boldsymbol{\alpha }}(\tau ),$$
(1)
where βj(τ) and α(τ) are quantile-specific coefficients and can be estimated by minimizing the pinball loss function18 as implemented in the R package quantreg. For testing H0: βj(τ) = 0, a commonly used tool in quantile regression is the rank score test6,28, implemented in the R package QRank. The rank score test statistic for a fixed quantile level τ is defined as$${{\bf{S}}}_{{\rm{QRank}},j,\tau }={n}^{-1/2}\mathop{\sum }\limits_{i=1}^{n}{{\bf{X}}}_{ij}^{*}{\phi }_{\tau }({{\bf{Y}}}_{i}-{{\bf{C}}}_{i}\hat{{\boldsymbol{\alpha }}}(\tau ))\,{\rm{and}}\,{{\bf{V}}}_{{\rm{QRank}},j}={n}^{-1}\tau (1-\tau ){{\bf{X}}}_{j}^{*^{\prime} }{{\bf{X}}}_{j}^{*}.$$
(2)
where X* = PCX, \({{\bf{P}}}_{{\rm {C}}}={\bf{I}}-{\bf{C}}{({\bf{C}}^{\prime} {\bf{C}})}^{-1}{\bf{C}}^{\prime}\), and ϕτ(u) = τ−I(u < 0) is the τ-pinball loss. Under the null hypothesis H0: βj(τ) = 0, \({{\bf{V}}}_{{\rm{QRank}},j}^{-1/2}{{\bf{S}}}_{{\rm{QRank}},j,\tau } \sim N(0,1)\). Note that the asymptotic distribution of the test statistics is independent of the particular distribution of the phenotype, and hence the rank score test can be applied to any phenotype without requiring a pre-transformation to achieve normality. Throughout this paper, we use nine quantiles spaced equally at 0.1 intervals and combine quantile-specific p-values via Cauchy combination29.QR is powerful across different homogeneous and heterogeneous modelsWe perform simulations based on UKBB genotype data on 20,000 white British unrelated individuals and focus on one locus (chr8:19,291,998–20,291,997; GRCh37/hg19) with 4148 variants. We assume only one causal variant selected at random among variants with minor allele frequency (MAF) > 1% and with varying effect size: 0.01 ≤ β ≤ 0.2. We simulate phenotype data using several models, including a homogeneous linear model (with effects being constant across quantiles) with the error distribution being normal or a heavy tail Cauchy distribution; a location-scale model; a local model where effects only exist at the upper quantiles; and a model with both additive and dominance effects. Trait is normalized using rank-based inverse normal transformation for LR only; raw trait values are used for QR and QUAIL. We assess the power of LR, QUAIL, and QR based on 100 Monte Carlo replicates. Note that the type-1 error is well controlled under both normal and Cauchy error distribution at individual quantile levels (Table S1). There is slight inflation at the 0.05 level when using the Cauchy combination method to combine p-values across quantile levels, due to the fact that the Cauchy combination method is known to be most accurate in the tail of the null distribution29, and therefore smaller p-values are approximated better. Given that in GWAS, we are interested in identifying variants at stringent significance levels, the Cauchy combination method should be an appropriate strategy to combine p-values.Homogeneous linear model with normal or Cauchy error distributionwe assume the phenotype Y follows the model:$${\bf{Y}}=\beta {\bf{X}}+{\boldsymbol{\epsilon }},$$
(3)
where X is the genotype at the causal variant, ϵ ~ N(0, 1) or ϵ ~ Cauchy(0, 1). For the normal error, LR has a slight power increase over QR (Fig. 1a), while for the Cauchy error (heavy tail distribution), QR has higher power than LR (Fig. 1b). QUAIL has no power in these scenarios.Fig. 1: Power results in simulations.a Homogeneous model, normal error distribution; b homogeneous model, Cauchy error distribution; c location-scale model; d local model; e model with additive and dominance effects. Power is shown when beta varies from 0.01 to 0.2. For the model with additive and dominance effects, the dominance effect is fixed at 0.2, while the additive effect varies from 0.01 to 0.2. Figures on the left show densities of the raw trait values by genotype group for one replicate. Figures in the middle show the empirical conditional quantile functions for τ = 0.1−0.9 for the same replicate for beta = 0.1 (betaA = 0.1 and betaD = 0.2 for the additive + dominance model). Statistical significance was determined by the QR rank score test described in the Methods section. Shown in black is also the LR line fitted to the data.Location-scale modelwe also simulate the phenotype Y from the following model:$${\bf{Y}}=\beta {\bf{X}}+(1+0.5\beta {\bf{X}}){\boldsymbol{\epsilon }},$$
(4)
where ϵ ~ N(0, 1). In this model, X is associated with both the mean and the variance of Y. QR has the highest power in this scenario, followed by QUAIL, while LR has the lowest power, as expected (Fig. 1c).Local model with effects only at upper quantileswe additionally simulate data under a local model with effects only at the upper quantiles τ ∈ [0.7, 1]:$${Q}_{{\rm {Y}}}(\tau | {\bf{X}})={\boldsymbol{\beta }}(\tau ){\bf{X}}+{{\boldsymbol{\alpha }}}_{0}(\tau ),$$
(5)
where β(τ) = 5β(τ−0.7)/(1−0.7) when τ > 0.7 and β(τ) = 0 otherwise. The error distribution (\({{\boldsymbol{\alpha }}}_{0}^{-1}(\tau )\)) is assumed to be normal. Such a model can correspond to a gene–environment interaction, where a genetic variant only shows an effect on a phenotype in a certain environment. Under this local model, QR can have significant increase in power over LR (Fig. 1c). QUAIL has similar power as QR.Model with both additive and dominance effectsfinally, we consider a model that includes both additive and dominance effects as follows:$${\bf{Y}}={\beta }^{{\rm {a}}}{{\bf{X}}}^{{\rm {a}}}+{\beta }^{{\rm {d}}}{{\bf{X}}}^{{\rm {d}}}+{\boldsymbol{\epsilon }},$$
(6)
where Xa corresponds to the additive coding 0, 1 and 2, while Xd is coded as \(-p/(1-p),\, 1,-\left.(1-p)/p\right)\) with p being the MAF, and ϵ ~ N(0, 1). When the dominance effect is strong enough, QR can indeed be more powerful (Figs. 1d, S1). QUAIL overall has lower power than both LR and QR.Applications to UKBB GWASTo illustrate QR in a modern GWAS setting and contrast the results with LR and QUAIL, we use the UKBB data (between 264,854 and 325,825 white British, unrelated/independent individuals and  ~8.6 million SNPs), and perform GWAS for 39 quantitative traits (Fig. S2). We adjust for several covariates, including age, sex, batch, and 10 principal components (PC) of genetic variation. We identify significant SNPs using a genome-wide significance threshold of 5 × 10−8 for each trait (Fig. S4–S13). We define a locus as a 1Mb segment around the top p-value SNP in a region. LR detects a larger number of significant associations relative to QR (2,039,518 genome-wide significant SNPs and 9964 loci vs. 1,517,669 SNPs and 7297 loci). Most of the QR discoveries are shared with LR, but QR also identifies new loci that LR misses (Fig. 2a, S2, and Supplementary Data 1). In contrast, QUAIL detects a few significant associations across these 39 biomarkers, most of which are already detected by QR (Fig. S3). Specifically, on average, only 9% (ranging from 0% to 33%) of the QR associations were also detected by QUAIL.Fig. 2: Number of genome-wide significant loci identified by LR and QR for each of 39 traits, effect size heterogeneity, and functional score comparison.a Number of genome-wide significant loci. Stacked barplots show the corresponding number of significant loci for LR&QR, LR–QR, and QR–LR. b Densities of heterogeneity indexes for GWAS associations for 39 traits. Densities are shown for three sets of associated SNPs: LR&QR, LR–QR, and QR–LR. c Functional scores (EigenPC and CADD) for the associated SNPs detected by QR–LR compared to those from LR–QR. Each box plot represents the median (line inside the box) and interquartile range (box top and bottom) and extends to 1.5 times the interquartile range (whiskers). One-sided p-values (QR–LR > LR–QR) from the Kolmogorov–Smirnov test are also shown comparing the functional effects for 42,291 vs. 564,140 associations identified by QR–LR vs. LR–QR.We replicate the genome-wide significant associations above in an independent dataset of European individuals from UKBB (n ≈ 34,000–39,000). Given the much smaller sample sizes for the replication datasets relative to the discovery datasets, we report replication at nominal significance level (p-value < 0.05). One challenge with QR is that the result is a composite of results across multiple quantile levels and the p-value is obtained by Cauchy combination of individual quantile-level p-values which makes it more difficult to take the sign of the associations into account when considering replication. To overcome this challenge, we used the sign of the estimated effect β(τ) for the quantile with the lowest QR p-value in the discovery dataset and compared it to the sign of the estimated effect for the same quantile level in the replication dataset. QR has replication rates at least as high as LR (68% vs. 62% on average across the 39 traits, Fig. S14, Table S2). For both LR and QR, sign consistency tends to be very high (≈96%). Note also that most of the lead QR variants at loci only identified by QR still have suggestive signals in LR, suggesting that many of them could be identified in larger GWAS (Fig. S15).Since QR allows us to estimate quantile specific effects across a whole range of quantiles, we can quantify the degree of heterogeneity in effects across the quantiles for individual SNPs. Specifically, for a SNP we compute a heterogeneity index as the log transformed ratio between the standard deviation and the absolute mean of the estimated quantile coefficients β(τ)’s: \(\log | {\rm{sd}}(\beta (\tau ))/{\rm{mean}}(\beta (\tau ))|\). The higher the value, the more heterogeneity in the quantile effects. As expected, SNPs identified by QR but missed by LR (QR p-value < 5e−08, LR p-value > 5e−08) are the most heterogeneous (Fig. 2b). Furthermore, they have significantly higher functional scores relative to those only identified by LR (Eigen-PC30 p = 0.031, Fig. 2c). Together with the higher replication rates and the high sign consistency mentioned above, these results suggest that the new QR-LR loci are at least as reliable as the loci identified only by LR.Next, we focus on two traits, BMI and LDL, to provide more specific results. Both traits have a right-skewed distribution. For BMI, QR identifies 188 loci, of which 172 are shared with LR, and 16 are unique to QR (Figs. 3a and S16, Table S3). Variants identified by QR but missed by LR have heterogeneous effects across quantile levels as expected, with larger effects at the upper quantiles (Fig. S17), likely due to the right skewness of the BMI distribution. For LDL, QR identifies 93 loci, with 87 of them being shared with LR, and 6 being unique to QR (Fig. 3b and S18, Table S3). As with BMI, variants identified by QR but missed by LR have larger effects at the upper quantiles (Fig. S17). Moreover, variants identified by QR but missed by LR have more heterogeneity (Fig. S19a) and higher functional scores (Fig. S19b) relative to those only identified by LR.Fig. 3: Manhattan sunset plot for BMI and LDL.\(-{\log }_{10}(p\,\text{-value}\,)\) of the genome-wide significant associations (p-value < 5 × 10−8) detected by the LR (blue bars) and QR (red bars) for a BMI and b LDL are plotted against the genomic positions. The lead QR association at a locus detected by QR–LR is represented by a red diamond with the name of the closest protein-coding gene indicated.Next, we illustrate the advantages of QR when there is effect heterogeneity by showing results at two loci detected by QR but missed by LR for two traits, BMI and LDL. For BMI, we show the locus corresponding to the lead QR SNP rs12037905 (Fig. 4a), which falls within LYPLAL1-antisense RNA1 (LYPLAL1-AS1), a long noncoding RNA gene, which has been found previously to be associated with BMI31 and lipid traits such as high-density lipoprotein (HDL) cholesterol and triglycerides (TG) levels32. Yang et al. described a mechanism by which long noncoding RNA regulates adipogenic differentiation of human mesenchymal stem cells and suggested that LYPLAL1-AS1 may serve as a novel therapeutic target for preventing and combating diseases related to abnormal adipogenesis, such as obesity33.Fig. 4: Examples of QR–LR and QR& LR significant loci for BMI and LDL.a and b QR–LR significant loci; c and d QR& LR significant loci. The regional plots show \(-{\log }_{10}(p\,\text{-value}\,)\) of QR with the lead QR SNP per locus shown as diamond. Colors indicate the LD (r2) to the lead SNP for each variant. For the lead QR SNP at each locus, the estimated quantile specific effects ± SE are shown as dots and error bars in the middle panels, and the empirical conditional quantile functions for τ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 quantiles of BMI and LDL by genotype group (one gray dot per individual) are shown in the right panels. Sample size n = 325,306 (BMI) and 310,574 (LDL). Statistical significance was determined by the QR rank score test described in the “Methods” section.For LDL, we show the lipoprotein lipase (LPL) locus with lead QR SNP rs326 (Fig. 4b), which is an intron variant within LPL. LPL, encoding lipoprotein lipase, plays an important role in the clearance of TG and the maturation of HDL. A higher concentration of serum LPL is associated with a lower/higher level of TG/HDL cholesterol and contributes to a reduced risk of coronary artery disease (CAD)34,35. rs326 at LPL is a pleiotropic variant associated with CAD36 and lipid metabolism37. Large lipid genetic studies have shown that the rs326 G allele is associated with increased HDL cholesterol levels and decreased TG, suggesting a protective effect for CAD risk37. In our QR analyses, the rs326 G allele is significantly associated with decreased LDL at upper quantiles (Fig. 4b), suggesting a protective effect against CAD risk among individuals with higher levels of LDL.The advantage of QR is not restricted to identifying novel loci, missed by LR. QR can provide additional insights even for loci detected by LR. We show here two examples corresponding to well known associations: the FTO locus and BMI, and the PCSK9 locus and LDL. These loci have strong associations from the conventional LR approach, yet QR provides additional information on the heterogeneity of effects across different quantile levels beyond what one can get from LR (Fig. 4c, d). In particular, the rs11642015 T allele at the FTO locus (intron variant) has higher effects at the upper quantiles of the BMI distribution, while the rs11591147 T allele at the PCSK9 locus (missense variant) has stronger negative (protective) effect at the upper quantiles of the LDL distribution. Both variants have been reported before as likely to play a functional role38,39. In particular, the FTO rs11642015 resides in an enhancer that regulates the expression of the neighboring IRX3/IRX5 cluster through long-range interactions with the IRX3/IRX5 gene promoters, and was shown to repress mitochondrial thermogenesis in adipocyte precursor cells40.Heterogeneous associations identified by QR could be the result of gene-environment interactions. Even though gene–environment interactions are believed to be common, detecting them is challenging due to several factors, including lower statistical power relative to detecting main effects, the low prior probability of specific gene–environment interactions, etc. Furthermore, the relevant environmental exposures can be missing from the study, and even when measured, accurate measurements of exposure may not be available. The significant SNPs identified by QR can be subsequently tested for interactions with particular environmental factors using conventional gene–environment tests. Note that this is a two-stage design: first, we identify significant SNPs from a QR scan (only for SNP effect, without any environmental factor in the model), and second, we test for interactions between these SNPs and particular environmental factors using well established gene–environment tests with Bonferroni method to adjust for the number of gene–environment tests in the second stage (note that the two stages are independent).We selected five environmental factors that are available in the UKBB, including sex, age, physical activity (PA), sedentary behavior (SB) and smoking, and tested for interactions with top SNPs detected by QR. Overall, using lead QR SNPs at 7297 loci across 39 traits we identify 139 significant gene-environment interactions (significant after multiple testing adjustment for tests for each phenotype), 58 of which are significant at the study-wide level (p < 0.05/(7297 × 5) = 1.37e−06), after adjusting for all gene–environment tests across all phenotypes (Table S4 and Fig. 5a). To test the enrichment of gene–environment interaction effects among the lead QR variants, we repeated the gene–environment tests using the same number of randomly selected significant variants (LR or QR) across 39 traits. Over 1000 replicates, we observed, on average, 16 significant gene–environment interaction effects, significantly lower than the 58 observed interactions with leading QR variants (Fig. 5b). If we focus on QR–LR loci, only two study-wide significant interactions are detected among 259 × 5 tests: ARHGEF3 with age for platelet distribution width (p = 9.09e−16), and RSPO3 with sex for waist circumference (p = 7.49e−07). Note that this is still an 11-fold enrichment over the 2 study-wide significant interactions detected at the LR-QR loci among 2926 × 5 tests (p = 0.03). Similar results hold if we restrict to BMI and LDL (Fig. S20), for which most of the significant interactions are with age or sex. For example, for LDL we detected significant interactions between variants in PCSK9, APOB, CELSR2, LPA, and age. For BMI, we identified significant interactions between a variant in FTO and physical activity, and a variant in FAIM2 and sedentary behavior.Fig. 5: Significant gene–environment interaction effects of the lead QR variants for 39 traits.a The heatmap shows \(-{\log }_{10}(p\,\text{-value}\,)\) of gene-environment interaction tests for the lead QR association at the loci detected by QR. * denotes nominally significant GxE interaction effects (p-value  < 0.05) and ** denotes significant effects after Bonferroni correction (p < 0.05/(5 ⋅ 7, 297) = 1.37e − 06). Sample size by trait can be found in Table S4. b The empirical distribution of the number of significant gene-environment interactions when variants are chosen randomly among significant variants identified by LR or QR.We attempted to replicate the study-wide significant gene-environment interactions above in an independent dataset of European individuals from UKBB (Table S4). There is strong correlation among estimated effect sizes from the discovery and replication (r = 0.77) with almost perfect sign consistency with 56 out of 58 interaction effects showing the same sign (p < 2.2e − 308). In terms of p-value, 25 out of 58 tests are nominally significant (p < 2.2e − 308), with 6 out of 58 being significant after adjusting for 58 tests, including variants in TCF7L2 and age for Glycated hemoglobin (HbA1c), variants in SMOX and sex for High light scatter reticulocyte count, Reticulocyte percentage, and Immature reticulocyte fraction, variants in CSP1 and sex for Mean platelet (thrombocyte) volume. Similar results are obtained when relaxing the list of discovered interactions to the ones that are significant for each trait individually Supplementary Data 2.As mentioned above, one highly significant interaction was detected with rs1354034, a SNP with heterogeneous quantile effects that was completely missed by LR (Fig. 6). Interestingly, a strong non-additive association was previously observed for platelet distribution width at the same variant rs1354034, associated with increased expression of ARHGEF3 in platelets, the association being highly specific to platelets and not detected in other major blood cell types41,42. Platelet distribution width describes how similar the platelets are in size, with high values meaning great variation is size and being associated with vascular disease or certain cancers. The interaction of rs1354034 with age may be indicative of increased platelet activation with age which leads to more variable platelet width43. Specifically, at younger age there are low levels of platelet activation and the risk genotype may be silent. As platelet activation increases with age, the genotype expresses itself, which leads to stronger age-phenotype associations in the presence of the risk genotype. rs1354034 is a putative functional variant and has been studied before in connection with the platelet-related phenotypes. In particular, rs1354034 disrupts a conserved GATA motif and resides in a regulatory element that has previously been shown to affect the transcription of nearby ARHGEF344. Additionally, rs1354034 is associated in trans with the expression of von Willebrand factor (VWF) and other key platelet/megakaryocyte genes found at other loci45.Fig. 6: Association of rs1354034 at the ARHGEF3 locus with Platelet distribution width.a \(-{\log }_{10}(p\,\text{-value}\,)\) of QR with the lead QR SNP per locus shown as diamond. b \(-{\log }_{10}(p\,\text{-value}\,)\) of LR with the lead QR SNP per locus shown as diamond. Colors indicate the LD (r2) to the lead SNP for each variant. c The estimated quantile specific effects (dots) and the SE (error bars) are shown at τ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9. d The curves are the τ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 empirical conditional quantile functions. Shown in black is also the LR line fitted to the data. e gene-environment interaction between rs1354034 and age. The statistics in a-e were derived from n = 316,707 individuals used in the platelet distribution width GWAS. Statistical significance was determined by QR rank score test in panels a-d and two-sided t-test in panel e.

Hot Topics

Related Articles