Integration of variant annotations using deep set networks boosts rare variant association testing

A more detailed exposition of the methods may be found in the Supplementary Methods.Ethics statementUKBB protocols are overseen by the UKBB Ethics Advisory Committee. Informed consent was obtained for all participants. Participants who revoked consent were removed from the analysis. The original approval for the UKBB was granted in 2011 by the National Research Ethics Service Committee North West—Haydock. The approval was renewed in 2016 and 2021 by the Health Research Authority, North West—Haydock Research Ethics Committee. This research has been conducted using the UKBB resource under applications 25214, 44108 and 81358.DeepRVAT modelModel and applicationsThe DeepRVAT gene impairment module is trained as part of an end-to-end multiphenotype prediction model. The pretrained gene impairment module can be leveraged in various downstream tasks, including RVAT and augmenting PRS with rare variant effects.Input dataThe input data for DeepRVAT consists of unordered sets of variants. The variant set for individual i and gene j is defined as follows:$${V}_{{ij}}=\left\{{\left({a}_{{kl}}\right)}_{l=1,\ldots ,d} \mid {\mathrm{variant}\; k\; \mathrm{present}\; \mathrm{in}\; \mathrm{individual}\; i},{\mathrm{gene}\; j}\right\},$$where akl is the lth annotation of variant k.Model architectureWe used a deep set architecture34. The variant set is first passed through a submodule φ, which computes a variant embedding \(\varphi \left({x}_{k}\right)\) for each \({x}_{k}=\left({a}_{k1},\ldots ,{a}_{{kd}}\right)\in {V}_{{ij}}\). Next, a gene embedding is computed using a fixed and permutation-invariant aggregation function f (we used the element-wise maximum), and then a second submodule ρ computes a scalar gene impairment score. Finally, we pass the result through a sigmoid function. In full,$$\psi \left({V}_{{ij}}\right)=\sigma \left(\rho \left(f\left({\left\{\varphi \left({x}_{k}\right)\right\}}_{{x}_{k}\in {V}_{{ij}}}\right)\right)\right),$$where both φ and ρ are multilayer perceptrons (MLPs), and σ is the sigmoid (logistic) function.Seed genesTraining of the gene impairment module begins by selecting, for each training phenotype, a set of trait-specific ‘seed genes’ from the set of all protein-coding genes. In this study, we base these on the results of alternative RVAT methods, specifically the ‘burden/SKAT combined’ method described below.Training objectiveDeepRVAT is trained in a multitask framework across multiple phenotypes. We estimate the pth phenotype \({y}_{i}^{\left(p\right)}\) for individual i as a linear combination of the gene impairment scores and covariates:$${\widehat{y}}_{i}^{\left(p\right)}={{\bf{x}}}_{i}^{T}{\bf{{\upalpha}} }^{\left(p\right)}+\sum _{j\in {S}^{\left(p\right)}}{w}_{j}^{\left(p\right)}\psi \left({V}_{{ij}}\right),$$where xi is the vector of covariates for individual i and S(p) is the set of seed genes for phenotype p. The parameters of ψ are shared across all variants, genes and phenotypes, while α(p) and \({w}_{j}^{\left(p\right)}\) are phenotype specific. The loss is given by the mean across phenotypes,$$L\left({y}_{i},{\widehat{y}}_{i}\right)=\frac{1}{P}\mathop{\sum }\limits_{p=1}^{P}\,L\left({y}_{i}^{\left(p\right)},{\widehat{y}}_{i}^{\left(p\right)}\right).$$Training-validation splitA data point for the DeepRVAT multiphenotype model was given by an individual–phenotype pair. Before training, data points were shuffled, and a validation set consisting of 20% of individuals was selected at random for each phenotype.Hyperparameters and softwareFor the variant embedding φ, we used a two-layer MLP with a width of 20. We used an MLP with two hidden layers of width 10 for ρ. In both networks, leaky rectified linear units with a negative slope of 0.01 were used as the activation functions. We used the mean-squared error (MSE) loss and the AdamW optimizer with a learning rate of 0.001. The batch size was 1,024. Early stopping was used to select the checkpoint with the lowest validation MSE loss. Training proceeded for a minimum of 50 and a maximum of 1,000 epochs. All DeepRVAT models were implemented in PyTorch (v1.13.1) and PyTorch-Lightning (v1.5.10).Ensembling and CV schemeThe dataset is first used for seed gene discovery using the ‘burden/SKAT combined’ method described below. Next, we partition the dataset across samples into K equally sized subsets (in this study, we used K = 5). We hold out one subset and train on all others, followed by computing gene impairment scores on the held-out samples using the trained model. This process is repeated for all subsets, resulting in a set of K models that can be used to estimate gene impairment scores for association testing, thereby avoiding information leakage without compromising sample size. Training within each CV fold is repeated for six random restarts, resulting in an ensemble of models. When estimating burden scores, all folds and random restarts that did not include the individual during training are averaged.Integration into single-marker association testing frameworksBecause DeepRVAT provides a single score per gene and sample, it can be seamlessly integrated into any tool that carries out single-marker association testing with genotype dosages. Practically, we implement this by providing a script that uses the bgen package (v1.6.1) to convert the (samples × genes) matrix of DeepRVAT scores to a BGEN file of pseudovariants. The DeepRVAT gene impairment score, ψ(Vij) is stored as the probability \({p}_{{ij}}=\left(\psi\left(V_{{ij}}\right),\mathrm{0,1}-\psi\left(V_{{ij}}\right)\right)\) of homozygous alternate, heterozygous and homozygous reference alleles so that the pseudodosage dij = 2ψ(Vij) lies in the usual range of [0,2].Alternative RVAT methodsBurden tests and SKATWe implemented burden and SKAT tests following ref. 22, using the score test from the SEAK package30 (v0.4.3). All combinations of burden and SKAT tests restricted to either pLOF or missense variants were carried out. We also created a combination test (burden/SKAT combined) using the full set of P values from all four individual tests. Variants were weighted with the betadensity function, that is, Beta(MAF; 1, 25). Ensembl VEP37 was used to annotate missense and pLOF variants, with the latter composed of stop gained, start lost, splice donor, splice acceptor, stop lost or frameshift variants. In SKAT tests, variants with minor allele count (MAC) ≤10 were collapsed as described in ref. 44. Due to computational constraints, we skipped genes with over 5,000 markers, impacting only one gene (Titin) for missense variant tests.STAARSTAAR tests were implemented using the STAAR package (v0.9.6) provided by the authors and following its vignette as well as the original publication28. To ensure optimal comparability with DeepRVAT, the same annotations as for DeepRVAT (described below) were used. As required for the STAAR procedure, each annotation was PHRED-scaled. Following ref. 28, STAAR P values were computed for five variant groups, namely (1) putative loss-of-function (stop gain, stop loss and splice), (2) missense, (3) disruptive missense, (4) putative loss-of-function and disruptive missense and (5) synonymous variants. We defined disruptive missense variants to be those that were predicted to be both ‘deleterious’ by SIFT14 and ‘probably damaging’ by PolyPhen216.Method by Monti et al.We used the same annotations, variant weight thresholds and variant kernel architectures as outlined in ref. 30. Annotation scores were obtained according to the details provided in Supplementary Table 3. We used the score test from SEAK to compute P values. In the case of missense and splicing tests, if either the collapsing or kernel-based association test yielded nominal significance (P < 0.01), we performed joint testing with pLOF variants. The P values from these tests were aggregated using the Cauchy combination method.REGENIEBurden and SKAT tests were run using both missense and pLOF masks, yielding four combinations. For burden tests, we used the default REGENIE strategy of collapsing variants to gene level using the maximum number of alternate alleles across sites. We used the approximate Firth likelihood ratio test for P values less than 0.01. Identically to the tests implemented in SEAK mentioned above, weights for SKAT tests were computed as Beta(MAF;1,25).Combination of multiple P values per gene and multiple-testing correctionThe burden/SKAT combined method, the method Monti et al.30 and STAAR each yielded multiple P values per gene. These were aggregated at the gene level using the Bonferroni procedure, and the smallest P value per gene was retained. For all methods, we again used Bonferroni correction across all 19,388 tested genes.Expected allele frequency (EAF) filteringFollowing ref. 22, we restricted testing to genes that passed an EAF filter of at least 50. The EAF is defined as CAF × n, where CAF is the cumulative allele frequency (the sum of allele frequencies of all qualifying variants j in the gene) and n is the cohort size for quantitative traits or the number of cases for binary traits.UKBB WES dataExome-sequencing dataWES (+100 bp overhang) was performed on 469,779 participants from the UKBB20, for which the methods have been described in the earlier release of data from approximately 50,000 individuals54. We refer to this as the UKBB 470k WES dataset and use the UKBB 200k WES dataset to refer to the interim release from 200,633 participants.Variant data and quality control (QC)For both cohorts mentioned above, variant calling data were downloaded from the UKBB as project-level VCF (pVCF) files. We applied additional QC following ref. 54 and using bcftools55 (v1.10.2)—minimum read depth of 7 for SNPs and 10 for InDels; at least one homozygous variant genotype or at least one sample per site with an allelic balance ratio greater than 15% for SNPs and 10% for InDels; fraction of missing genotypes <10%; and Hardy–Weinberg equilibrium P value > 10−15. Additionally, we filtered out individuals with >10% missing genotype rate and those who had withdrawn from the study. Finally, following the analysis best practices recommended by UKBB, we applied a coverage filter, requiring that at least 90% of all genotypes for a given variant have a read depth of at least 10. This resulted in datasets with 200,583 individuals and 12,704,497 variants (UKBB 200k WES), and 469,382 individuals and 26,141,967 variants (UKBB 470k WES).Custom sparse genotype data formatTo enable fast, multiple iterations over the WES datasets, we created a custom sparse dataset in Hierarchical Data Format 5 (HDF5 v1.10.6). Details are provided in the Supplementary Methods. For UKBB 200k WES, the HDF5 dataset had a storage size of approximately 100 GB, compared to multiple terabytes for the original pVCF files.CovariatesWe retrieved genetic sex, sample age and the first 20 genetic principal components (PCs) directly from UKBB (Supplementary Table 5). We computed age2 and age × sex to use as additional covariates. Covariates were included in association testing with all methods and when training DeepRVAT.Variant-to-gene assignmentsVariants were assigned to genes using those protein-coding genes and associated exons marked as golden in the merged Ensembl/HAVANA genome annotations (GENCODE release 38). We assigned a variant to a gene if it was located at most 300 bp from an associated exon.Variant annotationsThe full collection of variant annotations used and their sources are provided in Supplementary Table 3. Here we give details on processing for those annotations that were not used directly in the form output by the source.MAFMAF values for variants were first replaced with the maximum of the MAF in the UKBB cohort and in gnomAD release 3.0 (non-Finnish European population). Following ref. 23, the MAF pj of each variant j was then transformed according to the formula \({\left[{p}_{j}\left(1-{p}_{j}\right)\right]}^{-\frac{1}{2}}\) for use in modeling.DeepSEATo improve model fitting and avoid overfitting, we performed PC analysis and restricted to the first six PCs of DeepSEAs 919 predicted variant effects.SpliceAISpliceAI provides four ‘delta scores’ indicating a variant’s predicted effect on cryptic splicing40. We computed the maximum of these four scores and used it as a single annotation.AbSpliceWe computed the maximum predicted effect across tissues and used this as a single annotation.DeepRiPeAs in ref. 30, we predicted the effects of genetic variants on the binding of six RBPs over three cell lines.MAF thresholdsFor DeepRVAT training, we used variants with MAF < 1%. For association testing with all methods, we designated rare variants as having MAF < 0.1%. Additionally, for both training and association testing, we restricted to variants with PHRED-scaled CADD value >5.Phenotype dataAll phenotype data (Supplementary Table 4) were obtained directly from UKBB, except for WHR, which was computed as the ratio of UKBB data field 48 to data field 49 and corrected for body mass index by regressing out the corresponding data field 21001. Phenotype values were quantile transformed to match their empirical distributions to a standard normal distribution. For individuals with reported statin usage, we adjusted cholesterol (30690) by dividing by 0.8 and LDL-direct (30780) by dividing by 0.7, following refs. 56,57. Statins considered were obtained from ref. 58 and matched to UKBB treatment codes (20003). Binary traits were extracted using the definitions from ref. 21.DeepRVAT training and association testing on UKBB dataSubselected cohortsBecause the various methods used for benchmarking control for sample relatedness and population structure differently, or not at all, we retained only unrelated individuals of European genetic ancestry from the UKBB 200k WES dataset for DeepRVAT model training and benchmarking against alternative RVAT methods. Individuals related to third degree or closer were identified according to UKBB Resource 668. Individuals of European ancestry were identified using UKBB data field 22006 (termed ‘Caucasian’). This filtering resulted in a dataset (denoted UKBB 200k unrelated European ancestry below) of 161,822 individuals. For testing the integration of DeepRVAT with REGENIE (Fig. 4a, we additionally used all 167,214 individuals of European genetic ancestry.Training and gene impairment scoringSeed gene discovery and DeepRVAT training were carried out on the UKBB 200k unrelated European ancestry dataset. Training and gene impairment scoring were done according to the CV scheme described above. An ensemble consisting of all 30 models from the CV step (six ensemble models from five training folds) was used to compute gene impairment scores for the remaining 307,560 individuals from the UKBB 470k WES cohort.Association testingFor the method denoted DeepRVAT, association testing was carried out using the score test from SEAK (v0.4.3). Association testing for the method DeepRVAT + REGENIE was carried out with REGENIE (v3.4.1). Following the REGENIE documentation, for step 1, we selected approximately 500k (precisely, 483,446) imputed SNPs from UKBB data field 22828 using the following filtering: MAF < 0.06, MAC > 100, genotyping rate >0.99, Hardy–Weinberg P value ≥ 10−15 and sample missingness <0.1. Additionally, we pruned SNPs with a pairwise linkage disequilibrium r2 threshold of 0.9, using a window size of 1,000 and a step size of 100. Step 2 of REGENIE was run on DeepRVAT gene impairment scores for each gene, derived as described above. For quantitative traits, the default options of REGENIE were used. For binary traits, we used the approximate Firth likelihood ratio test with a P value threshold of 0.01. To account for multiple testing across genes, we applied Bonferroni correction for DeepRVAT and alternative methods.Comparison with other UKBB RVAT studiesWe compared our results to gene–trait associations from two studies20,22 on larger WES cohorts from UKBB (454,787 and 394,841 individuals, respectively). We counted as a discovery any association that was considered significant according to the methodology of the study. To compute replication for quantitative traits, we ranked all gene–trait associations by P value and, at each rank, counted the number of associations that overlapped with discoveries from the larger cohorts.Conditional association testsFor associations that were significant after multiple testing correction, we conducted conditional association tests using GWAS summary statistics from the Pan-UKBB study59. Independently associated variants were identified from GWAS summary statistics through LD-based clumping using PLINK60 (v1.9) with default parameters, restricting to associations with a P value < 10−7 and MAF > 1%. If a binary trait definition used in this study did not exactly match a single GWAS from Pan-UKBB, we combined P values from all relevant GWASs that covered parts of the trait definition before performing clumping. For association testing with SEAK in the method denoted DeepRVAT, independently associated variants within 500 kb around the gene boundaries were incorporated as covariates in the conditional analysis. For association testing with REGENIE (that is, DeepRVAT + REGENIE), all variants independently associated with a specific trait were considered for all genes.Phenotype prediction using DeepRVAT and alternative rare variant scoresDatasetTraining and evaluation of the regression models were done on two disjoint datasets, both restricted to unrelated individuals of European ancestry. A total of 154,966 (from UKBB 200k WES) and 224,817 individuals (from UKBB 470k WES, not found in UKBB 200k WES) were used for training and evaluation, respectively.PRS computationCommon PRS variants and effect sizes were all obtained from the Polygenic Score Catalog61 using the study from ref. 35. The catalog numbers of each common variant PRS are listed in Supplementary Table 4.Gene discoveryOnly the training individuals were used for gene discovery. For alternative burdens, we used the method ‘burden/SKAT combined’. Retaining associations at FWER < 0.05 resulted in a set of genes \({G}_{b}^{\left(\,p\right)}\) for phenotype p to use in the baseline prediction models. For DeepRVAT, gene discovery was conducted across all 33 traits of interest exclusively on training samples, following the method outlined above, using gene impairment scores obtained using the CV scheme. This yielded a set of trait-associated genes \({G}_{d}^{\left(\,p\right)}\) at FWER < 0.05.Burden and gene impairment scoresOn the test set, alternative burdens were computed as the maximum across all variants in a given individual and gene (excluding SIFT, where the minimum was used). DeepRVAT gene impairment scores for the test set were computed analogously to those used in association testing, using all models trained as part of the CV scheme on the evaluation set.Phenotype predictor training and evaluationFor simplicity, we describe models for predicting raw phenotype values. Prediction of extreme values is analogous, with logistic regression on the binary target replacing linear regression.BaselineAs a baseline phenotype predictor, we consider a regression model where the explanatory variables comprise covariates (as described above) and the common variant PRS score:$${\widehat{y}}_{i}^{\left(\,p\right)}={\bf\upalpha }^{T}{{\bf{x}}}_{i}+{\beta }_{c}^{\left(\,p\right)}{c}_{i}^{\left(\,p\right)},$$where \({c}_{i}^{\left(\,p\right)}\) is the common variant PRS score of sample i for phenotype p and, as given above, xi is the vector of covariates for sample i.Extension with rare variantsTo incorporate the effects of rare variants into the phenotype predictors, we extended the common variant PRS models by the rare burden scores of significant genes, with models incorporating DeepRVAT or alternative burdens given, respectively, by$${\widehat{y}}_{i}^{\left(\,p\right)}={\bf\upalpha }^{T}{{\bf{x}}}_{i}+{\beta }_{c}^{\left(\,p\right)}{c}_{i}^{\left(\,p\right)}+\sum _{j\in {G}_{d}^{\left(p\right)}}{\beta }_{j}^{\left(\,p\right)}{\psi }_{r}^{* }\left({V}_{{ij}}\right),$$$${\widehat{y}}_{i}^{\left(\,p\right)}={\bf\upalpha }^{T}{{\bf{x}}}_{i}+{\beta }_{c}^{\left(\,p\right)}{c}_{i}^{\left(\,p\right)}+\sum _{j\in {G}_{b}^{\left(p\right)}}{\beta }_{j}^{\left(\,p\right)}{s}_{{ij}}.$$The difference lies in whether DeepRVAT or alternative burdens sij are used, and additionally the burdens and learned gene weights \({\beta }_{j}^{\left(\,p\right)}\) range over either the ‘burden/SKAT combined’ gene set \({G}_{b}^{\left(\,p\right)}\) or the DeepRVAT gene set \({G}_{d}^{\left(\,p\right)}\).Model fittingThe linear and logistic regression models were fit in R (v4.2.0) using the functions lm and glm (respectively) from the stats package using the family binomial() for logistic regression models and otherwise retaining the default parameters.Phenotype predictor assessmentWe calculated the relative improvement of the model that leverages rare variant burdens compared to the common variant PRS-only model as$${\mathrm{Relative}}{{\Delta}} M=\frac{{M}_{{\mathrm{rare}}}-{M}_{{\mathrm{PR}}{\mathrm{S}}_{{\mathrm{only}}}}}{{M}_{{\mathrm{PR}}{\mathrm{S}}_{{\mathrm{only}}}}},$$where M denotes the area under the precision–recall curve (AUPRC) or R2. Next, for each phenotype and individual, we calculated the absolute difference between the predicted phenotype values obtained from a linear model using either the common variant PRS alone or the common variant PRS together with the rare variant burdens and ranked individuals based on the magnitude of this difference. At each rank, we determined the count of individuals exhibiting outlier phenotypes, specifically those falling within the top or bottom 1% of the phenotypic distribution. Finally, we tested the enrichment of phenotype predictor outliers in individuals with extreme phenotypes. For each z-score phenotype outlier cutoff, we identified individuals above the phenotypic cutoff and determined the proportion of these individuals with a predicted phenotype value exceeding the 99% quantile. Enrichment scores were scaled relative to the baseline population (z-score = 0) and compared to the common PRS-only model.Statistics and reproducibilityNo statistical method was used to predetermine the sample size. We did not use any study design that required randomization or blinding. For benchmarking experiments we restricted to unrelated individuals of European ancestry. All individuals (related and multi-ancestry) were included for biological discovery.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles