Interpretable GWAS by linking clinical phenotypes to quantifiable immune repertoire components

Genome-wide association of TCR repertoire unitsWe analyzed a cohort comprising both whole genome sequencing (WGS) and longitudinal peripheral blood RNA sequencing (RNA-seq) data (rfuQTL training set; Supplementary Table 1)19. Utilizing TRUST420, we identified TCR sequences from RNA-seq data, minimizing the amplification bias inherent in TCR sequencing (TCR-seq) data21. We quantified RFU abundances by counting the number of TCR sequences in each TCR repertoire unit. No significant correlation between RFU abundances and the sample age were observed when assessed at distinct time points (Spearman’s \(\left|\rho \right|\ge 0.2\), FDR < 0.05; Supplementary Fig. 2a). Hence, we aggregated TCR sequences over multiple time points of the same individual to get sufficient TCR sequences for RFU abundance calculations (Supplementary Fig. 2b). After quality control (Methods, Supplementary Fig. 2c, d), the dataset included 659 individuals, with an average of 11,304 unique CDR3 beta chains per individual (Fig. 1a, b).Fig. 1: Variants at TRB and HLA loci associate with RFU abundances.a the number of unique CDR3 sequences per subject in rfuQTL training set. b the distribution of CDR3 amino acid length in the training set. c, an example of variants-RFU associations. The top panel displays the RFU motif, while the bottom part illustrates the unnormalized RFU abundances corresponding to different variants. d P values of the association between whole genome variants and RFU abundances. Only associations with \(P < {10}^{-5}\) were plotted. The black horizontal line indicates a Bonferroni-corrected p-value significance threshold of \({10}^{-11}\). e the proportion of both rfuQTL and non-rfuQTL variants that overlap with ENCODE cCREs, relative to their total numbers. pELS, proximal enhancer-like signature. PLS, promoter-like signature. dELS, distal enhancer-like signature. f the proportion of variants that overlap with ENCODE3 Transcription Factor ChIP-seq Peaks. Only transcription factors with P value < 0.01 were plotted. Significance is evaluated using a one-sided Fisher’s Exact Test, with * indicating P < 0.01, ** P < 0.001, and *** P < 0.0001. The boxplots represent the median (horizontal line within the box), the interquartile range (IQR; the box itself), and the whiskers, which extend to the maximum and minimum values lying within 1.5 times the IQR from the box. Data points outside the whiskers are individually plotted.We used linear regression to test the associations between whole-genome variants and RFU abundances (n = 4953) after quantile normalization and correction for sex and the first five genotype principal components (PCs; Methods, Supplementary Fig. 3a–c). This normalization was necessary to prevent suspicious associations caused by extreme values. Additionally, similar to eQTL analysis22, we adjusted for the first two probabilistic estimation of expression residual (PEER)23 factors to account for hidden confounders, including sequencing depth, to avoid spurious associations (Supplementary Fig. 3d, e). We found 2,083 significant associations (\(P < {10}^{-11}\); Fig. 1d, Supplementary Fig. 4a, Supplementary Data 1). Adjusting for cell type proportions did not affect the association results (Supplementary Fig. 4b). We identified 2076 associations in the T cell receptor beta (TRB) locus, which is crucial for V(D)J recombination in TCR sequence generation. The most significant association was a missense variant in the non-coding region of TRBV 12-5 associated with the abundance of RFU 1415 (Fig. 1c). Consistently, TCRs assigned to RFU 1415 contained the initial sequence of the TRBV 12-5 CDR3 region (“CASGL”)24, suggesting that the variants may regulate the expression of TRBV 12-5. Additionally, we observed a substantial number of rfuQTL variants in non-coding regions, including one variant in TRB enhancer and MTA2 binding site, linked with RFU 2811 (Supplementary Fig. 4c).To further explain the roles of these variants, we analyzed the enrichment across different classes of regulatory elements within the ENCODE Registry candidate cis-Regulatory Elements (cCREs; n = 5). We found a significant enrichment of rfuQTL variants in proximal enhancer-like signature (pELS; \(P=1.3\times {10}^{-12}\), one-sided Fisher’s Exact Test; Fig. 1e). Additionally, we tested the enrichment of rfuQTL variants in transcription factor binding sites using the ENCODE3 Transcription Factor ChIP-seq Peaks dataset (n = 324). We observed significant enrichments for MTA2 and GATA1 (\(P < \,1.5\times {10}^{-4}\); Fig. 1f), with MTA2 previously linked to T cell activation and V(D)J recombination in B cells25,26. These results collectively suggested that both enhancers and specific transcription factors within the TRB locus played a crucial role in shaping the TCR repertoire.We identified seven associations in the HLA locus, known for its influence in shaping the TCR repertoire through thymic selection (Example in Supplementary Fig. 4d). To better interpret the functional consequence of the associated alleles, we also performed association testing between imputed HLA haplotypes (n = 28) and normalized RFU abundances. We identified 19 significant associations, including three in the MHC-I locus and sixteen in the MHC-II locus (Supplementary Table 2). RFUs with stronger associations with MHC-II than MHC-I are expected to be enriched with CD4 T cells, given their role in recognizing antigens presented by MHC-II. To test this hypothesis, we annotated CD4 and CD8 RFUs using TCR sequencing (TCR-seq) data of sorted CD4 and CD8 T cells (Supplementary Table 3; Supplementary Fig. 5a, Supplementary Data 2)27,28. We observed a correlation between the CD4/CD8 abundance ratio in an RFU and the MHC-II/MHC-I effect size ratio (Spearman’s ρ = 0.12, \(P=3.5\times {10}^{-3}\), Supplementary Fig. 4e), indicating that our RFUs were reflective of cell-type specificity.To reproduce these results, we analyzed an independent adult cohort dataset with both HLA haplotype and TCR-seq data (628 individuals; rfuQTL test set, Supplementary Table 1)29. After applying similar quality controls and covariate adjustments (Methods, Supplementary Fig. 3f, g), we observed a consistent effect size of the association between HLA haplotype and RFU abundances (Pearson’s \(r=0.50\), \(P=5.0\times {10}^{-10}\); Supplementary Fig. 4f). Together, these findings aligned with previous studies highlighting the impacts of genetic polymorphism of the TRB and HLA loci on V(D)J gene usage13, and extended these observations to the relative abundance of TCR clusters in a repertoire.Genetic prediction of RFU levels with regularized regression modelWe next sought to predict the RFU abundances using genetic variants since TCR repertoire data is usually not available in large cohorts. We called the predicted RFU abundances as genetically determined RFU (gdRFU). We employed both lasso and elastic net models to predict normalized and covariate-corrected RFU abundances based on several genetic variant sets, using variants from TRB, HLA, or TRB and HLA locus (TRB + HLA). We assessed the model performance through ten-fold cross-validation. The number of well-predicted RFUs of lasso model is larger than Elastic Net (Supplementary Table 4), and thus we focused on lasso model in the downstream analysis. Additionally, we utilized Genome-wide Complex Trait Analysis (GCTA)30 to estimate the heritability of these traits, which is the variance that could be explained by variants in the chosen variant set, setting a benchmark for the upper limits of our prediction accuracy. We observed that using variants solely from the TRB locus or a combination of variants from both TRB and HLA loci yielded comparable predictive outcomes and heritability estimates (Supplementary Fig. 6a, 6b). Consequently, we selected the most effective model from these variant sets.We then defined RFUs with a cross-validated \({R}^{2}\) greater than 0.01 as predictable RFUs (≥10% correlation between predicted and observed abundance), following common practice in transcriptome-wide association studies31. The prediction performance of our model was validated using the 398 individuals with imputed genotype data in rfuQTL test set. After preprocessing and prediction (Methods), we calculated the heritability of RFU abundances and the \({R}^{2}\) between predicted abundances and actual corrected abundances. Our model successfully identified 1,351 predictable RFUs, achieving a mean cross-validated \({R}^{2}\) of 0.025 and a mean predictive \({R}^{2}\) of 0.011 for these RFUs (Fig. 2a). We observed a strong correlation between model performance in cross-validation and the test set (Pearson’s \(r=0.50\), \(P=6.2\times {10}^{-85}\), Fig. 2b), revealing the model’s robustness and transferability, given the variations in genetic architecture, variant quality, sequencing methodologies, and participant ages across the two datasets. Two illustrative examples of these predictions were presented in Fig. 2d and e.Fig. 2: Lasso models predict RFU abundances from genetic variants.a the heritability (\({h}^{2}\)) and performance (\({R}^{2}\)) of cross validation (CV) on rfuQTL training set (n = 659) and testing on rfuQTL test set (n = 398). b the cross-validated and test performance. c the proportion of true naïve T cell within each RFU, comparing the distribution across all RFUs and predictable RFUs. Significance is evaluated using Wilcoxon signed-rank test, with *** indicating \(P < 0.0001\). d, e examples of RFU prediction. The RFU number, cell type, and motif are shown at the top, and a comparison of observed versus predicted values in rfuQTL test set is shown at the bottom.We next annotated the RFUs for T cell phenotypic subsets, including naïve cells (TN), central memory cells (CM), regulatory T cells (Treg), and stem cell-like memory T cells (Tscm) in a public TCR-seq dataset with sorted T cells (Supplementary Table 3)32. We calculated the fraction of each cell type in each RFU, and found predictable RFUs exhibited a significantly higher fraction of TN cells and a significantly lower fraction of Treg cells (Wilcoxon signed-rank test, \(P=1.3\times {10}^{-10}\) for TN and \(P=1.2\times {10}^{-9}\) for Treg; Fig. 2c, Supplementary Fig. 7). This observation is consistent with the previous observation that the TCR sequences of the naïve cells were more heritable compared to memory cells33. We then annotated RFUs as the cell type with the highest fraction if the abundances of different cell types are significant different (Friedman test, \(P < 3.7\times {10}^{-5}\)). In total, we identified 3 Tscm RFUs, 9 CM RFUs, 61 TN RFUs, and 95 Treg RFUs within predictable RFUs (Supplementary Fig. 5b, Supplementary Data 3).gdRFU abundances associated with autoimmune diseasesWe next performed a systematic RFU-wide association analysis (RfuWAS) utilizing individual-level data from the UKBB18 (RfuWAS dataset; Supplementary Table 1). After quality control and gdRFU predictions (Methods), the dataset contained 1086 diseases (by phecodes) and 1351 gdRFU abundances of 337,122 unrelated individuals, after removing relatives up to the second degree. We performed linear regression between diseases and gdRFU abundances, using first ten genotype PCs, sex, and age as covariates, and identified 2309 significant associations (\(P < 3.4\times {10}^{-8}\); Fig. 3a, b, Supplementary Fig. 8a, Supplementary Data 4). These associations showed a marked enrichment in autoimmune diseases (Chi-squared test, \(P < 2.2\,\times {10}^{-16}\)). Among the eight diseases with over 50 gdRFU associations, five were with autoimmune diseases, two were with thyroid diseases having potential autoimmune etiologies34, and one involved iron metabolic disorder, where iron was related with T cell development35. By comparing our results with GWAS associations in the Pan UK Biobank36, we identified that the kidney calculus (Phecode 594.1) was associated with RFU 2579 (\(p=7.1\times {10}^{-10}\)) but no variants in the TRB and HLA loci were associated with this condition. T cells have been implicated in the formation of kidney stones37, and it is possible that selected T cell clones are related to the disease.Fig. 3: Disease associations with gdRFU abundances.a the workflow of RfuWAS. QC, quality control. b Volcano plot showing the associations between various diseases and RFU abundances. We include only associations with \(P < 3.7\times {10}^{-5}\). The dashed line represents the Bonferroni-adjusted p-value threshold. Associations are color-coded by disease types, with types representing fewer than 100 associations classified as “other”. c the fraction of disease associated RFUs within each cell type. The numbers of associated RFUs are indicated in parentheses following each disease type, and the numbers of RFUs per cell type are indicated in parentheses following each cell type. d, f boxplots of z-scores of disease-specific versus non-specific RFUs in disease association analyses for CD (d) and T1D (f). e, g p-values are determined using one-sided Wilcoxon signed-rank test. The proportion of disease-specific RFUs (for CD in e and for T1D in g) among the top N significant RFUs identified by RfuWAS and TCR-seq. The red dashed lines indicate the expected proportion of antigen-specific RFUs under random prediction. Significance is assessed using Fisher’s exact test, with * denoting \(P < 0.05\).We then assessed the proportion of disease-associated gdRFUs in different cell types determined in the previous sections. We found a large proportion of central memory gdRFUs are disease-associated, with an enrichment for Type 1 Diabetes (T1D), CD, and two thyroid disorders (Fisher’s exact test, \(P < 0.0125\); Fig. 3c). Additionally, we compared the proportion of disease-associated gdRFUs in CD4 and CD8 cell types. While more CD4 gdRFUs exhibited disease associations, this increase was not significant (Supplementary Fig. 8c). Restricted within pathogenic gdRFUs (gdRFUs positively associated with disease), the proportion of disease-associated RFUs became significant higher from random in diseases like T1D and thyroid disorders (Fisher’s exact test, \(P < 0.025\), Supplementary Fig. 8d).We next investigated the antigen-specificity of these pathogenic gdRFUs. As described in our previous benchmark, TCRs assigned to the same RFU have 90% of the chance to be specific to the same pool of antigens. To demonstrate that our previous benchmark results were generalizable to diverse antigens, we repeated the benchmark with ten randomly selected epitopes. We found that the RFU method achieved similar performance, revealing its robustness across different antigens (Supplementary Data 5, Supplementary Fig. 8b). Therefore, the specificity of an RFU can be evaluated with the TCRs of known antigen-specificity. We analyzed such TCRs specific to CD-related antigens (gliadin) and T1D-related antigens (mainly Glutamic decarboxylase 65, GAD65) from McPAS-TCR database38 (Supplementary Data 6) and observed disease-associated gdRFUs were more likely to be antigen-specific, as measured by higher z-score (one-sided Wilcoxon signed-rank test, \(P < 0.05\); Fig. 3d, f). The proportion of antigen-specific gdRFUs in the top N significantly disease-associated pathogenic gdRFUs was higher than random (Fisher’s exact test, \(P < 0.05\); Fig. 3e, g). In contrast, similar analysis using disease-associated RFUs from TCR-seq dataset32,39 (Supplementary Table 3, Methods) revealed no significant difference in the proportion of antigen-specific RFUs compared to random prediction (Fig. 3e, g). Most CD-associated antigens are presented by HLA-DQ2 and T1D-associated antigens by HLA-DR4 (Supplementary Data 6). We further analyzed antigen-specific gdRFUs within HLA-restricted cohorts and observed a higher proportion of antigen-specific gdRFUs than random (Supplementary Fig. 8e, f). Together, these results potentially provided a genetic association-based approach to prioritize antigen-specific RFUs.gdRFU abundances associated with cancer survivalSimilar to autoimmune disorders, the outcomes of most malignant cancers are strongly influenced by the adaptive immune system40,41. We then investigated the impact of gdRFU abundances on cancer survival using the UKBB samples. Patient survival was defined by all-cause mortality. We used Cox proportional hazard model to test the effects of gdRFU on 42 neoplasms with over 130 events, adjusted for the first ten genotype PCs, sex, and diagnosis age. We identified five significant associations (FDR < 0.05; Fig. 4a–d, Supplementary Fig. 9), comprising four protective gdRFUs and one pathogenic gdRFU. Simultaneously, we evaluated all variants within the TRB and HLA loci for their impact on cancer survival and found that none surpassed the Bonferroni-adjusted p-value threshold (\(1.6\times {10}^{-6}\); Fig. 4a–d). Additionally, we used the gene expression of TRB and HLA genes predicted from the TWAS Elastic Net model of whole blood42 to test associations with cancer survival. We confirmed that the associations observed for gdRFU were not covered by the survival relevance of gene expression. Our findings align with previous research suggesting that the combinations of TRBV/TRBJ gene usage and HLA alleles associated with cancer survival, whereas the impact of these elements alone was not significant43,44. Together, these results suggested that the combined effects from multiple variants within these loci, interpreted from an immunological perspective, could reveal new genetic factors associated with cancer survival.Fig. 4: Cancer survival outcomes and gdRFU abundances.a–d Miami plot for cancer survival associations with gdRFU (upper) and variants in TRB and HLA loci (lower) for cancer of bronchus; lung (phecode 165.1, a), diffuse large B cell lymphoma (phecode 202.24, ICD 10 code C83.3, b), malignant and unknown neoplasms of brain and nervous system (phecode 191, c), malignant neoplasm of ovary and other uterine adnexa (phecode 184.1, d). Each point represents a p-value of the Cox proportional hazards models with gdRFU or genetic variants in HLA and TRB loci as variables, ordered by genomic position and colored by pathogenic or protective effects. The genomic position of gdRFU is defined as the variant position with the most significant association. Gray dashed lines indicate a Bonferroni-corrected p-value significance threshold. e a volcano plot showing genes differentially expressed between patients with protective RFUs and without protective RFUs. Fold changes were calculated using the median values of the two groups, and p-value were determined using the Wilcoxon signed-rank test and adjusted with Bonferroni procedure. The horizontal dashed line represents FDR of 0.05, and the vertical dashed line indicates a fold change of 2. Genes are color-coded as follows: red for significant differential expression, green for fold change > 2 or < \(\frac{1}{2}\) with FDR > 0.05, blue for fold change between \(\frac{1}{2}\) and 2 with FDR < 0.05, and grey for fold change between \(\frac{1}{2}\) and 2 with FDR > 0.05. f a gene set network diagram for gene sets enriched in patients with protective RFUs. Each node represents a gene set, with the node size correlating to the number of genes. For visualization clarity, only gene sets with an \({FDR} < 5\times {10}^{-5}\) were shown, and four gene sets (related to chemokine response and leukocyte apoptotic processes) distant from the main cluster were omitted.We next examined the potential functions of these gdRFUs. The protective gdRFUs were likely linked to neoantigen-reactive T cells45. For example, gdRFU 3827, protective for lung cancer (Supplementary Fig. 9b), contained NSDHL-specific TCR38, a gene prominently overexpressed in lung adenocarcinoma (LUAD; \(P=2.9\times {10}^{-10}\)) and lung squamous cell carcinoma (LUSC; \(P=2.8\times {10}^{-22}\)). It also predicted worse survival in LUAD (\({HR}=1.4\), \(P=0.038\); TIMER 2.0 sever46). Similarly, gdRFU 3459, protective for ovarian cancer (Supplementary Fig. 9f), harbored WT1-specific TCR47, a marker recognized for its diagnostic and prognostic significance48,49.We then investigated the phenotype characteristics of these protective RFUs in cancer patients. Specifically, we performed RFU analysis using the TCR sequences uncovered by TRUST20 from over 10,000 tumor samples from the Cancer Genome Atlas (TCGA) cohort (Supplementary Table 3). By examining differentially expressed genes (DEGs) between patients with and without protective RFUs, we observed an upregulation of genes related to T cell immunity, including CORO1A, ITGB2, CD74, and CXCL9, in patients with protective RFUs (Fig. 4e). Gene set enrichment analysis50 confirmed that genes expressed at higher levels in patients with protective RFUs were predominantly related to T cell proliferation, differentiation and activation (Fig. 4f). These results highlighted the potential role of the protective gdRFUs in fostering a T cell activation phenotype conducive to cancer protection.

Hot Topics

Related Articles