An optimized instrument variable selection approach to improve causality estimation in association studies

We demonstrate the robustness of our proposed pipeline using five different MR datasets. Overall, the results of these datasets revealed an improvement in the identification of valid causal SNPs using our pipeline when integrated with existing MR methods.Single-sample MR datasetMR analysis using default parameters (without using proposed pipeline)We prepared a single-sample MR dataset comprising 41,802 SNPs using data from PennCATH cohort study (see Materials and Methods section). Linkage disequilibrium (LD) clumping was performed separately on total cholesterol (TC) and coronary artery disease (CAD), resulting in 4093 and 4081 SNPs for each dataset, respectively. The data harmonisation was applied to LD-clumped SNPs, which resulted in 679 SNPs for MR analysis. Of these, 668 SNPs exhibited causal effects of TC on CAD. However, the results of MR analysis using five methods with default parameters showed no causal association between TC and CAD, as illustrated in Table 1. Furthermore, a spurious negative association (\(\beta<\)0) was observed, suggesting a weak ambiguous relationship between TC and CAD.MR analysis using proposed pipelineThe t-statistics were calculated separately for TC and CAD on 41,802 SNPs of single-sample MR dataset. The average t-statistic threshold was applied for TC (0.79) and CAD (0.80). A total of 8757 and 9009 SNPs satisfied the average t-statistics threshold in the TC and CAD, respectively. Further, we applied LD-clumping on the selected SNPs which resulted in 2196 SNPs for TC and 2232 SNPs for CAD. None of these SNPs demonstrated association with both TC and CAD, indicating that they all followed the third MR-study assumption. Afterwards, data-harmonization was performed on the above-selected SNPs, resulting in 153 SNPs for MR analysis. Out of 153 SNPs, 150 SNPs displayed causality in all five MR methods. A total of 42 SNPs out of 153 SNPs overlap with 668 SNPs identified above using default parameters. Generally, for a large sample-size dataset, F-statistics > 10 as a threshold has been suggested for detecting dubious bias13. However, this threshold cannot be achieved due to the small (n = 1277) sample size of our dataset.Table 1 Comparison of MR method results of the single-sample (PennCATH study) dataset using default parameters and our pipeline.In the next step, five MR methods were applied to analyze the harmonized dataset (see Table 1). All MR methods except MR-Egger demonstrated a causal association between TC and CAD. The MR-Egger method has consistently resulted in less significant results for cholesterol and heart disease in previous studies14,15. To ensure visual clarity, we have presented figures for 74 SNPs (out of 150 SNPs) that possess a TRUE value for the Steiger filter. Our results indicated a positive association between TC and CAD which is in contrast to results obtained using default parameters. The scatter plot illustrates this positive association between TC and the likelihood of CAD (see Fig. 3A). Further, the MR-Egger with a simulation extrapolation (SIMEX) method was applied to verify the horizontal bias correction. The result demonstrated a significant association between TC and CAD (P = \({1.20} \times 10^{-2}\)). Moreover, the directionality test showed the value of \(r^2\) of exposure (0.36) exceeded the outcome (0.34). This indicates the direction of causality is from TC to CAD. Our analysis indicated the absence of reverse causality in this dataset. Further, the heterogeneity test confirms the lack of heterogeneity indicated by non-significant Cochran’s Q value (104) and P (0.99). After the heterogeneity test, we performed MR-PRESSO (MR-Pleiotropy RESidual Sum and Outlier) analysis to check for outliers among the selected IVs. Our MR-PRESSO analysis did not find any outliers, suggesting a minimal likelihood of horizontal pleiotropy and thereby supporting the validity of the third assumption in MR studies. In addition, the pleiotropy test produced a P = \({2.32} \times 10^{-8}\) and an intercept = 0.13, indicating less occurrence of horizontal pleiotropy in the selected IVs. Furthermore, we illustrate the influence of individual SNPs on the causal estimate of CAD through leave-one-out (LOO) analysis in Fig. 3B. In addition, the collective impact of all SNPs correlated with TC on CAD is shown in the forest plot (Fig. 3C). All plots for 150 SNPs are displayed in the supplementary document (see Figs. S1, S2A, and S2(B)).Fig. 3Single-sample MR analysis of total cholesterol (TC) on coronary artery disease (CAD) (panels A–C). Two-sample MR analysis of liver iron content (LIC) on liver cell carcinoma (LCC) (panels D–F). (A,D) Scatter plot compares five different MR methods. (B,E) Leave-one-out analysis showing influence of individual SNPs. (C,F) Forest plot shows collective impact of all SNPs.Two-sample MR datasetMR analysis using default parameters (without using proposed pipeline)The two-sample MR datasets for liver-iron-content-liver-cell-carcinoma (LIC-LCC) were collected from the MRC IEU database. At first, we retrieved 10 SNPs associated with LIC (exposure) based on their genome-wide significance threshold (P < \({5} \times 10^{-8}\)). The SNPs associated with LCC (outcome) were obtained from the same database. However, we searched only for those SNPs that overlap with SNPs identified for LIC and a total of 8 SNPs associated with LIC and LCC were obtained. The LD-clumping was performed on both datasets with default parameters. The LD-clumping resulted in 7 and 5 SNPs in LIC and LCC, respectively. Afterwards, the harmonisation was performed on the LD clumped data, resulting in 5 SNPs. The MR analysis was conducted on these 5 SNPs, results are shown in Table 2.Table 2 Comparison of MR method results of the two-sample (LIC-LCC) dataset using default parameters and our pipeline.MR analysis using proposed pipelineIn our proposed pipeline, after retrieving the LIC and LCC datasets, a preprocessing step was performed. After preprocessing, a total of 4233 SNPs for LIC and 1624 SNPs for LCC were obtained. The average t-statistic threshold was calculated separately for LIC (3.26) and LCC (0.80). A total of 1070 and 355 SNPs satisfied the average t-statistics threshold in the LIC and LCC, respectively. The LD-clumping resulted in 473 and 224 SNPs for the LIC and LCC, respectively. None of the SNPs obtained after LD-clumping demonstrated association with both LIC and LCC, indicating that all these SNPs followed the third assumption of MR-study. The data harmonization process resulted in 14 SNPs. Our pipeline found more causal SNPs (n = 14) compared to default parameters (n = 5) with only one overlapping SNP (rs80215559). All 14 SNPs identified from our pipeline had F-statistics > 10, indicating their strong candidacy for causality tests and minimal bias in MR analyses, even in the presence of potential sample overlap. We then applied the MRLap method to the LIC and LCC datasets, and the estimates from MRLap were consistent with those obtained from the IVW method. This further suggests minimal bias due to sample overlap, if present. Detailed results are provided in the supplementary material16.
The results of MR analysis are shown in Table 2. All five methods displayed a causal association between LIC and LCC. All these 14 SNPs passed the Steiger filter. The scatter plot of MR methods depicted a positive correlation between LIC and LCC (see Fig. 3D). The horizontal bias correction using MR-Egger SIMEX method displayed a significant association between LIC and LCC (P = 0.03). The directionality-test indicated the direction of causality from LIC (\(r^2_{exposure}\) = 0.03) to LCC (\(r^2_{outcome}\) = 0). Further, our analysis indicated an absence of reverse causality in this dataset. The heterogeneity test confirms the lack of heterogeneity (Cochran’s Q-value = 1.99, P = 0.99). The MR-PRESSO analysis did not find any outlier, suggesting a low probability of horizontal pleiotropy. Though the pleiotropy test (P = \({3.49} \times 10^{-3}\)) indicated the presence of horizontal pleiotropy, but the intercept value (\({6.98} \times 10^{-5}\)) was extremely small. This implies that the horizontal pleiotropy had no effect on the MR analysis. The impact of individual SNPs on the causal estimate of LCC is demonstrated through LOO analysis (see Fig. 3E), and the collective impact of all 14 SNPs correlated with LIC on LCC is shown in the forest plot (Fig. 3F).To evaluate the robustness of our proposed pipeline, we applied it to four different two-sample datasets collected from different super-populations. The F-statistics for the selected IVs were calculated for all four datasets, resulting values > 10. We then applied the MRLap method to these datasets. Both the F-statistics and MRLap estimates indicate minimal bias resulting from sample overlap, if present (see the supplementary material for details). Table 3 demonstrates a comparison between two frequently used MR methods with default parameters implemented in MR analysis and our proposed pipeline. The MR analysis results with default parameters were comparatively less significant in comparison to the result obtained by our pipeline. These results are explained in the supplementary document. Additionally, the chromosomal mapping of causal SNPs identified in our study comprises all five MR datasets, are displayed in Fig. 4. These SNPs are distributed across 22 chromosomes without displaying any discernible bias towards specific chromosome arms. Chromosome 21 lacks causal SNPs for any of the studied MR samples. Additionally, since our focus was solely on autosomes, chromosomes X and Y also do not feature any causal variants in this analysis.Fig. 4Chromosomal mapping of causal SNPs identified using our pipeline on one single-sample (triangle shapes) and four two-sample (coloured circles) MR datasets.Table 3 Shows P value comparison of inverse-variance weighted (IVW) and MR-Egger methods on five different datasets using default parameters and after using our pipeline.Functional annotation and enrichmentWe performed functional annotation of the identified causal SNPs to understand their biological significance and functionalities in respective exposures and outcomes. We first searched the literature to identify the functions of SNPs found in our analysis by querying genome-wide repository of associations between SNPs and phenotypes (GRASP)17 database. The detailed analysis is presented in the supplementary Table S1.Expression quantitative trait locus (eQTL) analysisIn single-sample MR dataset, identified causal SNPs that influence CAD through TC, 11 SNPs display expression across various tissues. SNPs rs2855122, rs1455180, rs10464982, rs1484209, rs3822668, rs2726958 and rs13251032 exhibited expression in whole-blood. Additionally, rs9630400 and rs2726958 showed expression in the left ventricle. Furthermore, rs10190872 and rs2178532 demonstrated expression in adipose-subcutaneous tissue and whole-blood, with rs2178532 also showing expression in the aorta. Finally, rs370389 exhibited expression in the atrial appendage. Similarly, eQTL analysis was conducted in the two-sample MR data focusing on LCC risk mediated by LIC. Three SNPs (rs7242964, rs77251002, and rs80215559) out of 14 identified causal SNPs revealed tissue-specific expression patterns in liver tissue. The details of the corresponding P of SNPs obtained from both datasets are displayed in Supplementary Table S2.Gene annotationTo annotate SNPs associated with protein-coding genes, we used the BiomaRt, R package18,19. The gene ontology (GO) analysis on genes identified from the single-sample dataset confirmed their involvement in cholesterol modifications and CAD susceptibility. Molecular functions (MF) such as “Guanyl nucleotide binding”20, “GTP binding”21, “Ferritin receptor activity”22, and “Phosphatidylinositol binding”23, along with biological processes (BP) like “Microvillus assembly”24 and “Cytolysis”25, display a role of cholesterol in various cellular functions. Cholesterol also induces “Positive regulation of pyroptosis”26,27, linked to NOD-like receptor 3 (NLRP3) inflammasome activation, which is associated with hyperlipidemia, diabetes, hypertension, obesity, and hyperhomocysteinemia28. Additionally, cellular secretion of interferon (IFN)-\(\gamma\) and cholesterol contribute to atherosclerotic CAD29. Gene enrichment analysis highlighted the presence of elongator core complex and trafficking protein particle (TRAPP) complex proteins. Mutations in one of the subunits of the elongator (i.e., ELP2) can lead to Hypertrophic cardiomyopathy, suggesting ELP2 as a potential target for cardiac hypertrophy treatment30. Furthermore, dysfunction in the TRAPP complex may play role in cardiac dysfunction, warranting further investigation into its potential implications in cardiovascular health31. Figure 5A and supplementary Fig. S3 represent enrichment analysis resulting from g:Profiler32 and GOnet33 web-tools, respectively.Similarly, GO analysis was performed on genes identified from the two-sample dataset. MF such as “Malic enzyme activity”, “Protein kinase C activity”, “Ca-dependent protein kinase C activity”, and “Ca-dependent serine/threonine kinase activity” highlight key regulatory pathways implicated in LCC progression. Disruptions in malic enzyme regulatory pathways have been associated with reduced LCC progression, while overexpression of calcium-dependent protein serine/threonine kinase promotes cancer cell proliferation. BP like “Sodium ion transport”, “TRAM-dependent TLR4 signalling pathway”, “Positive regulation of mucus secretion”, and “Regulation of cellular glucuronidation” underscore the intricate mechanisms underlying liver cancer pathogenesis, including immune response modulation and metabolic alterations. Further, “Signaling by ERBB2” and “DAG and IP3 signaling and Effects of PIP2 hydrolysis” implicate the activation of signaling cascades and cellular responses that contribute to LCC development. Overall, the comprehensive analysis of GO and pathways underscores the biological significance of identified causal genes in LCC and LIC regulation (see Fig. 5B).Fig. 5Enrichment analysis of identified causal genes in (A) single-sample (TC and CAD) MR dataset and (B) two-sample (LIC-LCC) MR dataset; (C) disease enrichment obtained for the single sample dataset. Abbreviations: GO, gene ontology; TC, total cholesterol; CAD, coronary artery disease; LIC, liver iron content; LCC, liver cell carcinoma.Disease enrichmentFor single-sample dataset Fig. 5C presents the top 10 enriched disease terms generated using the PheWeb library34. Our analysis revealed a significant association of identified genes with cholesterol-related conditions and cardiovascular diseases (CVD), including hypertensive heart disease and mixed hyperlipidemia. Further, we identified heart-related gene expressions annotated from causal SNPs in a single-sample dataset (Fig. S3). Similarly, for the two-sample MR dataset, we compared the expression levels of ME1 and SLC17A2 genes in LCC patients and healthy individuals. We found a significant (P = \({1.6} \times 10^{-6}\)) over-expression of the ME1 gene (Fig. S4A–B), while the SLC17A2 gene exhibited under-expression in LCC patients35. Moreover, survival analysis revealed that increased ME1 expression and lower SLC17A2 expression were associated with poorer survival in LCC patients (Fig. S4C–D). Further, the protein expression of both genes was analyzed (see the supplementary material and Fig. S4E)36.

Hot Topics

Related Articles