Detection of distant relatedness in biobanks to identify undiagnosed cases of Mendelian disease as applied to Long QT syndrome

All subjects provided written informed consent, and participating studies obtained ethical approval from the Vanderbilt University Medical Center Institutional Review Board under protocols 9047 and 182258. Subjects did not receive compensation. Subject sex was determined by self-report and genotype. Subject sex was included as a variable in all multivariable analyses.Clinic subjectsNine probands and three family members from the Genetic Arrhythmia Clinic at VUMC were clinically genotyped and found to be heterozygous carriers of the KCNE1 p.Asp76Asn variant. One proband was identified incidentally in the eMERGE III (Electronic Medical Records and Genomics Phase III) sequencing study36, and was subsequently referred to the Genetic Arrhythmia Clinic. All clinic carriers were of European descent.Biobank subjectsBioVU is the VUMC biorepository linking deidentified electronic health records (EHRs) to over 300,000 DNA samples derived from specimens about to be discarded after clinical testing37. The EHR at VUMC contributes specimen-linked de-identified demographic data, clinical notes, electronic orders, laboratory measurements (including ECG data), Current Procedures Terminology (CPT) codes, and International Classification of Diseases (ICD-9 and ICD-10) codes. Currently in BioVU, 95,124 individuals have been genotyped on the Illumina Expanded Multi-Ethnic Genomic Array (MEGAEX), and 54,347 of these have had at least one ECG recorded.MEGAEX genotyping and quality controlVariants with > 2% missingness and individuals with >3% missingness were excluded from further analysis. All BioVU subjects were projected against principal components from all populations in 1000 Genomes22 to determine genetic ancestries using PLINK v.1.90b6.338,39. Within population groups, additional quality control was conducted using the following thresholds for inclusion: minor allele frequency (MAF) > 0.01, variant missingness < 0.05, sample missingness < 0.1, heterozygosity F < 0.2, Hardy-Weinberg equilibrium p-value < 1 × 10−10, removal of duplicate samples, and a requirement for genetic sex to match clinical records. After the quality assessments, 5925 subjects were filtered out, and 69,819 subjects were kept in the European superpopulation (EA), 15,603 in the African superpopulation (AFR), 897 in the East Asian superpopulation (EAS), 414 in the South Asian superpopulation (SAS), and 2466 in the Admixed Americans superpopulation, as previously described40. Therefore, 69,819 EA individuals with 718,367 variants were kept in further analysis. Although KCNE1 p.Asp76Asn was genotyped by the MEGAEX array, it was excluded from downstream analysis as it did not pass MAF threshold (rs74315445 MAF = 0.000156 in Europeans32) in quality control (QC).The 12 clinic subjects were genotyped on the MEGAEX array to generate haplotype data using the same array and following the same protocols as those in BioVU. Because all 12 were of European ancestry and bias can be introduced in small and highly related samples, we carried forward the set of QC-passed genetic variants in the BioVU EA dataset rather than conducting variant-based QC separately in the clinical subjects. For individual-level QC, all clinic samples’ genetic sex matched the recorded sex, heterozygosity levels were below 0.2, and all had a call rate >0.989. All 12 clinical subjects passed individual-level QC and were used for further analysis.Phasing and IBD detectionPhasing, the process of determining haplotypes (the sequence of alleles on a single chromosome) from genetic data, is a critical step for accurately identifying IBD segments. SHAPEIT441 was used to establish phase in genotype data from both BioVU EA subjects and clinic samples, separately. The BioVU EA dataset is sufficiently large to conduct phasing without an external reference panel. For the 12 clinic samples, the phased BioVU EA genetic data were used as the reference panel during phasing. The BioVU EA subjects and clinic samples were merged after phasing using BCFtools v1.942. We leveraged hap-IBD v1.027, a seed-and-extension approach, to detect IBD segments efficiently on biobank-scale data. We required a minimum shared segment length of 100 contiguous genetic markers and a minimum length of 2 centimorgan (cM) as initial seeds in hap-IBD and carried IBD segments longer than 3 cM forward in analysis to minimize analysis of erroneous segments. In addition, we extended our analyses with shorter IBD segments to find more distant relatedness, and we used hap-IBD for detecting IBD segments with a length longer than 1 cM.Pedigree reconstruction and distant relatedness estimationGenome-wide IBD proportions were calculated using the method-of-moments estimation function in PLINK38 after removing ancestry-informative SNPs in PRIMUS v1.9.06. Non-directional networks of first- and second-degree related individuals were reconstructed into pedigrees using PRIMUS. In addition the length and distribution of IBD segments genome-wide were used to identify more distant relatives (up to ninth degree) by ERSA 2.143, and then passed into PADRE 1.04 to generate pedigree-aware estimates of distant relatedness. Relatedness estimation and pedigree reconstruction were conducted in BioVU EA and clinic subjects separately.Local IBD clusteringThe purpose of local IBD clustering is to identify sets of people who share an identical IBD segment spanning a specific genomic region (gene) or position (genetic variant). Since current IBD detection tools only report pairwise IBD sharing, we developed the tool DRIVE. DRIVE is implemented in python 3.6 (https://github.com/belowlab/drive). DRIVE identifies all pairwise IBD segments spanning the target variant/gene and then conducts a three-step random walk approach, using segment length as the probability weight. Networks of close relatives would be expected to be highly connected, however spurious networks or networks connecting very short segments (due to distant relatedness) may be more sparsely connected. Therefore, for large (n > 30) and sparse (proportion of connected edges < 0.5) clusters, an additional random walk is conducted to split the clusters into smaller but more highly connected sub-clusters, with a maximum of five iterations of this process. Finally, using FastME 2.044, the inverse of shared IBD segment length is used to generate phylogenetic dendrograms for the clusters of interest.Whole exome-sequencing validationThe presence of KCNE1 p.Asp76Asn in each BioVU subject identified as sharing a p.Asp76Asn haplotype was assessed by exome sequencing on an Illumina NextSeq 500 with 150 bp paired-end reads following a standard Illumina protocol by the sequencing core at VANTAGE (Vanderbilt Technologies for Advanced Genomics). Sequencing quality control was conducted by fastp 0.20.1 to filter out short and low-quality reads, following established approaches45. Reads that passed QC were aligned to the hg38 human genome reference by BWA-MEM2 2.2.146, and the exome-wide variants were called by GATK 4.2.6.1 following standard best practices47.Mutation age estimationIncluding the BioVU p.Asp76Asn carriers identified by DRIVE, the clusters showed evidence of sharing a small haplotype at KCNE1. Because this suggests co-inheritance from a common ancestor, we randomly selected two from each cluster and estimated the age of the mutation event using the recombination clock model within the Genealogical Estimation of Variant Age (GEVA) tool, v1 beta21. The length of shared segments spanning the target variant were used to estimate the time to the most recent common ancestor with the European ancestry 1000 Genomes data as reference22. GEVA then estimates the age of the mutation event by comparing the estimated time to the most recent common ancestor among the pair of subjects that carry the target variant relative to pairs in which only one subject carries the target variant and the other one does not.Genome imputation and polygenic risk score calculationGenome-wide genotype imputation was performed on p.Asp76Asn carriers, along with 3000 BioVU subjects for the process of QC, phasing, and imputation on the Michigan Imputation Server48. The imputation controls were randomly selected from the subset of BioVU EA subjects with MEGA genotypes that passed QC, as described above. The subset of SNPs passing quality control that overlapped between p.Asp76Asn carriers and the imputation controls were used in imputation on the Michigan Imputation Server using the European ancestry Haplotype Reference Consortium version r1.1 2016 reference haplotypes49. SNPs with low imputation quality (R2 < 0.3) were filtered before further analysis.We then used the PRS for QTc developed by Nauffal et al.50 This PRS comprises 1,110,494 SNPs, and there were 1,110,297 overlapping SNPs between the PRS and our imputed genomes. Using the score function in PLINK 2.039, we calculated the PRS from the imputed genetic data of each carrier and control subject.Electronic health records reviewECGs obtained during routine clinical care were available for all clinic carriers, and for 13/22 biobank carriers. The QT interval corrected for heart rate (QTc) was calculated using the Bazett formula. For clinic and BioVU p.Asp76Asn carriers, QTc measurements were only used if the ECG was sinus rhythm or atrially paced, had a QRS duration <110 msec and a heart rate 50–100 bpm, and was not obtained while the subject was hospitalized or prescribed QT-prolonging drugs (Supplementary Table 1).Arrhythmia diagnoses in each carrier’s deidentified EHR were determined using ICD-9 and ICD-10 codes (Supplementary Table 2). Additionally, a text-search of each carrier’s EHR was performed for the terms cardiac arrest, long QT, LQT, PMVT, polymorphic ventricular tachycardia, SCD, seizure, sudden cardiac death, syncope, Torsade(s), TdP, ventricular fibrillation, V fib, and VF; any matches were reviewed by a physician to confirm the diagnosis.ECG controls were defined as the BioVU subjects who (1) clustered with 1000 Genomes subjects from the European superpopulation (European Ancestries-like, EA) in genetic principal components analysis and passed genomic QC (described above) (n = 69,819) and (2) had an ECG that met the following criteria: read as normal, in sinus rhythm, with QRS duration 65–110 msec and heart rate 50–100 bpm, obtained in an outpatient setting with no inpatient visits immediately before or after, with no QT-prolonging drugs prescribed at the time of the ECG (see Supplementary Table 1 for list of drugs51), from a subject with no prior heart disease ICD-9 or ICD-10 codes and potassium within the VUMC lab normal limits. Controls were restricted to the EA subset because all clinic carriers self-described as White and clustered with 1000 Genomes subjects from the European super-population, as above. One BioVU p.Asp76Asn carrier overlapped with this set and was removed from the control group. This resulted in an ECG control group of 3435 genotyped individuals (2218 females,1217 males) with an ECG meeting our criteria.Statistical analysisStatistical tests were performed using R version 4.2.152 Regression modeling used the rms package in R, v6.8-153. For continuous variables, parametric testing was used if each group had > 10 members and satisfied the Shapiro-Wilk test; otherwise, non-parametric testing was used. All parametric tests were two-tailed. P-value < 0.05 was considered statistically significant. For categorical variables, the Fisher’s exact test was used due to small sample sizes.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles