Comprehensive genome analysis and variant detection at scale using DRAGEN

Algorithms for comprehensive genomics at scale and accuracyIn this paper, we present a framework (DRAGEN v.4.2.4) to identify all types of genomic variations at scale and cost. Figure 1 provides a brief overview of DRAGEN’s main components. First, each sample is mapped to a pangenome reference, consisting of a reference and several assemblies, for example, GRCh38 in addition to 64 haplotypes (32 samples) together with reference corrections previously reported23 to overcome errors on the human genome. The pangenome reference includes variants from multiple genome assemblies to better represent sequence diversity between individuals throughout the human population. Briefly, the seed-based mapping considers both primary (for example, GRCh38) and secondary contigs (phased haplotypes from various populations) throughout the pangenome. The alignment is controlled over established relationships of primary and secondary contigs and is adjusted accordingly for mapping quality and scoring (see Supplementary Information for details). DRAGEN’s mapping process for a 35× WGS paired-end dataset requires approximately 8 min of computation time using an onsite DRAGEN server (Supplementary Table 1 provides details of the time taken in each step for both an Amazon Web Services (AWS) F1 instance and an onsite Phase4 server). The pangenome reference can be updated with advancements (for example, T2T-CHM13 or HPRC pangenome assemblies) and can enable a more precise and comprehensive alignment of the short reads. These improved alignments are leveraged for variant calling.Fig. 1: Overview of the DRAGEN variant calling pipeline.a–g, DRAGEN improves variant identification from a single base pair to multiple megabase pairs of alleles. This is achieved by implementing multiple optimized concepts. a, Mapping uses a pangenome reference including 64 haplotypes. b, SV calling is substantially improved over local assemblies based on breakpoint graphs; Chr, chromosome; DEL, deletion; DUP, duplication; INS, insertion; INV, inversion; BND, breakend (or breakpoint). c, SNV calling is improved using multiple strategies, including machine learning-based scoring and filtering. d, CNV calling uses the multigenome mapping and the SV calling information to make informed decisions; CN, copy number. e, An additional nine tools targeting specific difficult regions of the genome are included, four of which have not been previously reported; Hap, haplotype; Prop., proportion. f, STR calling is integrated based on ExpansionHunter25. g, A gVCF genotyper implementation to provide a population-level fully genotyped VCF file; msVCF, multisample VCF.To identify SNVs and indels (<50 bp), DRAGEN assembles regions with variants using a de Bruijn graph, which is then input to a hidden Markov model with previously estimated noise and error levels per sample. The output is a (g)VCF file. The SNV caller has key innovations to deal with noise or sequencing errors, including (1) sample-specific PCR noise estimation, (2) correlated pileup errors estimation, (3) consideration of overlapping candidate events and (4) local assembly failures and incomplete haplotype candidates. After the initial variant calling, a machine learning framework rescores calls to further reduce false-positive small variants (both SNVs and indels) and recover wrongly discarded false negatives (see Fig. 1 and Supplementary Information for details).Simultaneously, DRAGEN identifies SVs (≥50 bp genomic alterations) and copy number events (≥1 kbp genomic alterations) using two methods (see Fig. 1 and Supplementary Information for details). For SV calling, DRAGEN extends Manta24 by introducing the following key concepts that substantially improve SV calling: (1) a new mobile element insertion detector for large insertion calling, (2) optimization of proper pair parameters for large deletion calling, (3) improved assembled contig alignment for large insertion discovery, (4) refinements to the assembly step, (5) refinements in the read likelihood calculations step, (6) improved handling of overlapping mates, (7) improved handling of clipped bases and (8) filtering and precision improvements (see Fig. 1 and Supplementary Information for details). For CNV calling, DRAGEN targets 1 kbp and larger variants that cause an amplification or deletion of genomic segments. This CNV caller uses a modified shifting levels model, which identifies the most likely state of input intervals through the Viterbi algorithm (see Fig. 1 and Supplementary Information for details). The CNV caller was also designed to take into consideration the discordant and split-read signals from the SV calling to detect events down to 1 kbp. Furthermore, DRAGEN identifies STR mutations and analyzes known pathogenic genomic regions using a method primarily based on ExpansionHunter25.Some important genes are challenging to genotype due to their high sequence similarity to pseudogenes, repetitive regions and polymorphic nature. To overcome these challenges, DRAGEN integrates nine targeted callers for accurate genotyping of clinically relevant genes (CYP2B6, CYP2D6, CYP21A2, GBA, HBA, LPA, RH, SMN and HLA), of which six of the callers are described here26,27,28,29. In general, DRAGEN uses common SNVs in the population to distinguish gene targets from their paralogous copies to provide copy number estimations for each haplotype. Furthermore, DRAGEN identifies reads that do not follow the general phasing patterns and reports the recombination events that lead to these reads per sample (see Supplementary Information for details on each caller). The CYP2D6 and CYP2B6 genes are important for pharmacogenomics and encode an enzyme that is responsible for metabolizing most commonly used drugs30. The recombination of gene and pseudogene can lead to deletions of part of each copy, generating gene–pseudogene fusions. The variants across CYP21A2 can lead to congenital adrenal hyperplasia31. GBA is an important target gene due to variants that increase the risk for Parkinson’s and Gaucher’s disease and Lewy body dementia32,33. The gene resides in a segmental duplication of 10 kbp with a pseudogene GBAP1. The high sequence homology in GBA/GBAP1 drives homologous recombination and can result in pathogenic gene conversions or CNVs. HLA encodes proteins crucial for immune regulation and response and holds immense importance in research related to autoimmune diseases, organ transplantation and cancer vaccines and immunotherapy34,35. DRAGEN includes a specialized caller to identify HLA class I (HLA-A, HLA-B and HLA-C) and class II (HLA-DQA1, HLA-DQB1 and HLA-DRB1) alleles. Mutations in the HBA genes (HBA1 and HBA2) cause α-thalassemia, an inherited blood disorder characterized by lowered levels of α-globin, a fundamental building block of hemoglobin36. Recurrent homologous recombination can result in 3.8 kbp deletions that create a hybrid copy of HBA1 and HBA2, 4.2 kbp deletions that delete regions that include the HBA2 gene or complete deletion of both. Small pathogenic variants can also be detected within HBA. The LPA gene includes a 5.5 kbp region (KIV-2) whose copy numbers (between 5 and 50+) are inversely related to cardiovascular risk37. DRAGEN can report phased copy numbers for this region29. For RHD/RHCE (RH blood type), copy number predictions can be used to assess the risk of Rh allosensitization38. Another integrated caller identifies CNVs across SMN1 and SMN2, which can indicate spinal muscular atrophy27.The genome-wide simultaneous assessment for SNVs, indels, STRs, SVs and CNVs together with reporting the results from these specialized callers takes approximately 30 min of computation time with an onsite DRAGEN server for a 35× WGS sample. This results in a gVCF file for SNVs and indels, a VCF file for each STR, CNV and SV calls and tabular formats for the specialized gene callers (Fig. 1).Thus, the DRAGEN pipeline is able to capture the entire range from single variants to larger variations across the entire genome at scale and reports them in standardized VCF files. The algorithms are described in detail in Supplementary Information. This pipeline aims to produce a comprehensive and accurate set of genomic variations across the human genome at scale.Resolving the complete variant spectrum at scale and accuracyWe applied DRAGEN to the HG002 sample, for which multiple benchmarks are available16,39,40,41,42. We identified variants using DRAGEN across a 35× coverage HG002 Illumina NovaSeq 6000 2 × 151 bp paired-end read dataset (Methods). Figure 2a shows the distribution of all small and large variants across the HG002 sample and highlights the ability of DRAGEN to capture the entire variant spectrum. This resulted in ~4.92 million small variant calls that includes 3,956,307 SNVs with a transition-to-transversion ratio of 1.98 and an SNV heterozygous-to-homozygous (HET/HOM) ratio of 1.56. A total of 960,908 indels were discovered with an insertions-to-deletion ratio of 1.00 and HET/HOM ratio of 1.865. For SVs, 13,886 variants (≥50 bp) were identified with 5,901 deletions, 7,174 insertions, 42 duplications, 153 inversions and 616 translocations. Additionally, 1,156 CNVs were identified ranging from 1 kbp to 445 kbp with a deletion-to-duplication ratio of 4.25. DRAGEN detected STR expansions or contractions in 31,370 polymorphic loci out of the 50,069 in the STR catalog (homozygous reference 0/0: 37.33%, heterozygous 0/1: 27.36%, homozygous alternate 1/1: 17.8% and heterozygous genotype composed of two different alternate (ALT) alleles 1/2: 17.5%). Relative to GRCh38, 46.66% (14,636) of HG002 STRs have at least one more copy and 53.34% (16,734) have at least one less copy, thus highlighting all the variant complexities that a single genome represents.Fig. 2: Performance overview of DRAGEN based on GIAB benchmarks.a, Length distribution of small and large variants discovered by DRAGEN (bin sizes used for the plot (from left to right) are 500, 250,150, 50, 150, 250 and 500). b, SNV comparison based on GIAB SNV v.4.2.1. c, SNV call comparisons based on CMRG v.1.0. d, Comparison of SV call performance (insertion and deletion types) based on GIAB SV v.0.6. e, Comparison of CMRG SV call performance (insertion and deletion types) based on GIAB CMRG SV v.1.0. f, CNV caller comparison of DRAGEN compared to CNVnator across different sizes of deletions based on GIAB SV v.0.6. g, Benchmarking of STRs using GIAB v.1.0 and the DRAGEN-specific STR caller. The benchmarking results of the DRAGEN small variant caller are represented in light blue (middle). The recall and F-measure scores were calculated using the GIAB catalog, and the recall* and F-measure* were calculated using the individual catalogs of DRAGEN and GangSTR. Results from Truvari comparisons against tandem repeat benchmarks displayed in the figure are restricted to indels of ≥5 bp (default).Using these results, all the variants were evaluated against the Genome in a Bottle (GIAB) benchmarks and compared to other short-read-based callers (Methods). For SNVs and indels, benchmark v.4.2.1 was used on GRCh38 (ref. 40), but for the SV benchmark (v.0.6) (ref. 39), DRAGEN was run on a GRCh37 version of the pangenome reference. Later, the benchmark was expanded across the challenging medically relevant gene (CMRG) catalog41 (see Methods for details). Over all benchmarks, DRAGEN demonstrated higher accuracy and impressive speed up of the analysis from raw reads to finalized variant calls within 30 min total, which is better than any other existing workflow.We first focused on SNV and indel calling for HG002 and compared its performance to other short-read-based methods43 (GATK44 and DeepVariant45 with BWA46). We further benchmarked the recent pangenome approach Giraffe15. Figure 2b shows the F-measures across SNV and indel results (see Supplementary Table 2 for details). Overall, we observed a clear advantage of SNV identification accuracy relative to other methods. For the overall genome-wide small variant calls, DRAGEN achieved an F-measure of 99.86%, yielding a total of 11,163 errors (2,553 false positives and 8,610 false negatives). Compared to DRAGEN, we observed 2.49 times more errors for DeepVariant + BWA calls (F-measure: 99.64%, 3,695 false positives and 24,090 false negatives), 1.79 times more errors for DeepVariant + Giraffe calls (F-measure: 99.74%, 4,980 false positives and 15,021 false negatives) and 6.07 times more errors for GATK + BWA calls (F-measure: 99.13%, 38,622 false positives and 29,163 false negatives) with the same Illumina sample. This is in part due to the methodologies implemented in the SNV calling and in the subsequent machine learning filtering (see Supplementary Information). We observed improvements for SNV and indel (2–50 bp) variant types. DRAGEN achieved a higher F-measure of 99.86% (SNV) and 99.80% (indel) than GATK + BWA, DeepVariant + BWA and DeepVariant + Giraffe (Supplementary Table 2). Thus, clearly, DRAGEN showed an improved performance on SNVs and indels across the entire spectrum, improving the characterization across samples at scale.We next assessed the performance of variant calling in the CMRG catalog. This GIAB benchmark spans 273 medically relevant genes that are highly repetitive and diverse and were therefore excluded from the genome-wide benchmark12. Many of these medically relevant genes overlap segmental duplications and other challenging properties. There is interest to see if short-read sequencing can be used effectively for detecting variants in these repetitive regions. Moreover, several of these medically relevant genes (for example, KCNE1, CBS, CRYAA, KCNJ18, MAP2K3, KMT2C) are wrongly represented in the GRCh38 reference due to false duplication and collapsed sequence errors23. Corrections to these errors have been incorporated into DRAGEN variant calling. Figure 2c shows the results of the individual SNV callers with respect to F-measure (see Supplementary Table 2 for details on the evaluations). For both SNV and indel calls, DRAGEN (F-measure: 98.64%) was better than GATK (95.84%), DeepVariant + BWA (97.32%) and DeepVariant + Giraffe (98.10%). These improvements are present in both SNVs and indels (Supplementary Table 2), thus outperforming the other methods with 13,931 variants across the genome and 509 variants in CMRG regions, which are only identifiable by DRAGEN. We further investigated if this performance differed between exonic and intronic regions. For the exonic regions, DRAGEN achieved an F-measure of 99.78%. For intronic and intergenic regions, the achieved F-measures were 99.87% and 99.85%, respectively. Similarly, variant calling performance was evaluated on exonic and intronic regions using the GIAB CMRG benchmark set. DRAGEN achieved F-measures of 98.97% and 98.66% on exonic and intronic regions, respectively.In addition to the clear improvements of DRAGEN for SNVs (Fig. 2b,c), DRAGEN’s performance across SVs (>50 bp) was also improved. The DRAGEN results were compared to SV calls reported by Manta24, Delly47 and Lumpy48 (Fig. 2d,e and Methods). For insertions, which are often the hardest for SV callers7, DRAGEN achieved an F-measure of 76.90%, which more than doubled the performance of Manta (34.90%) and Delly (4.70%; Lumpy did not report any insertions). This is due to multiple algorithmic innovations in DRAGEN (Supplementary Information). Similarly, DRAGEN achieved a better F-measure (82.60%) for deletions (50+ bp) than Manta (70.80%), Delly (68.30%) and Lumpy (66.80%). Supplementary Table 3 contains details across the SV types. DRAGEN’s performance was also compared for SV detection on CMRG regions. DRAGEN again outperformed the other variant callers with F-measures of 63.50% and 68.00% for insertion (Fig. 2d) and deletion (Fig. 2e) types, respectively. This showcases the ability of short reads to detect SVs with high accuracy even in repetitive regions.DRAGEN also reports CNVs, which include larger deletions and duplications. Here, CNVs are adjusted for the called SV to improve breakpoint accuracy where possible (see Supplementary Information). The performance was compared to that of the CNVnator49 copy number discovery tool and benchmarked using the >1 kbp deletion SV records from the GIAB SV benchmark set (Fig. 2f). For CNVs with lengths in the range of 1–5 kbp and 5–10 kbp, DRAGEN performed substantially better, with F-measures of 92.60% (versus 39.20% for CNVnator) and 96.60% (versus 61.80% for CNVnator), respectively. For CNVs with lengths in the range of 10–20 kbp, 20–50 kbp and >50 kbp, similar performances by DRAGEN (F-measures of 94.10%, 95.20% and 100.00%, respectively) and CNVnator (97.60%, 94.90% and 99.00%, respectively) were observed. Supplementary Table 4 contains all the benchmarking results.Similar to SVs, STRs are often challenging to resolve due to their repetitiveness and complexity50. The accuracy of STR detection by DRAGEN was evaluated using the GIAB tandem repeat benchmark dataset (GIABTR) v.1.0 (ref. 50) and Truvari51. For the STR caller, we assessed two catalogs that are available in DRAGEN that differ in the number of STR loci analyzed. The first catalog consists of 50,069 regions where the F-measure (19.68%) was largely driven by the small size of the catalog compared to the 1.7 million regions represented in GIABTR, which impacts recall. Nevertheless, the precision was high at 95.47%. When using the larger STR catalogs available in DRAGEN, which include 174,300 regions, the F-measure improved to 55.13% with the same precision. To provide context to these results, we benchmarked another short-read caller, GangSTR52, and compared its performance to DRAGEN’s performance (Fig. 2g and Supplementary Table 5). Because GangSTR is optimized for a different set of 832,380 regions, we evaluated performance on the intersection of both methods’ analyzed regions against GIABTR (~174,000; Methods). When restricting the benchmark to the intersection between the two catalogs, DRAGEN achieved a better F-measure of 96.72% (versus 69.86% by GangSTR). When we extended the benchmark to cover all DRAGEN catalog regions, DRAGEN’s F-measures for ~50,000 and ~174,000 catalogs were 94.56% and 94.47%, respectively, whereas GangSTR achieved an F-measure of 62.55% (shown as recall and F-measure in Fig. 2g). Given that this benchmark includes mostly small variants within tandem repeat loci, both the DRAGEN small variant caller and DRAGEN STRs were evaluated against the benchmark. The DRAGEN small variant caller achieved an F-measure of 93.7% when compared to the entire GIABTR benchmark and an F-measure of 87.9% when restricting the comparison to indels larger than 5 bp, indicating that DRAGEN can accurately detect SNVs and small indels in tandem repeat regions and that such events represent the majority of tandem repeat variants in any given genome. The true-positive and true-negative overlaps between the STR caller and small variant caller were 93.67% and 99.49%, respectively, for ~174,000 catalog regions (Supplementary Fig. 1).Last, the performances of all targeted callers were evaluated using orthogonal datasets (Table 1), and DRAGEN calls showed high concordance for all results. The callsets for HG002 were analyzed for each targeted caller. There are two pharmacogenomics-related methods that assess CYP2D6 and CYP2B6 alleles. For HG002, the caller reported *1/*U1;*2/*5 star alleles for CYP2B6 and *2/*4 for CYP2D6. The *1 and *U1 alleles in the first genotype represent the reference allele and specific variant in the gene that has reduced enzyme activity, respectively. Similarly, the second genotype, *2/*5, indicates that the HG002 sample may carry two different variants of the CYP2B6 gene, which reduces enzymatic activity. The CYP2D6 caller for HG002 generated *2/*4 star alleles, which indicates that the sample carries two haplotype variants that are also associated with reduction in the enzyme activity of the gene. The methods for HBA1/HBA2 (α-thalassemia) reported no detected target variants. The CYP2D6 and CYP2B6 callers had 99.3% and 92.1% concordance of star allele calls, respectively, against orthogonal calls in a cohort of samples characterized by Get-RM53. The CYPB26 caller results were also concordant against the long-read-based (PacBio) calls (Table 1). DRAGEN HLA typing on sample HG002 revealed HLA-A*01:01, HLA-A*26:01, HLA-B*35:08, HLA-B*38:01, HLA-C*04:01 and HLA-C*12:03 class I alleles and HLA-DQA1*01:05, HLA-DQA1*03:01, HLA-DQB1*03:02, HLA-DQB1*05:01, HLA-DRB1*10:01 and HLA-DRB1*04:03 class II alleles. The class I genotyping results were perfectly concordant with other callers, such as HLA-LA54, HISAT-genotype55, T1K56 and HLA*ASM57. For HLA class II, the results from DRAGEN were also highly concordant with those from other tools (HLA*ASM gets one allele wrong on HLA-DQA1) except for the HLA-DRB1 allele (improvement pending in DRAGEN). We also compared the Optitype results from 164 samples from the 1kGP samples, and the overall accuracy of DRAGEN (98.27%) was similar to that of Optitype (98.43%; Supplementary Table 6).Table 1 DRAGEN targeted callers and performance summaryFor the SMN caller, HG002 had ‘negative’ affected status and carrier status, 0 copy numbers of SMN2Δ7–8 (deletion of exons 7 and 8) and 3.77 estimated total copy numbers, indicating four haplotypes across the two genes. Benchmarking of the SMN caller showed that 99.8% of SMN1 and 99.7% of SMN2 copy number calls agreed with orthogonal methods (MLPA/droplet digital PCR).The GBA caller assesses GBA and GBAP1 recombinations and variants that can be important for neurological diseases28 and reports whether the sample is biallelic or not for pathogenic variants, the total copy number, and carrier status. For the HG002 sample, DRAGEN reported four total copy numbers and ‘False’ for both ‘is_bi-allelic’ and ‘is_carrier’ fields. Benchmarking showed that GBA calls were 100% concordant with targeted long-read-based methods (ONT) across 42 samples with diverse variant types.The LPA caller assesses LPA copy number status, which provides important information on cardiovascular disease risk29. Interestingly this method provides phasing information for approximately 50% of the samples. HG002 has 39 KIV-2 LPA repeats with allele-specific (alleles 1 and 2) copy numbers of 25 and 14, respectively. These methods are highly specialized for their individual targeted regions of the genome and report important allelic information rather than variants (for example, a single SNV). The benchmark results showed 98.2% and 99.7% correlation between DRAGEN-estimated KIV-2 total copy numbers and allele-specific copy numbers respectively, against KIV-2 copy numbers extracted from Bionano optical mapping across 154 samples. Supplementary Table 6 contains the descriptions of the callers and results for the HG002 sample.Because STR, SV and CNV calls each cover a broad range of variant lengths, it is possible for a single variant to be present in more than one result. Therefore, we developed a procedure to combine DRAGEN STR, SV and CNV calls together to form a comprehensive deduplicated large variant VCF file using Truvari51. The merge procedure analyzed a total of 55,414 variants for HG002 and identified 993 redundant variant representations. To establish the accuracy of the merging, the variants that are labeled SV were extracted from the merged file, and benchmarking was performed using the GIAB SV (v0.6) callset. The benchmarking results of the original SV calls were compared with the benchmarking results after merging and were found to be nearly identical, with only 36 variant representations altered enough to change their benchmarking status (Supplementary Fig. 2).Benchmarking the DRAGEN pipeline shows that it produces accurate results that improve variant performance across all variant types and lengths. The pipeline generates a fully comprehensive representation of a human genome, including all variant types at scale and cost.Improving variant identification across the human populationWith the performance of DRAGEN on HG002 characterized, we next applied the pipeline to other standard GIAB reference samples to assess the accuracy and comprehensiveness of DRAGEN across multiple ancestries. These samples include the HG001 (NA12878) sample, the parent samples of AshkenazimTrio (HG003 and HG004) and the ChineseTrio samples (HG005, HG006 and HG007). Figure 3a shows an overview of the results across variant types and size regiments. An average of 4,894,415 small variants were detected per sample with an average of 3,952,885 SNVs and 941,530 indels per sample. A balance (ratio: 0.999) between small insertions and deletions was observed. The mean SNV transition-to-transversion ratio was observed to be 1.98, and the total HET/HOM ratio was observed to be 1.49. For SVs (≥50 bp), the mean SV count per sample was 14,734, with a range between 14,093 and 15,224 per individual. Across samples, insertions (mean: 48.78%) were the most frequently occurring SV type, followed by deletions (mean: 39.10%), translocations (mean: 5.20%), inversions (mean: 1.37%) and tandem duplications (mean: 0.36%; Supplementary Table 7). This follows the expected distributions of insertions being the most frequent variant type, which is typically not observed by other Illumina-based methods7. DRAGEN calls other variants, such as CNVs, STRs and variants for some complex and medically relevant genes. On average, 632 CNVs per sample (range between 583 and 718) were detected, with lengths between 1 kbp and 500 kbp (Supplementary Table 7). The STRs were detected across 50,069 loci, including 62 known pathogenic loci for each sample. Across the samples, an average of 13,690 heterozygous and 8,901 homozygous STR variant calls were identified.Fig. 3: Performance overview of DRAGEN for samples HG001–HG007.a, Length distribution of different variants for all samples (bin sizes used for the plot from left to right are 500, 250, 150, 50, 150, 250 and 500). b, Recall, precision and F-measures of DRAGEN for samples HG001–HG007. c, Comparison of false-negative (FN) and false-positive (FP) numbers among GATK and DeepVariant (DV) with BWA, DeepVariant with Giraffe and DRAGEN (DRAGEN 4.2) for HG001–HG007 SNV calls. d, Comparison of recall, precision and F-measures of samples HG001–HG007 for four different tools, that is, DRAGEN, GATK and DeepVariant with BWA and Giraffe with DeepVariant. The box plots display the minimum, maximum, median and spread of the middle 50% of the data (the interquartile range (IQR)), with whiskers indicating the range of the data within 1.5× the IQR and points beyond the whiskers representing outliers. e, Average F-measures and errors (false positives and false negatives) for different tools.DRAGEN performance was then evaluated against the GIAB v.4.2.1 benchmarks for samples HG001–HG007 for SNVs and indels40. The recall for genome-wide calls was in the range of 99.76% and 99.87% with precision between 99.90% and 99.93% (Supplementary Table 7). For SNVs and indels, the mean F-measures were 99.80% and 99.87%, respectively (Fig. 3b). This shows a remarkably high consistency across all samples in the performance to identify SNVs and indels. DRAGEN SNV call performance was then compared to that of GATK and DeepVariant calls with BWA and Giraffe15 mapper using the GIAB benchmark for all samples (Fig. 3c,d and Methods). Across all callers and samples, the F-measure was shown to be below that of DRAGEN (GATK: 99.10% to 99.28%; DeepVariant + BWA: 99.61% to 99.71%). The higher F-measure is largely attributed to improved detection of SNVs and indels (Supplementary Table 7). Benchmarking across all seven samples (HG001–HG007) allows further assessment of the ability of DRAGEN to use a pangenome reference. Figure 3c shows the accuracy of DRAGEN compared to the accuracy obtained by aligning on the HPRC reference pangenome with Giraffe15 and variant calling with DeepVariant45, the BWA58 with DeepVariant pipeline and the GATK pipeline. Compared to GATK + BWA, DRAGEN showed an average error reduction of 82.88% on combined SNVs and indels, with an average reduction of 83.95% on SNVs and 76.19% on indels. Compared to DeepVariant + BWA, DRAGEN showed an average error reduction of 60.07% on combined SNVs and indels, with an average reduction of 62.40% and 46.46% on SNVs and indels, respectively, confirming the trend observed in the previously reported precisionFDA V2 samples59. Compared to Giraffe + DeepVariant, DRAGEN reported an average error reduction of 44.33% on combined SNVs and indels, with an average of 45.57% on SNVs and 39.19% on indels. Furthermore, we evaluated the effect of the pangenome reference on DRAGEN variant calling performance. On average, the pangenome reference reduced the error by 54.20% for samples HG001–HG007 for SNVs and indels, with an average reduction of 57.74% on SNVs and 29.52% on indels (Supplementary Table 7).Because these samples are trios (Ashkenazim (HG002, HG003 and HG004) and Chinese (HG005, HG006 and HG007)), the variant calling was further validated based on Mendelian inconsistencies. The percentages of genotypes at which a trio had ‘no missingness’ and ‘no Mendelian error’ for DRAGEN were found to be 97.70% and 96.58% for the AshkenazimTrio and ChineseTrio samples, respectively, when genome-wide analysis was performed. For the DeepVariant (with the BWA-MEM mapper) callsets, the rates were 97.34% and 96.95% for HG002–HG004 and HG005–HG007, respectively. Of note is that DeepVariant joint callsets had 23,838 and 37,003 more variants with missing genotypes than the DRAGEN joint callset for the Ashkenazim and Chinese trios, respectively. When considering GIAB high-confidence regions that encompass 88.43% of the genome and excluding certain complex segmental duplications and centromere regions, the ‘no missingness and no error’ for DRAGEN improved slightly with 99.85% and 99.67% for the respective trios. The DeepVariant results also showed similar performance with 99.85% and 99.75% for these two trios, respectively. Furthermore, DeepVariant joint calls had 8,065 and 20,833 more variants than DRAGEN for the respective trios. The observed de novo variant rate for both trios on the DRAGEN callset was 0.05% (Methods and Supplementary Table 7).Comprehensive variant detection at population scale using DRAGENWe next applied DRAGEN to discover variants in the well-studied high-coverage 1kGP22,23 samples and analyze the catalog of genomic variation at population and cohort levels. The 1kGP samples consist of a total of 3,202 samples from 26 different populations of five different ancestry (that is, superpopulation) groups: African (AFR), European (EUR), South Asian (SAS), East Asian (EAS) and American (AMR). This collection of samples contains 1,598 (49.91%) men and 1,604 (50.09%) women. The AFR samples have the highest number of samples (n = 893, 27.89%), followed by the EUR (n = 633, 21.64%), EAS (n = 601, 18.77%), SAS (n = 585, 18.27%) and AMR (n = 490, 15.30%) samples. Recently, the low-coverage (7.4×) WGS datasets22 of 2,504 samples in 1kGP have been expanded to 3,202 high-coverage (35×) datasets60. We analyzed the 1kGP samples with DRAGEN to compare with the recently published SNV callset60 with GATK and SV callset with a combination of three tools (GATK-SV61, svtools62 and Absinthe63). Analysis with DRAGEN showed an improved performance of variant callings in terms of novel small variants (that is, SNVs and indels) and SVs60.For this analysis, it is important to have accurate single-sample calling methodologies and methods that combine VCF files from multiple individuals. It is also important to be able to annotate the variants rapidly and accurately. To accomplish this, a new gVCF merge method for SNVs and indels was implemented (see Supplementary Information). We used Truvari to combine STRs, SVs and CNVs together. This resulted in two-population merged VCF files, one for small variants (that is, SNVs and indels) and one for larger variant classes.For small variants (<50 bp), the DRAGEN Iterative gVCF Genotyper can efficiently aggregate hundreds of thousands to millions of gVCFs to perform joint calling and genotyping. This generates a fully genotyped population VCF file, which is needed for any genome-wide association studies, rare variant studies, phasing and imputation and ancestry studies. The output population VCF file also contains cohort-level variant statistics (including allele frequency, sample genotype rate and coverage rate) and quality control metrics (such as Hardy–Weinberg test P value and inbreeding coefficient) that can be used for downstream variant filtering (see Supplementary Information for details). Before aggregation, variants with DRAGEN machine learning quality scores below a threshold of QUAL = 3 were filtered. The joint callset had an average per-sample SNV recall of 99.92%, precision of 99.78% and F1-measure of 99.85% and indel recall of 99.84%, precision of 99.71% and F1-measure of 99.77%, as evaluated based on GIAB samples. The aggregation of over 3,202 samples took approximately 2 h on an Illumina Phase4 server with a concurrency of 200 jobs.For STRs, SVs and CNVs, the variants were first combined on a per-individual level to remove redundant variant representations across types using Truvari51. Truvari compares the alleles and sizes together with the location and the type of variant event (for example, deletions versus insertions). Supplementary Fig. 2 shows this across sample HG002 with remarkably similar performance values on SVs only and merged STR, SV and CNV results. After this first step per individual, individuals at the population level were merged.Population-level SNV and indel identificationWe applied DRAGEN across 3,202 high-coverage (35×) 1kGP samples to perform comprehensive variant calls (SNVs, indels, SVs, STRs and CNVs) and demonstrate scalability. The variants were analyzed, and the results were compared to published results60. At the cohort level, DRAGEN identified 116,346,215 SNVs and 24,979,420 indels. The principal component analysis (PCA) plot (Fig. 4a) for the small variants at the cohort level showed distinct clusters for different populations, which demonstrates shared genetic ancestry among samples. The distribution of SNVs and indels at the population level showed that the AFR superpopulation has the highest number of SNVs and indels (Fig. 4b,c) due to the higher diversity of the AFR group, but this is also likely impacted by the high number of AFR samples in the cohort (Supplementary Table 8). The average SNVs per sample ranged from 3,930,793 (EUR) to 4,771,879 (AFR) and followed expected diversity60. The number of small insertions (<50 bp) for the EAS group (521,068) was the lowest and was the highest for the AFR group (626,296). This was interestingly inverted when small deletions (<50 bp) were assessed. The highest proportion of singletons (28.7%) was observed in the AFR population, which also follows previous findings. However, the EAS population had the highest mean number of singletons (that is, ratio of total singletons for a population and the number of samples) compared to other populations.Fig. 4: DRAGEN SNV calls for the 1kGP sample.a, PCA plot of principal component 1 (PC1) and PC2 for SNVs across the 1kGP population. b, Distribution of SNV counts. ASW, African Ancestry in South-West USA; ACB, African Caribbean in Barbados; BEB, Bengali in Bangladesh; GBR, British from England and Scotland; CDX, Chinese Dai in Xishuangbanna, China; CLM, Colombian in Medellín, Colombia; ESN, Esan in Nigeria; FIN, Finnish in Finland; GWD, Gambian in Western Division – Mandinka; GIH, Gujarati Indians in Houston, Texas, USA; CHB, Han Chinese in Beijing, China; CHS, Han Chinese South; IBS, Iberian populations in Spain; ITU, Indian Telugu in the UK; JPT, Japanese in Tokyo, Japan; KHV, Kinh in Ho Chi Minh City, Vietnam; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; MXL, Mexican Ancestry in Los Angeles, CA, USA; PEL, Peruvian in Lima Peru; PUR, Puerto Rican in Puerto Rico; PJL, Punjabi in Lahore, Pakistan; STU, Sri Lankan Tamil in the UK; TSI, Toscani in Italy; YRI, Yoruba in Ibadan, Nigeria. c, Distribution of indel counts at the superpopulation level of 3,202 1kGP samples. The box plots display the minimum, maximum, median and spread of the middle 50% of the data (the interquartile range (IQR)), with whiskers indicating the range of the data within 1.5× the IQR and points beyond the whiskers representing outliers. d,e, Singleton (allele count (AC) = 1), rare (allele frequency (AF) ≤ 1%) and common variant (allele frequency > 1%) counts of GATK v.4.1 and DRAGEN v.4.2 callsets of SNVs (d) and indels (e) across the cohort level. The known and novel variants are based on the dbSNP 155 database. f, Distribution of SNVs based on their functional annotations shown on the top and bottom showing the fraction of known and novel variants; miRNA, microRNA; UTR, untranslated region. g, Distribution of small indels based on their functional annotations. NMD, nonsense-mediated decay.The allele frequency-based analysis on 2,504 unrelated samples showed that DRAGEN generated 55,327,091 (52.05%) singleton, 38,210,741 (35.68%) rare (allele frequency ≤ 1%) and 13,163,982 (12.27%) common (allele frequency > 1%) SNVs. Compared to a previous GATK callset on these samples, it generated 2.15% more singletons, 0.98% fewer rare and 1.63% fewer common SNVs (Fig. 4d,e). For indels, DRAGEN generated 7,140,867 singleton, 8,237,880 rare and 4,229,692 common indels, whereas the GATK callset had 56.18% fewer singleton (3,129,240), 30.48% fewer rare (5,727,021) and 4.88% fewer common indel variants (4,023,422). Using the Illumina Connected Annotations (ICA)64 pipeline (see Supplementary Information), the variants detected by both the DRAGEN and GATK callsets were compared to known SNVs (dbSNP build 155) to determine which variants were previously observed (that is, known) or previously unobserved, to our knowledge (that is, novel). The majority of SNVs (88.98%) from DRAGEN were known, and 72.74% of indels were known variants. The singleton rate of known variants was 52.05% of SNVs and 36.42% of indels (Supplementary Table 9).Although most SNVs and indels were rare, the novel rate of indels with a functional impact was between 9% and 15% across samples, whereas the SNV novel rate was between 1% and 3%. Specifically, among SNVs with functional impact, DRAGEN called 706,355 missense SNVs (33.13% rare and 1.75% novel), 438,735 synonymous SNVs (36.59% rare and 1.18% novel) and 195,345 SNVs with other higher functional impact, including stop/start–gain/loss and splice mutations (33.35% rare and 3.17% novel). For indels with a functional impact, DRAGEN called 24,047 frameshift indels (27.40% rare and 12.60% novel), 22,079 in-frame indels (38.78% rare and 8.28% novel) and 30,978 indels with other higher functional impact, including stop/start–gain/loss and splice mutations (39.04% rare and 10.07% novel; Fig. 4f,g and Supplementary Table 10). We compared the functional annotations of the DRAGEN callset with those of the GATK callset (Fig. 4f,g). In the intronic, intergenic and regulatory regions, more SNVs and indels were called by DRAGEN than by GATK. In these annotation categories, the percentage of rare and novel variants (in particular indels) was higher in DRAGEN than in GATK. In annotation categories with low to high functional impact, DRAGEN called fewer missense, synonymous and functional impact SNVs. The percentage of rare SNVs was higher and the percentage of novel SNVs was lower in the DRAGEN callset. Frameshift indels and indels with a functional impact were higher in DRAGEN and were found to have a lower allele frequency than GATK. The novel rate was similar between the two callsets but varied between categories due to overall lower numbers of indels in these categories.The larger number of singletons and novel small variants (<50 bp; SNVs and indels) could highlight DRAGEN’s increased ability to assess repetitive regions of the genome, which is enabled due to the multigenome mapping implementation (see Supplementary Information). To answer this, we first focused on the CMRG regions that are important for clinical analysis. We analyzed the variants identified by DRAGEN in 389 challenging gene regions and compared them to the previous GATK-based results. DRAGEN identified 1,125,183 (0.80% of total) variants in those regions. This is similar to the GATK results of 1,146,580. Next, we investigated if DRAGEN accurately captures the variants in 12 medically important genes that are underrepresented on GRCh38 (ref. 23). These 12 genes comprise 9 that are incorrectly duplicated and 3 that are incorrectly collapsed (for example, two instead of three copies). These regions include the genes KMT2C, H19, MAP2K3, KCNJ18, KCNE1, CBS, U2AF1, CRYAA, TRAPPC10, DNMT3L, DGCR6 and PRODH. For the nine genes that are incorrectly duplicated, DRAGEN was able to circumvent this bias and reported 30,306 variants. This is in stark contrast to the GATK callset, which reports almost 19.98% fewer (24,249) variants. As an example, for CBS, related to cystathionine β-synthase deficiency65, only 221 variants were reported across 1kGP in a previous study60. DRAGEN reported 1,196 variants in the callset due to the use of the multigenome mapping. For the H19 gene, which is related to skeletal muscle disease66, DRAGEN found 335 variants; however, GATK found no variants. For genes that were impacted due to collapsed errors, we expected an inflated number of variants due to multiple haplotypes collapsing on top of each other23. For MAP2K3, which is related to skin and liver diseases67, and KCNJ18, which is related to other rare diseases68, GATK discovered 487 more variants than DRAGEN, which are likely false positives23 (Supplementary Table 11).Unification of large alleles across 3,202 individualsWe next investigated the larger variants identified by DRAGEN encompassing STRs (50,069 regions), SVs and CNVs. As described above, we merged all large variant types across the samples into one population VCF file. We identified 409,033 STRs (243,083 expansions (that is, the reference had fewer repeat units) and 165,950 contractions (that is, the reference had more repeat units)), 1,013,541 SVs (with 200,713 deletions, 450,581 insertions and 28,574 tandem duplications and 333,673 other types of SV) and 9,216 CNVs (5,322 deletions and 3,894 duplications) across the entire 1kGP dataset (Supplementary Table 12). We first performed a PCA to investigate if these calls followed the expected population structure. Figure 5a shows the PCA colored by superpopulation. Overall, we observed a clear separation following the population structure in PC1 and PC2. The large variant PCA had a highly similar structure as that observed in the small variant PCA. The stratification is likely also driven by the higher variant numbers that we observed across the AFR population than observed across the other ancestries, which is also similar to the structure that we observed in the small variant PCA. Figure 5b,c shows the distribution of insertions and deletions per population. Across all SV types, we observed the expected distributions of variant counts with a slight increase of insertions over deletions. Although it remains challenging to identify insertions from short reads, we observed the relatively high numbers of DRAGEN insertions obtained following the general population structure. Figure 5d shows the average number of SVs per individual for each population. Interestingly, although we observe increases in insertions and deletions in the AFR population compared to other populations, the same was not observed for duplications or inversions.Fig. 5: DRAGEN SV calls for the 1kGP sample.a, PCA of merged STRs, SVs and CNVs of 3,202 1kGP samples for deletions with a minor allele frequency of ≥5%. b,c, Distributions of insertion- (b) and deletion-type (c) SVs (≥50 bp) across 3,202 1kGP samples. d, Distribution of SVs, STRs and CNVs based on average count, that is, total variants of a population/population count; TRA, translocations. e, Distribution of variant numbers among all 3,202 samples for the 12 CMRG regions (in GRCh38) that are impacted due to false duplication and falsely collapsed errors. DRAGEN uses the corrected reference as a part of its multigenome approach to correctly identify more variants in duplicated and collapsed regions. f, Class I HLA allele frequency distributions among 3,202 1kGP populations. The box plots in b, c, e and f display the minimum, maximum, median and spread of the middle 50% of the data (the IQR), with whiskers indicating the range of the data within 1.5× the IQR and points beyond the whiskers representing outliers.We next investigated how many of these variants have been identified previously19,60. For this task, we used ICA to annotate variant intersections to 1kGP, gnomAD and TOPMed. Across all variants, we observed 616,371 known variants and 1,035,684 novel variants. Supplementary Table 13 contains the distribution based on allele frequency. To cross check the consistency of the dataset, we correlated the allele frequency of the callsets for the overlapping variant calls. We observed a positive correlation (Pearson correlation coefficient = 0.999, P = 0.0) between allele frequency and the count of variants from the 1kGP database. We next checked for the overlap of variants with exonic, intronic and intergenic regions. A total of 31,223 SVs (7,058 deletions, 9,932 insertions and 14,233 other types of SV) overlapped exonic regions, 558,706 overlapped intronic regions (254,483 insertions, 108,601 deletions and 195,622 other types of SV) and 423,612 overlapped intergenic regions (186,166 insertions, 85,054 deletions and 152,392 other types of SV).We obtained further insight into SV diversity along medically important genes. As the 1kGP samples represent healthy individuals, their SVs could be used as controls in studies aiming to identify potentially pathogenic variants. We compared DRAGEN SVs to results that were recently published60 from a joint calling ensemble approach (GATK-SV61, svtools62 and Absinthe63). Across the 5,030 CMRG regions, DRAGEN identified 338,108 variants (268,850 SVs, 66,371 STRs and 2,887 CNVs). The SV callset that was published in recent studies reported only 27,166 SVs, with 8,093 insertions, 13,506 deletions and 5,567 other types of SV. DRAGEN discovered 40,810 more deletions and 90,693 more insertions. Within these medically relevant genes, there are 12 genes that are often negatively impacted by reference bias23. As mentioned before, some genes suffer from wrongly collapsed copies, which leads to an increased number of variants23. On the other hand, there are several genes that have been wrongly reported multiple times across the genome, which often leads to missing variant calls due to their repetitiveness23. For the duplicated and collapsed regions, a total of 101 and 188 large variants were identified by DRAGEN, and the majority of them are SVs (98.02% and 96.86%). By contrast, the previous study only reported 36 SVs in collapsed and 18 SVs in duplicated regions across the entire 1kGP sample. At the cohort level, on average, each individual had 11 variants that were identified in the erroneous regions. For the AFR population, the average number of variants was 13, and for other populations it was between 9 and 10 variants per sample. The distributions of the total number of variants by DRAGEN at the duplicated erroneous regions are higher than the numbers reported in the previous studies, and the numbers are lower in the collapsed regions. This shows the improvement of variant calling by DRAGEN that incorporates the corrected regions during variant calls (Fig. 5e and Supplementary Table 14). A lower number of variants is expected in the collapsed erroneous regions if the corrected reference is used as these erroneous regions in the original GRCh38 reference with more than one copy are collapsed into one.Insights across complex medically relevant genes for 1kGPLast, we investigated results from the DRAGEN specialized gene callers (for example, CYPB26, CYPD26 (ref. 26), GBA28, HLA and SMN1 and SMN2 (ref. 27)) to obtain deeper insights into potential preconditions across the 1kGP dataset. Furthermore, this dataset can be leveraged as population controls for these important but complex genes.For the CYP2B6 caller, 2,017 samples had genotypes containing two haplotype-specific star alleles (filter status PASS), 1,174 samples had more than one possible genotype, and 11 samples (10 AFR and 1 EUR) had no calls reported. The metabolizer status reported in these calls showed that among samples with PASS filter, 1,189 were normal metabolizers, 381 were poor metabolizers, 154 were rapid metabolizers, 7 were ultrarapid metabolizers, 224 were intermediate metabolizers, 57 had indeterminate status, and 1,190 had none status. 858 samples had the *1/*1 genotype, and among the samples with multiple genotypes, 945 of them had genotype ‘*1/*6;*4/*9’.For CYP2D6 calls across all samples, only two samples (one EUR and one SAS) had more than one possible genotype. There were 12 with no calls (2 AFR, 1 SAS, 6 EAS and 3 AMR), and the remaining 3,188 samples had one genotype with two haplotype-specific star alleles. The metabolizer status showed that 1,556 samples had normal status, 781 had intermediate status, 66 had poor status, 107 had ultrarapid status, 149 had indeterminate status, and 543 had none status.The results of the CYP21A2 caller showed that the total copy numbers were in the range between two and eight, with 58.87% of samples having a copy number of four. For copy numbers greater than or equal to five, there were 687 (21.46%) samples, and 55.31% of samples were from the Asian population (EAS or SAS). It was also reported that 24 samples had copy numbers of less than or equal to three, and a deletion recombinant variant was detected in CYP21A2.For the GBA28 caller that detects both recombinant-like and non-recombinant-like variants in the GBA gene, DRAGEN reported no samples with any presence of a recombinant-like variant on each chromosome (homozygous variant or compound heterozygous). However, it reported 13 samples (3 AFR, 5 EUR, 1 AMR, 1 SAS and 3 EAS) with the presence of a recombinant-like variant on only one chromosome. The reported total copy number values showed that the majority of samples (95.47%) had aggregate copy numbers of four across genes and pseudogenes. Only 16 samples had an aggregate copy number of three, and the remaining 129 samples (101 AFR, 1 EUR, 6 AMR, 6 EAS and 15 SAS) had aggregate copy numbers in the range between five and ten. GBA reported only one sample (of EAS) that had one deletion breakpoint in the GBA gene, which indicates whether the sample has one of the recombinant-like deletion variants.For the SMN caller, DRAGEN reported spinal muscular atrophy affected status as ‘false’ for all samples and spinal muscular atrophy carrier status as ‘true’ for 49 (1.53%) samples (3 AFR, 19 EUR, 12 AMR, 7 EAS and 8 SAS). This is in the range of rates of carriers, which is between 2.50% and 1.67% across the population69. The copy number of SMN1 was reported to be two for the majority of samples (2,428), and it was not reported for 19 samples (none for SMN1_CN). For SMN2, 395 samples had zero copies, 1,275 had a copy number of one, 1,427 had a copy number of two, 86 had three or four copies and 19 had no reported copy number.The DRAGEN HLA caller reports HLA typing results of six class I alleles (that is, HLA-A1, HLA-A2, HLA-B1, HLA-B2, HLA-C1 and HLA-C2) and reported 60 distinct alleles for HLA-A1, 70 for HLA-A2, 121 for HLA-B1, 132 for HLA-B2, 43 for HLA-C1 and 57 for HLA-C2. For the HLA-A1 type, HLA-A*02:02 was reported to be the allele with the highest allele frequency of 15.8%, followed by HLA-A*11:01 with 11.62%, and the remaining alleles were within 0.03% and 10.06%. For HLA-A2, the allele HLA-A*02:01 was reported to have the highest allele frequency of 13.34%, and all others were within 0.03% and 9.90%. For HLA-B1, the allele HLA-B*07:02 had the highest allele frequency of 6.71%, and the remaining alleles were between 0.03% and 5.78%. For HLA-B2, the HLA-B*35:01 allele had the highest allele frequency of 6.62%, and the remaining alleles were between 0.03% and 5.68%. For HLA-C1, the highest allele frequency of 17.36% was reported for the allele HLA-C*04:01, and the remaining alleles were in the range between 0.03% and 13.46%. Last, for HLA-C2, again the allele HLA-C*04:01 was reported to have the highest allele frequency of 12.05%, and the other alleles were within 0.03% and 8.81%. The allele frequency distribution of HLA type 1 classes among all 1kGP populations is shown in Fig. 5f. Supplementary Table 15 provides details for HLA type counts.The LPA caller reported the total copy numbers for all 3,202 samples and allele-specific copy numbers for 1,507 (47.06%) samples. The highest and lowest estimated total copy numbers were 80.76 and 11.62, respectively, and both were SAS samples. For allele-specific copy numbers, the highest and lowest were found to be 60.25 and 1.99, respectively, for the first haplotype and 47.90 and 1.99, respectively, for the second haplotype. The distribution of total copy number counts showed that 80% of the samples were between 30 and 50.The RH caller results showed that the total copy numbers in the RHD and RHCE regions were in the range between two and five. The majority (63.65%) of the samples had total copy numbers of four, and 29.76% samples had total copy numbers of three. For RHD gene regions, 63.49% of samples had a copy number of two, and 29.92% of samples had a copy number of one. It was also observed that 200 samples had copy numbers of zero, and only 13 samples had a copy number of four. For RHCE regions, 99.63% of samples had a copy number of two, only five samples had a copy number of one, and seven samples had a copy number of three.Thus, here, we have demonstrated the accuracy and scalability of the DRAGEN framework across all variant types. We have demonstrated this across all different variant classes across a wide spectrum of human populations and with a focus on genome-wide and medically relevant genes. This revealed many variants (SNVs and CNVs) that were not detectable in previous studies of this dataset. Furthermore, we can provide this more comprehensive callset together with the results of the specialized callers as a population reference for future studies.

Hot Topics

Related Articles