The GIAB genomic stratifications resource for human reference genomes

The GIAB stratification resource is a publicly available dataset for the human reference genome. Here we describe the extension of this resource to the CHM13 reference genome (Table 1). We also provide an insight to the differences between three reference genomes: GRCh37, GRCh38 and CHM13. Furthermore, we explore three new features for future stratifications of GRCh38 which include variant complexity in tandem repeats, distribution of genomic distance between consecutive variants, and read coverage of each variant. We now automated generating all stratifications to further facilitate the creation across upcoming complete human genomes or other reference genomes that have been annotated with RefSeq, RepeatMasker, segmental duplications, and Tandem Repeat Finder: https://github.com/usnistgov/giab-stratifications.Table 1 Overview of existing and newly-added stratification types for the three reference genomesExtending CHM13 stratificationsCoding sequences (CDS) are the regions of the genome that code for proteins which are usually targeted for many clinical tests6. Using the CHM13v2.0 assembly and available RefSeq annotations10, we were able to extract gene coding regions and compare them with those of GRCh38 and GRCh37. The methods used for generating CDS stratifications for GRCh38 were applied to CHM13v2.01. Figure 1 shows the comparison between the total length of the CDS region and its ratio over chromosome length (excluding unknown bases in the assembly which are noted with N) for the all three references, GRCh37, GRCh38 and CHM13. All three reference genomes have similar CDS coverage across chromosomes. Also, the differences that we see in chromosomes 9, 19, and 22 are likely due to the fact that many new bases were added in CHM13 relative to GRCh38 on these chromosomes as they have large centromeres and heterochromatin that were excluded in GRCh3812.Fig. 1: Statistics of the gene coding sequences (CDS).a Total length of CDS regions for GRCh37, GRCh38 and CHM13. b The ratio of length of CDS regions over chromosome length (excluding unknown bases) for GRCh37, GRCh38 and CHM13.Reference mappability is a metric that can be used to identify whether reads of a given length will align uniquely to that region of the genome (lower means to harder-to-map). The regions that are found to be of low mappability had been previously generated for GRCh37 and GRCh38. We calculated such regions for the CHM13 reference genome at two different stringencies. For moderately low-mappable regions, we permitted up to two mismatches and one INDEL between each 100 bp region and any other region. However, for highly low-mappable regions, we permitted no mismatches or INDELs between each 250 bp region and any other region. Note that 100 and 250 bp correspond approximately to two common lengths used for short-read sequencing, so these regions can be interpreted as those that are difficult to map for short reads. The total length of low mappability regions for each chromosome is depicted in Fig. 2 for all three reference genomes. The total length of the regions in the CHM13 is higher than that of the older references.Fig. 2: Statistics of low-mappability regions in GRCh37, GRCh38 and CHM13.a Total length of low-mappable regions, and b Ratio of total length over chromosome length (excluding unknown bases) for GRCh37, GRCh38 and CHM13.To further explore the differences in hard-to-map regions between references, we plotted intra-chromosomal coverage of all low-mappability regions for CHM13 and GRCh38 (i.e., 100 and 250 bp stratifications together, Supplementary Fig. 1). Generally, each reference had large spikes near the center of each chromosome, corresponding to the centromeres which are known to be repetitive. However, there were some major increases in CHM13 relative to GRCh38. First, we observed large increases in chromosomes 1 and 9; both of these chromosomes have large satellite repeats which would explain this increase. Additionally, we observed large increases in the short arms of chromosomes 13, 14, 15, 21, and 22, which contain large rDNA arrays which are highly repetitive and thus difficult to map14. Finally, chromosome Y showed a large increase for over half the chromosome, which can be explained by the large number of ampliconic regions which were added to CHM1313.Defining high and low GC-content regions (i.e., the fraction of G and C bases is high or low) is important as different sequencing technologies can produce distinct error profiles in GC-rich and AT-rich regions15. This stratification delineates regions with specific amounts of GC content. These are based on the method used for generating standardized GC content BED files by the GA4GH Benchmarking Team and the GIAB. Of note, we consider 10 different ranges from 15 to 85% of GC content with interval length of 5% in addition to two cases of regions with GC content of smaller than 15% and greater than 85%.As an example, we depicted the total length of regions for each chromosome of the human reference genomes GRCh37, GRCh38 and CHM13 with the GC content in the range of 20-25% in Fig. 3a. The ratio of total length of regions over chromosome length (excluding unknown bases) is illustrated in Fig. 3b and d. As we can see, three reference genomes follow a similar pattern except for chromosome 13 in CHM13. Moreover, Fig. 3c shows the total length of regions with GC content higher than 85% for GRCh37, GRCh38 and CHM13. Upon investigating intra-chromosomal coverage of regions with >85% GC content, we observed a large increase in the short arms of chromosomes 13, 14, 15, 21, and 22 in CHM13 relative to GRCh38, corresponding to the rDNA arrays (Supplementary Fig. 2).Fig. 3: Statistics of regions with specific GC content for GRCh37, GRCh38 and CHM13.a Total length of regions with GC content in the range of 20–25%. b Ratio of total length of regions with 20-25% GC content over chromosome length (excluding unknown bases). c Length of regions with GC content higher than 85% in log scale. d Ratio of total length of regions with GC content >85% over chromosome length in log scale (excluding unknown bases).Three challenging, medically-relevant regions within human genomes including the Major Histocompatibility Complex (MHC), variable/diversity/joining (VDJ) and Killer-cell immunoglobulin-like receptor (KIR) are considered here16. These three regions are all highly polymorphic and underpin key immunological functions: the MHC region contains the Human Leukocyte Antigen (HLA) genes which determine “donor matches,” the VDJ regions are randomly recombined to produce the T and B cell receptors in T and B cells, respectively, and the KIR region codes for one of the key effector receptors on natural killer cells. The total length of each region on the three reference genomes, GRCh37, GRCh38 and CHM13, are reported in Table 2. As shown, the regions are located on the same chromosome across different reference genomes with comparable total length.Table 2 Total length of three difficult genomic regions, namely MHC, VDJ and KIR, on reference genomes GRCh37, GRCh38 and CHM13Evaluating the utility of stratifications for benchmarkingWe demonstrated the usefulness of these stratifications for their most common use case which is benchmarking variant caller performance across the genome using a benchmarking tool called hap.py from the GA4GH Benchmarking Team2.First, we assessed the differences between the three references within different region types. To do this we utilized our draft assembly-based benchmark for HG002, which was constructed from the HG002 T2T Q100 v1.0 diploid assembly (see Methods)13. This benchmark is made from a complete, accurate assembly as opposed to the current mapping-based callsets from GIAB. Thus, it includes more difficult regions than were previously available and can be created from the alignments of the assembly to any reference, which makes it well-suited to comprehensively test these three references. We then benchmarked a HiFi-Deepvariant callset (as the query) using the draft assembly-based benchmark (as the truth) and evaluated precision and recall of variants (Fig. 4A). In this figure, the value of each bar is the aggregated Phred-scaled score, and the error bars are the estimated 95% binomial confidence intervals.Fig. 4: Stratifications reveal nuances in precision and recall performance when benchmarking using hap.py.A Performance within important stratifications using assembly-based HG002 benchmark and GRCh37, GRCh38, or CHM13 as reference and a HiFi-DeepVariant query callset. B The CHM13 performance results from (A) compared with the same benchmarking pipeline restricted to nonsyntenic regions relative to GRCh38. C Performance within all autosomes or tandem repeats/homopolymer regions for ONT callsets created with either guppy4+clair1 or guppy5+clair3. D Comparison of HiFi and Illumina callsets on CHM13 using the Q100 benchmark. Each bar is the mean of the given metric which is also shown as text. Error bars are 95% binomial confidence intervals computed with the Wilson method (see Methods). Stratification meaning on y axes: Lowmap = low-mappability regions (100 and 250 bp sizes); High/Low GC = GC content > 25% or > 65%; SegDup = segmental duplications >= 1 kb; TRs = tandem repeats; HPs = homopolymers >= 7 bp or imperfect homopolymers >= 11 bp; Difficult = SegDup+LowMap+HPs+TRs+XY PAR/XTR/Ampliconic+High/Low GC; Autosomes = all autosomal regions; 2-mer TRs = repeats with unit size 2; Short TRs = tandem repeats <50 bp long.Across many stratification categories, CHM13 had a lower score than GRCh38, which in turn also had a lower score than GRCh37. This can be explained by the fact that each new reference progressively added more difficult regions to the human genome as technology improved with time. The largest differences between references were for SNVs in segmental duplications and low-mappability regions, which have been increasingly included in GRCh38 and CHM13. CHM13 corrected some false segmental duplications in GRCh38 and added segmental duplications missing in GRCh38 and GRCh3717, which can cause lower accuracy in GRCh38 and GRCh37. However, these callsets used the GIABv3 refined version of GRCh38 that masks the false duplications and adds decoys for a few missing sequences18, as well as the version of GRCh37 that includes the hs37d5 decoy. In addition, the overall SNV precision and recall for TRs and HomoPolymer (HPs) was lower than some other stratification types such as low-mappability regions (lowmap) and high/low GC. This is likely due to the HiFi platform used to generate the callset, which is known to be more error-prone in these repeat regions1. However, when focusing on our stratification excluding difficult regions, performance was similar across references. Furthermore, the scores for INDELs were lower than SNVs across all stratifications and metrics, which tend to be more challenging than SNVs due to alignment challenges. INDELs also likely bias to lower scores due to hap.py using vcfeval19 for comparing variants, which does not give partial credit for complex INDEL variants that frequently occur in TRs and HPs20. In summary, this shows how stratifications can be used to understand how performance differs across the three references. It also justifies the use of CHM13 over the other previous references, as using GRCh38/37 could result in an inflated score depending on the regions under study.Next, we asked how the new regions (i.e., “nonsyntenic”) in CHM13 relative to GRCh38 specifically impacted performance. The nonsyntenic regions contain 1000 s of variants (Supplementary Fig. 3), many of which probably have biological significance but have been difficult to study without a complete reference such as CHM13. We modified the analysis in Fig. 4A to subset to either the syntenic or nonsyntenic regions prior to benchmarking using the targeted flag in hap.py (see Methods) (Fig. 4B). 93% of the benchmark variants in nonsyntenic regions were in our stratification that contained all difficult regions. For INDELs and SNVs, the difference in Phred score between syntenic and nonsyntenic regions were about 5 and 15 respectively for both precision and recall across all stratifications (nonsyntenic being much lower in general). In summary, this indicated that the changes from GRCh38 to CHM13 indeed included much more difficult regions, and our stratifications enable understanding of the differences between performance metrics across references.Additionally, we used the stratifications to assess the performance improvements of recently published software components in the Oxford Nanopore Technologies (ONT) variant calling pipeline, specifically guppy and clair, which are a base caller and variant caller respectively developed specifically for ONT reads (Fig. 4C)21. The ONT reads were acquired from HG003, and the same reads were used for both versions of guppy/clair (guppy4+clair1 and guppy5+clair5). The experimental setup was similar to that of Fig. 4A and B except that we used the v4.2.1 GIAB HG003 benchmark since the GIAB team has yet to build an assembly-based benchmark for HG003. We observed that in general, there was a substantial performance gain (up to 10 Phred-scaled) between the old and new caller versions, as expected. However, this gain was not uniform. For HPs and/or TRs, the precision/recall metrics were less than that for “Autosomes” (representing a global mean for all autosomes), which in turn was less than stratifications which excluded HPs and TRs. For SNVs in particular, HPs and TRs had modest performance gain for precision but substantial gain for recall. This agrees with previous results that the ONT platform has a higher error rate in homopolymers1. In totality, this shows how the stratifications can be used to benchmark performance of new technologies as they evolve, which can serve as a valuable resource both for those developing these platforms as well as consumers who need to buy or update their workflows.Finally, we used the draft Q100 benchmark with our new stratifications to compare performance metrics for variants called from HiFi and Illumina reads aligned to the CHM13 reference genome and called with DeepVariant (Fig. 4D). While performance in many regions is similar between these HiFi and Illumina DeepVariant callsets, the stratifications highlight a few notable differences. For example, regions with low-mappability (defined for short reads) and segmental duplications experienced lower recall of both SNVs and INDELs for Illumina-DeepVariant. Additionally, recall in low-mappability and segmental duplication regions for HiFi-DeepVariant was lower than the other stratifications at about 85% for INDELS and 90-95% for SNVs, indicating that there is still room for improvement even with long reads. Because CHM13 includes more low-mappability regions and segmental duplications, our stratifications for CHM13 are particularly important. In addition, our stratifications for GC vs. AT homopolymers help highlight how AT homopolymers are similar between the callsets, whereas HiFi-DeepVariant had higher accuracy for SNVs in GC homopolymers and Illumina-DeepVariant had higher accuracy for INDELs in GC homopolymers. Importantly, our stratifications help the community continue to evaluate the performance of long and short read sequencing technologies in different genomic contexts as new technologies and analysis methods are developed.Exploring new features for future stratificationsIn this study we expanded many of the stratifications to CHM13. We also explored possible three new stratifications for GRCh38 in addition to the so far described well-established stratifications (Table 1). We examined variant complexity in tandem repeats, distribution of genomic distance between consecutive variants, and read coverage of each variant.Tandem repeats frequently contain complex variants (e.g., multiple SNVs and INDELs), which can cause errors in variant calling and benchmarking22. Variant callers tend to produce more errors in repeats because variants could be mis-identified among the repeat, particularly if reads are insufficiently long to contain the entire repeat. Notably, such errors are likely to increase when there is more than one variant in the region, because if any variants are filtered incorrectly then all variants in the region can be counted as FPs and FNs due to differences in representation20. It may thus be sensible to create new stratifications for tandem repeats categories by number of variants, where number of variants corresponds to difficulty.To understand how variants were distributed in tandem repeats, we intersected the Q100 variant benchmark of the HG002 sample23 with the GIAB tandem repeat and homopolymer stratification BED files. This variant call format (VCF) file includes both small variants and structural variants (SVs) (except for inversions and translocations), so that we can assess the full spectrum of variants in tandem repeats4. After splitting multiallelic variants, we additionally filtered any variants that overlapped repeat boundaries (~1500 variants), though these would be important sources of complexity to explore in the future. We found that the vast majority (~90%) of the repeats in GRCh38 did not have any variation, but >10,000 tandem repeats contain more than one variant and >1,000 contain more than three variants, resulting in complex variants that can cause challenges in variant calling and variant representation (Fig. 5a).Fig. 5: Distribution of SNV and small INDEL variants within tandem repeats throughout GRCh38 using the HG002 Q100 variant benchmark.a The distribution of the number of variants per repeat. Y-axis shows the number of tandem repeats and x-axis is the number of variants in each tandem repeat. b, c Among repeats with only one variant, the fraction of the variant class by chromosome b and the distribution of intersecting variants classified by type according to repeat length (c) INDEL2; INDELs with length <= 2, INDEL49; INDELs with length > 2 and length <= 49, SNV; single nucleotide variants, SV; structural variants. d Number of variants in tandem repeats, segmental duplications and all other regions. e Variant density in regions of tandem repeats and segmental duplications. f Performance within new stratifications using HG002 Q100 benchmark and an Illumina DeepVariant query callset for tandem repeat regions with different number of variants inside. g Performance within regions with different genomic distance between variants. h Performance within regions with different coverage values for a variant set called from a BAM file with mean coverage of 40×. For (f–h) each bar represents the mean of the given metric. Error bars are 95% binomial confidence intervals computed with the Wilson method (see Methods).We further investigated the distributions of repeat region size and variants inside repeats with only one variant. About 30% of such variants were SNVs, and about 50% were INDELs between 1 bp and 2 bps (Fig. 5b). There were several hundred SVs associated with these repeats as well, indicating that some SVs in repeat regions exist without any smaller variants in the same repeat. This distribution was mostly consistent across chromosomes. Furthermore, we investigated the size distribution of repeats with a single variant according to variant type (Fig. 5c). We found that small INDELs are disproportionately more likely to be in repeats <80 bp long compared to other variant types. Of note, the variant density in tandem repeats is more than four times of that in segmental duplications (Fig. 5d, e).We also assessed the quality of variant calling when variants are in tandem repeat regions. To do so, we considered the Illumina HiSeq X DeepVariant callset as the query and used the HG002 Q100 small variant benchmark as the truth. We stratified the repeat regions based on the number of variants inside. Overall, the results show that more variants in a tandem repeat region corresponds to lower precision and recall in variant calling. In fact, false positive and false negative rates for both SNVs and INDELs are more than 10 times higher in tandem repeats with >10 variants relative to those with a single variant (Fig. 5f). Tandem repeats with a single variant have lower error rates than variants outside tandem repeats on average.Next, we explored the distribution of genomic distance between consecutive variants. Understanding variant distribution is important, since the likelihood of representational difficulties increases as the distance between any two variants decreases. Furthermore, in extreme cases, too many variants within a region can lead to a reference allele bias due to e.g., mapping biases24,25. In addition, high variant density can result from tandem repeats and gene conversions, which can cause mapping errors26. Conversely, read-based phasing of variants becomes more challenging as distance between heterozygous variants increases27,28,29, which can cause diploid assembly errors6.To calculate distance between variants, we again use the draft GIAB benchmark based on the HG002 Q100 diploid assembly aligned to GRCh38. Supplementary Fig. 4 depicts the distribution of the genomic distance between any two consecutive variants for all autosomal chromosomes. This draft benchmark is a relatively comprehensive characterization of variants, though variant density can depend on how variants are represented. To assess how accuracy depends on distance between variants, we analyzed variants called with DeepVariant for HG002 sequenced with Illumina HiSeq X platform (Fig. 5g).We observed that when variants are close to each other (within 1-10 bp of another variant), the quality of called variants is lower than those variants that are further apart (100–1000 bp), with SNVs having the highest precision (25.5 Phred) with recall (19.8 Phred). Of note, 3.4 million (49.3%) of the 6.9 million variants have a neighboring variant within 100 bp. Interestingly, around one-fifth of the variants are in repetitive regions including tandem repeats and segmental duplications, which may cause variant calling and representation challenges. However, a very high distance (>10kbp) between variants corresponds to lower quality called variants, especially lower precision. This lower precision in part appears to result from the lower density of true variants in the region, decreasing the denominator in the precision calculation, and many false positives are in clusters due to mis-mapping in segmental duplications.In addition to stratifications mentioned above, here we considered HG002 to explore the read coverage of each variant, where abnormal coverage implies that the region flanking the variant is a difficult region to align or call variants due to potential mapping challenges and/or misrepresentations (e.g., copy number variants). Additionally, the idea behind calculating coverage values is to look for new informative metrics, which can be fed as an additional feature into machine learning models to predict the quality of called variants30. Obviously, one would expect sufficient coverage leads to higher variant calling accuracy. However, too high coverage may indicate mis-mapped reads due to duplications in the individual, which can result in false variant calls. Another cause of abnormal coverage is small deletions and insertions, so we have chosen to look at two different variant types: INDELs and SNVs.Accordingly, we calculated the average read coverage of genomic positions for each variant in an Illumina-DeepVariant callset with 40x mean coverage (see Methods). As expected, precision and recall were highest near 40x coverage. Variants in regions with low coverage (<20×) had 1–2 orders of magnitude lower precision and recall, likely due to mapping and genotype errors and filtering of true variants (Fig. 5h). Very high coverage >80 also resulted in higher error rates, likely due to mapping errors resulting from duplications in HG002 and/or in GRCh38. These show that variants are most accurate when coverage is near the mean of 40× (i.e., between 20× and 60×), with a Phred score larger than 20, equivalent to precision/recall > 0.99 (Fig. 5h).

Hot Topics

Related Articles