StratoMod: predicting sequencing and variant calling errors with interpretable machine learning

Figure 1 shows the overall approach of StratoMod. For this study, we trained multiple iterations of StratoMod to showcase its ability to predict recall, precision, or Jaccard index (Fig. 1c). In the sections “Use case 1”, “Use case 2” and “Other use cases” below, we use StratoMod to predict recall of several pipeline configurations. We focused on recall in this study since this is currently difficult to predict for pipelines, and thus showcases StratoMod’s value. We also trained StratoMod to predict precision in Supplementary Note 2. In “Model Validation” below we use StratoMod to predict the Jaccard index (TP/(FP + FN + TP)) for variants in ClinVar, which we compare to the likelihood of matching variants in gnomAD being filtered. Here we predicted the Jaccard index instead of precision or recall since a filtered variant in gnomAD may correspond to either a false positive or false negative.Fig. 1: Graphical overview of StratoMod.a Conceptual framework for mapping “genomic context” to a machine-understandable value (with homopolymer length as an example). b Flow chart for the analysis pipeline, where a query callset is compared to the benchmark to identify TPs, FPs, and FNs. These are then intersected with regions and metadata describing different genomic contexts, producing a mapping between variant calling results (TP, FP, FN) vs genomic context which are used as labels and features in the model respectively. c This data frame is then filtered for different labels and is assigned positive (“+” for TP) or negative (“−” for FN and/or FP) class depending on the desired prediction. Subsetting to the benchmark variants (FN + TP), query variants (FP + TP), or both (FN + FP + TP) will predict recall, precision, or Jaccard index respectively. SegDups Segmental Duplications, LINE long interspersed nuclear element, SINE short interspersed nuclear element.In each case, we trained two individual models for INDELs and SNVs separately. Furthermore, all false positive (FP) and false negative (FN) errors were determined with respect to either the GIAB v4.2.1 benchmark VCFs or our new T2T-HG002-Q100 assembly-based benchmark (see “Methods” section) using GRCh38 as the reference. Lastly, we simplified the analysis by splitting multi-nucleotide variants, removing SVs (variants larger than 50 bp), genotype errors, and any variants that appeared in the MHC regions from both the benchmark and query VCF input files.Use case 1: predicting recall in Illumina PCR-free and PacBio HiFi variant calling pipelinesWe first asked if StratoMod could be used to assess the likelihood of missing a variant given its genomic context and the read length used to assess the variant. To this end, we trained StratoMod using our new draft assembly-based benchmark for HG002, which is based on the near-perfectly complete Q100 HG002 assembly5 and thus allows us to assess variation in some of the most difficult-to-analyze regions of the genome (Fig. 2a). Furthermore, we used the trained models to predict recall for pathogenic/likely pathogenic ClinVar variants, which should provide useful insight to diagnostic test developers and clinicians who may be concerned about the likelihood of missing a potentially lethal variant in a patient.Fig. 2: StratoMod found context-specific regions where variants are likely to be missed in either HiFi or Illumina analysis pipelines.a Experimental overview b IGV images depicting false negative calls identified by this model. c–f EBM interaction feature plots for SNV hard-to-map regions (c), SNV LINEs and segmental duplications (d), INDEL hard-to-map regions (e), and INDEL tandem repeats/indel length (f). LINE long interspersed nuclear element. Error bars and ribbons around step plots are model errors.Identifying driving features that influence recallWe trained separate models for SNVs and INDELs using VCFs generated using DeepVariant with GRCh38 as the reference and reads generated from HG002 using either Illumina PCR-free or Pacbio Hifi. Each model was trained on false negatives (FN, defined as variants in the benchmark that are missing or have an incorrect allele or genotype in the query)23 and true positives (TP, defined as variants matching a benchmark variant and genotype) variant call classifications, as reported by vcfeval using our Q100 HG002 assembly-based benchmark as the truth set. The model utilized 22 main effect (univariate) features, including 1 categorical feature denoting the sequencing platform (Hifi or Illumina). These main effect features included quantified characteristics of homopolymers, tandem repeats, segmental duplications, and other repetitive elements, many of which we have seen cause mapping errors and/or sequencing errors which will lead to missed variants. Each model also included interaction terms between the 22 main effects and the Hifi/Illumina categorical feature, allowing the model to show the behavior of each main effect conditional on technology (see “Methods” section and Supplementary Table 1 for a complete list of features).We show two examples of such variants that might be predicted using StratoMod in Fig. 2b. The left panel depicts a variant in a LINE (long interspersed nuclear element) that Illumina missed but Hifi called correctly, presumably because long reads were able to map correctly to the LINE. The right panel depicts a homopolymer where the variant was called correctly from Illumina data, but DeepVariant incorrectly classified the candidate variant as a homozygous reference from HiFi data, presumably because HiFi data is noisier due to the homopolymer.StratoMod enables us to systematically quantify error likelihood for variants such as these using feature profiles (Fig. 2c–f). Each profile is actually the addition of three features output by the model: the main effect of the sequencing platform (color), a region type such as tandem repeat length (x-axis), and the interaction of the two (divergence of the two colors in the profile). The y-axis is the log2 odds ratio of the predicted recall (i.e. an increase in 1 corresponds to twice as likely to find a true variant). For SNV and INDELs, we found that hard-to-map regions were unsurprisingly predictive of higher error rates (Fig. 2c, e). Furthermore, the increase in error rate was less for Hifi than Illumina, likely due to longer read lengths provided by Hifi assisting in mapping. For SNVs, we also observed that variants in a LINE were more likely to be missed than those not in LINEs, and the difference in predicted score between Illumina and HiFi became larger as length increased (Fig. 2c). There was a spike in the FN rate around 6000 bp long (corresponding to full-length L1 LINEs), possibly because full-length LINEs are more recent and similar to each other. Interestingly, segmental duplications for SNVs did not show a wide divergence between Illumina and Hifi, although the error rate increased with increasing similarity. This could be partly explained by the difference in the difficult-to-map effect in Fig. 2c between Hifi and Illumina, which overlaps with the segmental duplications effect in Fig 2d (as segmental duplications are also hard-to-map generally). For INDELs we observed that increasing tandem repeat length generally increased error rate, and Hifi’s error rate decreased slower, in particular as repeat length exceeded the length of a short-read (100 s) (Fig. 2f). Furthermore, long deletions and insertions generally had higher error rates, and Hifi had a lower error rate relative to Illumina for long insertions, likely due to the fact that mapping becomes harder as insertions become longer. While these only represented a subset of all features tested, they represent many of the high-leverage features that contribute to the resulting prediction (Supplementary Fig. 1a). In total, these profiles support what we already know about short vs long reads given our intuition about mappability and long repetitive regions; however, they further enable better precision for understanding both the error rate and the types of regions that lead to a given error rate. An important caveat is that these profiles can change substantially depending on the mapping and variant calling methodology, as we show below for linear vs. graph-based references.Performance assessment and prediction of missing clinically relevant variantsEach model was trained using an 80/20 test split on HG002 (Ashkenazi Jewish ancestry) (Fig. 2a); we additionally tested the model on HG005 and HG007 (Han Chinese ancestry) to assess its generalizability to other genomes. Since we did not have an assembly-based benchmark for these, we used our v4.2.1 benchmarks instead and also tested the v4.2.1 benchmark for HG002 which we expected would give us similar results to the assembly-based HG002 benchmark (which is mostly a superset of the v4.2.1 benchmark). We observed that both precision and recall (note these two metrics refer to the model’s classification performance and are distinct from the predicted precision and recall which describe StratoMod’s output) were generally similar and high between all models and genomes (Supplementary Fig. 1b). SNV performance for HG002 Q100 was near 1.0 for the area under the precision-recall (PR) curve and about 0.96 for the area under the receiver-operator (ROC) curve for both Illumina and Hifi (Supplementary Fig. 2d). INDEL performance was lower (PR = 0.84–0.86 and ROC = ~0.985). These metrics were similar for HG002 between the two different benchmarks, with Illumina INDELs being slightly higher for v4.2.1. The other genomes were also similar, again with the exception of Illumina INDELs were slightly more performant. We also noted that when stratifying the ROC and PR curves by platform and variant type, the curves overlapped almost perfectly, indicating the performance across the two platforms is more-or-less equal (Supplementary Fig. 2e). Together these indicate that the models are well-trained and that feature profiles are describing trends in the data reasonably well.Assessment of pathogenic ClinVar variantsWe next assessed the ability of our model to predict the likelihood of missing clinically relevant variants with either Illumina or Hifi reads. We fed a ClinVar VCF through our model and extracted the probabilities of missing each variant. We also examined the contribution of each feature to those probabilities to “explain” why some variants were likely to be missed. To focus on variants of more clinical interest, we only considered variants marked as “pathogenic” or “likely pathogenic” in the ClinVar VCF. We then bisected variant predictions to those greater or less than 90%, with the former being deemed “detected” and the latter “missed”. The models were well-calibrated around this 90% cutoff, indicating that our cutoff should roughly correspond to the real-world likelihood of missing a variant (Supplementary Fig. 1a). Note that the vast majority of variants were above this 90% threshold (Supplementary Fig. 2b, c) thus the variants below this cutoff represented the lower tail of the hardest variants to call.We then plotted missed variants where Illumina and Hifi predictions disagreed (Fig. 3a) and agreed (Fig. 3b) with each other along with the relative contributions from each feature. We found that Hifi had many fewer variants lower than our 90% cutoff for both INDELs and SNVs (Fig. 3a). Furthermore, variants for which Illumina and Hifi disagreed had different features driving their predictions. Illumina’s predictions were largely driven by segmental duplications, mappability features, tandem repeats, and indel length (in the case of indels). The few errors in Hifi by contrast were mostly in homopolymer regions. For cases where Illumina and Hifi both missed variants, the feature contributions were largely similar (Fig. 3b). We also noted that Hifi tended to assign higher probabilities overall, particularly in the case of SNVs where Hifi was as confident as Illumina or more but the reverse was not true (Fig. 3c). For INDELs this was also largely true except for SNVs >90% where Illumina was sometimes more confident than Hifi. Together these indicate that while Hifi is overall much less likely to miss variants (particularly in hard-to-map regions) when using DeepVariant with a standard linear reference, it still might miss errors in homopolymers that would otherwise not be missed with Illumina.Fig. 3: Assessment of ClinVar variants predicted to be missed using Illumina and HiFi calls from DeepVariant.a, b Heatmap of variants with a 10% or higher likelihood of being predicted as missed unshared between Illumina/Hifi (a) or shared (b) showing contributions of each feature within the model. Note that all interaction terms were added to their univariate main effect features to simplify visualization, and also note that negative scores contributed positively to a missed variant. c Predicted probabilities between Illumina and Hifi models. Red dotted red line is equal probability for Illumina and Hifi. Blue dotted lines are 90% probability cutoffs for either platform d ClinVar variant that is correctly predicted to be missed in Illumina-DeepVariant and as a TP in HiFi-DeepVariant due to being in a difficult-to-map region in a segmental duplication in PI4KA.Most predicted missed pathogenic and likely pathogenic variants were explained both by low mappability and being in a highly similar/duplicated segmental duplication, with more than 20 FN SNVs and INDELs predicted for Illumina-DeepVariant and/or HiFi-DeepVariant in genes PKD1, PMS2, NF1, CYP21A2, CHEK2, NEB, FLG, SDHA, ABCC6, SMN1, STRC, KMT2D, CYP11B1, CDKN1C, PIK3CA, PROSS1, HYDIN, and TUBB2B, as well as genes that are falsely duplicated in GRCh38 (CBS, with fewer FNs in KCNE1, U2AF1, and CRYAA), with full list of genes and counts of predicted FNs in Supplementary Data 1 and Supplementary Note 1. This is plausible given that these two region types should overlap significantly and also should lead to mismapped reads and hence missed variant calls. Interestingly, HiFi-DeepVariant had at least 5-fold fewer predicted missed variants in all of these genes except CBS, CYP21A2, NEB, ABCC6, and HYDIN, which were predicted to be challenging for both technologies. However, this finding may not apply to all methods, such as methods that mask the falsely duplicated regions in GRCh38 that cause mapping challenges in CBS and other genes.In addition to segmental duplications, some variants predicted to be missed were related to INDEL length, tandem repeats, and other difficult-to-map regions, indicating that multiple error mechanisms may be at play. For example, potentially missed variants were predicted for INDELs in VNTRs in ACAN, COL6A1, F7, MYO7A, AGRN, and CDKN1C (some of which are inside an exon and some partly exonic but mostly intronic), as well as large insertions in trinucleotide tandem repeats like DMPK, PHOX2B, and RUNX2, where expansions are associated with disorders. The remaining predicted missed variants were large INDELs in non-repetitive regions (a few incidentally in homopolymers), except for a cluster in a few hundred bp regions of ANKRD11 that is identical to a region on chrX but can be mapped accurately with most paired-end reads. The full list of genes and counts of predicted missed variants is in Supplementary Data 1.Many of the above-identified genes, tandem repeat regions, and variant types have previously been known to be challenging. In addition to highlighting general genes and characteristics of variants, this model predicted particular pathogenic variants that may be missed by a particular sequencing and bioinformatics method (in this case, 35× Illumina-DeepVariant PCR-free WGS), which may not be representative of all Illumina-based methods. This model could be used similarly to predict variants that might be missed by any method that has been used to call variants from benchmark samples like those available from GIAB.Use case 2: comparing linear vs graph-based mappersWe next hypothesized that StratoMod’s feature set would enable it to differentiate the performance of different mappers, particularly linear and graph-based mappers because graph-based mappers have been developed to improve results in difficult-to-map regions and variants24,25.To test this, we trained StratoMod using two Element DeepVariant callsets which used either BWA mem26 (linear) or VG27 (graphical) mappers prior to running DeepVariant. The model was trained using TP as the positive class and FN as the negative class and thus learned to predict where variants were most likely missed. Between VG and BWA, the models were generally similar except for a few key features that corresponded to hard-to-map regions, and the performance/calibration of each model was reasonable (Fig. 4a, b, Supplementary Fig. 3a). As expected, VG’s error rate increased less than BWA’s when moving from non-hard-to-map vs hard-to-map (either 100 or 250 bp long) for both SNVs and INDELs. Furthermore, VG’s error rate increased less with increasing LINE length for both SNVs and INDELs, corresponding to the intuition that longer LINEs should be harder to map. Finally, in the case of INDELs BWA’s error rate dropped when similarity increased above 92% while VG dropped less at the same point, again corresponding to the intuition that more similar segmental duplications are harder to map. Each of these depicted features was representative of a key subset of features whose main effect and interaction with the two mappers contributed to a large component of the model’s variability (Supplementary Fig. 3b).Fig. 4: StratoMod correctly predicts the advantage for VG in hard-to-map regions.a, b Prominent hard-to-map feature profiles. Error bars and ribbons around step plots are model errors. c Prediction correlation plot between VG and BWA where either BWA or correct and VG is incorrect (top) or the reverse (bottom). Dotted line y = x (perfect correlation). Note these are only for SNVs but the INDEL plots were almost identical. “Hard regions” included LINEs, SINEs, LTRs, segdups, and hard-to-map regions 100 and 250 bp long d IGV screenshot depicting an instance where VG is correct and BWA is incorrect. In this case, the variant appears in this copy of the segdup but not others for a large fraction of haplotypes in the graph (bottom track).We next asked what distinguished the variants for which the two mappers disagreed. We plotted StratoMod’s predicted recall for cases where BWA was correct and VG was incorrect and vice versa (Fig. 4c, Supplementary Data 3, 4). These probabilities were reasonably calibrated for both SNVs and INDELs (Supplementary Fig. 3c). Interestingly, in either case using VG often led to higher predicted recall than BWA (by as much as 50%), even when it was incorrect according to our benchmark. The reverse (using BWA lead to higher predicted recall than VG) appeared to never be true. When stratifying these plots by those in “hard” regions (LINEs/SINEs (short interspersed nuclear element)/LTRs (long terminal repeats)/Segdups/Hard-to-map 100 and 250 bp long), we saw that this increased predicted recall was due to VG performing better in these particular region types.We next manually investigated cases where the two mappers disagreed, and highlighted a representative case where VG was correct and BWA was incorrect due to a segmental duplication (Fig. 4d). This variant was present in HG002’s genome (top benchmark track) but the self-chain alignment of the other copy of this segmental duplication to this region shows that it matches this variant, causing Element reads with the variant to map to the wrong copy using a linear reference. However, because the variant is also present in most of the HPRC assemblies in the graph (bottom track), VG was able to leverage this graph to accurately map reads to this region rather than the other copy of the segmental duplication.Taken together, these demonstrate how StratoMod can be used to compare and assess the performance of different mappers given genomic regions with different mapping characteristics.Other use casesAssessment of new sequencing technologiesNew sequencing technologies are under active development, and methods to understand the strengths and weaknesses of these technologies are important, so we next tested if StratoMod could be used to measure the progression of Ultima’s new sequencing platform that promises to be less expensive and has been steadily improving up to their first product launch in 202428. We compared their latest callsets (“R2024”) from the UG100 to those from 2022 (“R2022”), both called by DeepVariant. These models were trained analogously to those in use case 1 (same feature set, benchmark, and genome) and were trained to predict recall.We particularly investigated homopolymer errors since these features were dominant in the model’s predictions (Supplementary Fig. 4a, b). For AT homopolymers, the error rate climbed precipitously after 8 and 10 bp for SNV and INDEL respectively. For GC homopolymers the error rate rose almost immediately after 4 bp (which was our minimum for defining a homopolymer). Furthermore, Ultima’s latest iteration had notably higher overall accuracy compared to their previous iteration, as noted by the translational increase in the R2024 curve relative to the R2022 curve in each of the plots and the “VAR_seqtech” variable (which is the categorical variable distinguishing the two releases) being near the top in the feature importance plots (Supplementary Fig. 4b). Subtly, their newer iteration showed a slight advantage for longer homopolymers (particularly INDELs) where the distance between the curves grew with increasing homopolymer length.In total, these data demonstrate how StratoMod can be used to assess the progress of emerging technologies; in this case, the overall error rate dropped considerably for the latest Ultima iteration, with subtle performance differences conditional on region type (such as long homopolymers).Assessment of variant calling pipeline improvementsIn addition to the platforms themselves, the software used to call variants from data created on these platforms is constantly improving. We used StratoMod to assess improvements in predicted recall for different versions of guppy and clair, the base and variant caller respectively for the ONT sequencing pipeline. Specifically, we assessed guppy4+clair1 and guppy5+clair3 in combination. These were trained analogously to use case-1 except that we used HG003 as the benchmark.We again investigated homopolymer errors since this is a well-known error modality for ONT callsets. Overall error rates were much lower for the newer software as expected (Supplementary Fig. 5, Supplementary Note 3) with the improvement being much more pronounced for INDELs. In the case of SNVs, much of the performance improvements were independent of homopolymer length and imperfect fraction. However, the INDELs, the predicted recall for the older callers degraded much more rapidly for homopolymers longer than 8 bp and 6 bp for AT and GC respectively (Supplementary Fig. 5a). The gap between the older callers and the newer callers decreased with increasing imperfect fraction, especially for AT homopolymers (Supplementary Fig. 5b). Despite this, the older callers were still inferior to the newer callers by at least a 5x margin for each imperfect fraction. These suggest that the homopolymer-specific error mechanism in the older ONT callers (which the newer callers improved) was due to miscounting longer homopolymers. This would affect INDELs more than SNVs since vcfeval will only count a variant if it matches completely. This also would be somewhat negated in homopolymers with higher imperfect fractions since imperfect bases “interrupt” a stretch of otherwise similar bases as they are going through the nanopore, which may make counting the number of bases easier. In fact, one approach ONT has announced to improve accuracy in homopolymers (the 6b4 chemistry) is to modify bases inside the homopolymer.Together, these indicate how StratoMod is able to assess improvements in variant/base callers and provide insight into the mechanism for these improvements.Model validation: comparison between ClinVar predictions and gnomADWe finally sought to validate the use of our model’s predictions of ClinVar variants that may be missed or erroneously called by short reads. Given that no benchmark exists for most ClinVar variants from which we can derive “labels” (TP, FP, etc), we hypothesized that the probability of our model should be correlated with the likelihood of a variant in gnomAD v4.0 being filtered. gnomAD may filter a variant for similar reasons that StratoMod may assign a low probability, albeit with a different model and different input data. We trained similar models to those used in “use case 1” except we also included FP (along with FN) as the error label since errors in gnomAD could correspond to either an FP or an FN error in StratoMod (note that this is the Jaccard index, or TP/(TP + FN + FP) (Fig. 1c)). Previously we limited ourselves to FN to demonstrate StratoMod’s ability to predict recall. We also trained StratoMod using an Illumina callset to be comparable to gnomADs underlying sequencing data, albeit with different variant calling methodologies (both used BWA mem, and the training set used DeepVariant). Thus the interpretation of StratoMod’s output is “the probability that ClinVar variants of interest will be correctly called using Illumina-BWAMEM-DeepVariant.” Many ClinVar variants did not intersect with any gnomAD variants (Supplementary Fig. 13a); since many missing variants are likely very rare and not in any gnomAD samples. While some of the missing variants are in challenging regions, many are not, so the rest of the analysis only concerns the fraction of ClinVar variants that intersected with either a filtered or passing gnomAD variant.We first created “agreement plots” (similar to calibration plots in machine learning) which depict the binned fraction of gnomAD non-PASS variants vs the mean predicted error rate (which is 1 – StratoMod’s output) for that bin (Fig. 5a). In general, StratoMod and gnomAD corresponded reasonably well, with StratoMod predicting slightly fewer errors than gnomAD’s filtering (i.e. assigned a lower probability than the fraction of non-PASS variants) for the majority of variants for both SNVs and INDELs. Note that for both, the curves were noisier moving toward the lower-left corner since the number of variants in each bin decreased with decreasing probability. To explain why Stratomod predicted fewer errors, we further hypothesized that gnomAD is filtering true variants with low allele count (AC) in a region type that StratoMod predicted would be easier than the number of filtered variants would suggest. We tested this by stratifying the agreement plots by allele counts greater than/less than 10 (red vs blue lines) and indeed noted that for the majority of variants, StratoMod predicts fewer errors for low allele counts.Fig. 5: Validation of ClinVar model predictions using gnomAD.a Agreement plots between model predicted error rate and fraction of gnomAD non-PASS variants using 20 bins and stratified by gnomAD allele count. Each dot is a binned model probability, where the x position is the mean probability and the y position is the fraction of non-PASS in that bin. The “x” points are all variants that had no features in the model (and thus their probabilities are nearly identical) and the remaining points are variants with at least one feature. Red dotted line is y = x (perfect agreement). Data shown does not include the bottom 5% of probabilities, as these were sparse enough to appear as noise. b Correlations plots between mean model probability and a fraction of non-PASS variants in segmental duplications (SegDups). Each point represents one SegDup region and only regions with >10 variants are included. c Probability histograms for gnomAD PASS and non-PASS variants stratified by allele count (shown as an order of magnitude, so “AC 10” means “allele counts between 10 and 99”). Dotted red line is the median model probability for each histogram.We further tested this hypothesis by curating some of the low allele count variants in the gnomAD browser. We found that these variants mostly fell in two categories: (1) variants in only one or two whole genome samples that did not have evidence of mapping or sequencing errors so appeared likely to be true, and (2) variants in homopolymers that were likely sequencing errors because they only occurred in one direction (Supplementary Fig. 14). The latter variants were likely predicted to have a low error rate because StratoMod features do not include characteristics of the reads such as strand bias. Furthermore, our model is designed to distinguish between true variants and errors in Illumina-DeepVariant calls, whereas gnomAD’s GATK VQSR-based filtering is designed to remove likely false variants in population callsets from 100,000’s of individuals. In summary, these curated examples explain nuanced differences due to rare variants and feature sets used, but the model is generally consistent with gnomAD filtering despite these differences, as shown in Fig. 5a.Furthermore, the “x” points in the plots denote variants that intersect with no features corresponding to challenging regions in StratoMod (or only INDEL length in the case of INDELs), and thus have nearly the same probability. Surprisingly a large number of variants with features had higher predictions than those without, which was likely due to our feature set not including all possible features that could be used to measure “difficulty”, and the inclusion of some features correlated with higher rates of true variants like short homopolymers (Supplementary Note 2 and Supplementary Fig. 15). These also followed the same pattern, with lower allele counts corresponding to StratoMod being more overconfident.Note that in the case of SNVs, the majority of variants were in the lower-left corner (note the large dot, denoting most of the variants are in this range), which shows gnomAD had more filtered variants than predicted by StratoMod for AC < 10 vs ≥10, in line with that for INDELs (albeit with less magnitude). However, for SNVs with probabilities between 0.025 and 0.075, gnomAD had fewer filtered variants than predicted. We hypothesized that this shift reflected a confound in the type of region represented between these different allele count stratifications and within these probabilities. To test this, we assessed the fraction of each region type (tandem repeat, segmental duplication, etc) in each bin and noted that for SNVs, segmental duplications are over-represented for variants with AC < 10 vs ≥10 for the range of probabilities between 0.025 and 0.075 (Supplementary Fig. 13b). Because StratoMod assigns a lower probability to variants in segmental duplications vs those not, this explains why gnomAD has more filtering than predicted for the AC > = 10 curve (i.e. StratoMod thinks these variants are “easier” as they are not in segmental duplications) vs the AC < 10 curve despite gnomAD more aggressively filtering variants with low allele count.We also assessed the fraction of gnomAD non-PASS variants in segmental duplications vs StratoMod’s predicted error rate (Fig. 5b). We observed a good correlation between gnomAD and StratoMod for both INDELs and SNVs. Note that in these plots we are only showing segmental duplications with 10 variants or more (regions with few variants tended to be noisier since they have fewer variants from which to estimate the x and y mean). In agreement with the segmental duplications discussed above in Fig. 5a, in the case of SNVs we observed several segmental duplications for which StratoMod predicted fewer errors compared to gnomAD in the 90–100% range. Furthermore, unlike Fig. 5a which only depicts the top 95% of variants (otherwise bins in the lower probability range would be too noisy given how little data they would contain), Fig. 5b shows variants in segmental duplications for which the StratoMod probability was less than 80%, and indeed shows that even for these lower ranges, there was high agreement between StratoMod and gnomAD.Finally, we further assessed the relationship between allele count and StratoMod’s probability by plotting histograms stratified by PASS/non-PASS and increasing allele count (Fig. 5c). We observed that probability is near 1 for the vast majority of variants in the PASS case and not so for the non-PASS case, in agreement with Fig. 5a, b. Furthermore, the median probability (dotted red line) appears to decrease with increasing allele count. This effect is very subtle in the case of SNVs (dropping by ~2x log-odds from the lowest to highest) and much higher in the case of INDELs (~8x drop from lowest to highest), also in agreement with the apparent magnitude between low and high allele counts between SNVs and INDELs in Fig. 5a.

Hot Topics

Related Articles