VolcanoSV enables accurate and robust structural variant calling in diploid genomes from single-molecule long read sequencing

We first investigated SV detection using four assembly-based methods (VolcanoSV (v1.0.0), PAV (freeze2), SVIM-asm (v1.0.2), and Dipcall) in 14 PacBio HiFi, CLR, and ONT datasets, nine simulated long-read datasets, and two paired tumor-normal CLR and ONT datasets. For HiFi data, three assembly-based SV callers (PAV, SVIM-asm, and Dipcall) could use as input the diploid assembly produced by hifiasm (v0.16) [26]. For CLR and ONT data, Flye (v2.9-b1768) [27] plus HapDup (v0.5-iss10) [28] were used to generate a dual assembly for the three assembly-based tools. We selected hifiasm and Flye plus HapDup to generate assemblies because they provided the best assembly results for SV calling [29]. VolcanoSV employed its own haplotype-aware assembly component (VolcanoSV-asm) to produce a diploid assembly. To further demonstrate VolcanoSV's robust performance across different SV evaluation thresholds, we also compared the SV calls of the four assembly-based methods in terms of breakpoint identification and SV sequence accuracy.

Among the 14 long-read sequencing datasets (Table 1), the five PacBio HiFi datasets were referred to as Hifi_L1, Hifi_L2, Hifi_L3, Hifi_L4, and Hifi_L5, with approximately 56×, 30×, 34×, 28×, and 41× coverage, respectively. The three PacBio CLR datasets were referred to as CLR_L1, CLR_L2, and CLR_L3, with 89×, 65×, and 29× coverage, respectively. We also used six ONT datasets, referred to as ONT_L1, ONT_L2, ONT_L3, ONT_L4, ONT_L5, and ONT_L6, with approximately 48×, 46×, 57×, 36×, 47×, and 51× coverage, respectively. More information on each SV caller and dataset is provided in Table 1.

VolcanoSV utilizes a reference genome and long-read data to generate a high-quality haplotype-resolved diploid assembly, from which all types of variants are comprehensively detected. The VolcanoSV pipeline is illustrated in Figs. 1 and 2, and a detailed description is provided in the Methods section.

Table 1. Resources for different tools and long-read datasets.

Fig. 1: VolcanoSV overall workflow. The main workflow of VolcanoSV consists of two key components, VolcanoSV-asm and VolcanoSV-vc. VolcanoSV-asm (left, blue square) comprises three conceptual modules to perform diploid assembly (partitioning reads into corresponding haplotypes, assigning unphased reads, and performing haplotype-aware local assembly). The output of this component is processed by the VolcanoSV-vc component (center, red rectangle) to perform variant detection. Further details are provided in the Methods section.

Fig. 2: VolcanoSV-vc workflow. VolcanoSV-vc includes three main modules: (a) large indel SV detection, (b) complex SV detection, and (c) small indel detection. The output of this component is a phased VCF file. Further details are provided in the Methods section.

VolcanoSV exhibits exceptional performance across 14 real long-read datasets

To assess the performance of insertion and deletion SV detection, we applied the four assembly-based tools (VolcanoSV, PAV, SVIM-asm, and Dipcall) across 14 long-read libraries of HG002 and evaluated their results against the Genome in a Bottle (GIAB) SV gold standard [23]. An SV benchmarking tool, Truvari (v4.0.0) [30], was employed to compare the SV calls of each tool with the GIAB SV gold standard. Truvari evaluates SVs within Variant Call Format (VCF) files by analyzing four essential similarity metrics (reference distance, reciprocal overlap, size similarity, and sequence similarity) across all SV pairs within a designated region, while also ensuring that SV type and genotype match between compared SV pairs. If any of these metrics fails to meet its user-defined threshold, the SV pair is rejected as a candidate match. The following parameter setting was used in Truvari: p = 0.5, P = 0.5, r = 500, O = 0.01, representing a moderate-tolerance parameter set for evaluating SV calls. Specifically, parameter p, or pctsim, ranging from 0 to 1.0, controls the minimum allele sequence similarity required for two compared SV calls to be identified as the same. Parameter P, also known as pctsize, ranges from 0 to 1.0 and defines the minimum allele size similarity required between the two SVs. Parameter O, also referred to as pctovl, ranges from 0 to 1.0 and determines the minimum reciprocal overlap ratio between the base and comparison call; it applies only to deletions and can be used to evaluate their breakpoint shift. Parameter r, or refdist, ranging from 0 to 1000 bp in our experiments, sets the maximum reference location difference allowed between the SVs being compared and can be used to evaluate the breakpoint shift of insertions.
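For concreteness, the Truvari invocation corresponding to this setting can be sketched as follows. This is a minimal sketch, assuming Truvari 4.x flag names (the sequence similarity threshold is exposed as --pctseq, formerly --pctsim); file paths are placeholders.

```python
import subprocess

# Sketch of a Truvari bench run with the moderate-tolerance setting
# described above. Paths are placeholders; both VCFs must be
# bgzipped and tabix-indexed.
subprocess.run([
    "truvari", "bench",
    "-b", "giab_hg002_sv_benchmark.vcf.gz",  # GIAB gold standard (base)
    "-c", "caller_output.vcf.gz",            # SV calls to evaluate
    "-o", "truvari_out",
    "--pctseq", "0.5",    # p: minimum allele sequence similarity
    "--pctsize", "0.5",   # P: minimum allele size similarity
    "--refdist", "500",   # r: maximum reference location difference (bp)
    "--pctovl", "0.01",   # O: minimum reciprocal overlap (deletions only)
], check=True)
```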
We first determined the average performance across the different PacBio HiFi, CLR, and ONT datasets for the four assembly-based tools (Table 2). Across HiFi datasets, VolcanoSV achieved the best average F1 scores (91.03% and 94.19%) and genotyping accuracy (98.32% and 99.01%) for insertions and deletions, respectively. Across CLR datasets, VolcanoSV achieved the best average F1 scores (89.72% and 93.70%) and genotyping accuracy (97.07% and 98.58%) for insertions and deletions. On ONT datasets, VolcanoSV also attained the best average F1 scores (90.10% and 93.13%) and genotyping accuracy (98.00% and 99.06%) for insertions and deletions.

Table 2. Performance across 14 datasets for three different types of long-read data.

When we examined each dataset (Fig. 3, Table 2, and Supplementary Tables 1–3), VolcanoSV consistently outperformed all other tools, achieving the highest F1 scores for both insertions and deletions in all 14 libraries. For the five HiFi datasets (Fig. 3, Table 2, and Supplementary Table 1), VolcanoSV ranked highest on all performance metrics. Specifically, for insertions, VolcanoSV consistently surpassed all other tools across all metrics, with F1 score, recall, precision, and GT concordance outperforming the second-ranked tool by an average of 1.29%, 0.67%, 1.92%, and 0.59%, respectively. With respect to deletions, VolcanoSV sustained its advantage, demonstrating an average superiority of 1.07% in F1 score, 0.48% in recall, 1.52% in precision, and 0.53% in GT concordance over the second-ranked tool.

Fig. 3: Cross-dataset evaluation against the GIAB HG002 benchmark. (a, b) F1 heatmaps for deletions (DEL) and insertions (INS) by four assembly-based tools. (c, d) Recall bar plots for deletions (DEL) and insertions (INS) by four assembly-based tools. (e, f) Precision bar plots for deletions (DEL) and insertions (INS) by four assembly-based tools. (g, h) Genotype accuracy (represented by GT_concordance) bar plots for deletions (DEL) and insertions (INS) by four assembly-based tools. Source data are provided as a Source Data file.

In the three CLR datasets (Fig. 3, Table 2, and Supplementary Table 2), VolcanoSV stood out as the top performer across all metrics and libraries, with distinct advantages. For insertions, VolcanoSV's F1 score, recall, precision, and GT concordance were 3.30%, 0.87%, 4.61%, and 4.20% higher, respectively, than those of the second-ranked tool. Likewise, for deletions, VolcanoSV outperformed the second-ranked tool by an average of 4.87%, 6.19%, 3.19%, and 1.71% in F1, recall, precision, and GT concordance, respectively. It is noteworthy that CLR data exhibited a significantly higher error rate, varying between 10% and 20%, and PAV, SVIM-asm, and Dipcall performed significantly worse on PacBio CLR data than on HiFi datasets. Effectively eliminating false positive calls is a crucial step following SV detection; VolcanoSV incorporates a precise SV filtering procedure and an advanced GT prediction model within its workflow, resulting in a notable performance enhancement over all other tools.
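The headline metrics reported throughout this section reduce to simple ratios over Truvari's match counts. A minimal sketch follows, assuming the JSON key names used by recent Truvari releases ("TP-base", "TP-call", "FP", "FN"), which may differ in other versions.

```python
import json

def sv_metrics(summary_path: str) -> dict:
    """Recompute recall, precision, and F1 from a Truvari summary.json."""
    with open(summary_path) as fh:
        s = json.load(fh)
    recall = s["TP-base"] / (s["TP-base"] + s["FN"])      # found / all true SVs
    precision = s["TP-call"] / (s["TP-call"] + s["FP"])   # correct / all calls
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}

# GT concordance is the fraction of true positive calls whose predicted
# genotype matches the benchmark genotype:
#   gt_concordance = TP_with_matching_GT / TP_total
```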
For the six ONT datasets (Fig. 3, Table 2, and Supplementary Table 3), VolcanoSV still maintained a substantial lead. In terms of insertions, VolcanoSV outperformed the second-ranked tool, with an average F1 score and precision that were 1.5% and 2.68% higher, respectively. Regarding insertion recall, on ONT_L3–L5, VolcanoSV's recall was 0.38% higher on average than that of the second-ranked tool. On ONT_L1 and ONT_L6, VolcanoSV exhibited the second-highest recall, on average only 0.14% below the top recall; on ONT_L2, however, VolcanoSV demonstrated only the third-highest recall, 1.03% below the top recall. For insertion GT concordance, VolcanoSV achieved a 0.29% higher GT concordance than the second-ranked tool, except on ONT_L2, where SVIM-asm reached the highest GT concordance with a margin of 0.16% over VolcanoSV. In terms of deletions, VolcanoSV outperformed the competition with an average F1 score, precision, and GT concordance that were 0.89%, 1.87%, and 0.62% higher, respectively, than those of the second-ranked tool. For deletion recall, on ONT_L1, L3, L4, and L6, VolcanoSV's recall was 0.13% higher on average than that of the second-ranked tool; on ONT_L2 and L5, however, PAV and SVIM-asm attained the best recall, exceeding VolcanoSV's by 0.18% on average.

In summary, VolcanoSV emerged as the top-tier choice for assembly-based SV detection across different long-read datasets, exhibiting superior performance and consistency, particularly on PacBio HiFi and CLR datasets, in terms of F1 score, recall, precision, and GT concordance. VolcanoSV also demonstrated its superiority on ONT datasets in terms of F1 score, precision, and GT concordance; with respect to insertion and deletion recall, it achieved the best recall in 3–4 out of 6 ONT datasets.

SV annotation

Although VolcanoSV further improved F1 scores and genotyping accuracy compared to the second-ranked tool in each dataset, the potential impact of this improvement on downstream analyses remained unclear. To delve deeper into the unique true positive (TP) SVs with correct genotypes (GT) identified by VolcanoSV relative to the second-ranked tool, we annotated the SVs with their predicted effects on other genomic features using the Ensembl Variant Effect Predictor (VEP) [31].

For example, we extracted and annotated 300 unique TP SVs with correct genotypes identified by VolcanoSV compared to the second-ranked tool, SVIM-asm, in Hifi_L1. These SVs, comprising 266 novel and 34 known variants, overlapped 258 genes, 939 transcripts, and 120 regulatory features. Among them, 125 SVs were coding sequence variants and 16 were in-frame deletions. More information on the consequences of these SVs is provided in Supplementary Fig. 1. Additionally, we performed an exact-match comparison of these SVs, in terms of sequence and breakpoints, with those in the Genome Aggregation Database (gnomAD) [32], and identified 29 matching SVs, 12 of which are rare variants with an allele frequency (AF) of less than 1%. Furthermore, we found that these SVs allowed an additional 85 genes to be phased by VolcanoSV; genes are considered phased if all heterozygous variants within them are phased. We performed these analyses for all 14 long-read datasets (Supplementary Table 4 and Supplementary Fig. 1), and the results suggested that the improvement by VolcanoSV over the second-ranked tool in each dataset provides a substantial number of SVs with potential phenotypic effects.
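The exact-match comparison against gnomAD can be sketched as follows. This is an illustrative sketch only: it keys records on identical (chrom, pos, ref, alt) tuples, assumes sequence-resolved (non-symbolic) ALT alleles and an AF INFO field, and uses pysam as the VCF reader.

```python
import pysam  # assumes pysam is installed; VCFs are bgzipped and indexed

def exact_matches(call_vcf: str, gnomad_vcf: str, af_cutoff: float = 0.01):
    """Match SVs on identical (chrom, pos, ref, alt), i.e. exact breakpoints
    and exact alternate sequence, and flag rare variants (AF < 1%)."""
    gnomad = {}
    for rec in pysam.VariantFile(gnomad_vcf):
        if not rec.alts:
            continue
        af = rec.info.get("AF", (None,))[0]
        gnomad[(rec.chrom, rec.pos, rec.ref, rec.alts[0])] = af
    matched, rare = [], []
    for rec in pysam.VariantFile(call_vcf):
        if not rec.alts:
            continue
        key = (rec.chrom, rec.pos, rec.ref, rec.alts[0])
        if key in gnomad:
            matched.append(key)
            af = gnomad[key]
            if af is not None and af < af_cutoff:
                rare.append(key)
    return matched, rare
```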
VolcanoSV unveils numerous unique true structural variants and remains robust across SV sizes

As observed in the previous section, VolcanoSV achieved the highest recall in all PacBio HiFi and CLR datasets and in half of the ONT datasets. To further assess the overlapping calls among different tools and understand why VolcanoSV generated many more true positive calls, we employed UpSet plots to visualize the total numbers of shared and unique true positive (TP) calls for all four assembly-based tools (Fig. 4a–c). Overall, the four tools exhibited a substantial overlap in TP calls, as indicated by the rightmost bar in each plot. Specifically, on the Hifi_L2, CLR_L1, and ONT_L1 data, the four tools shared 8457, 7834, and 8151 TP calls, respectively. Notably, VolcanoSV contributed the largest number of unique TP calls. For Hifi_L2, VolcanoSV, PAV, SVIM-asm, and Dipcall had 43, 35, 4, and 1 unique TP calls, respectively. For CLR_L1, these tools had 174, 28, 11, and 4 unique TP calls, while for ONT_L1, they had 56, 19, 5, and 2 unique TP calls. The SV annotation analysis for the unique SVs identified by VolcanoSV is presented in Supplementary Table 5 and Supplementary Fig. 2.

Fig. 4: Overlapping calls, size distribution, and accuracy for SV discovery and complex SV analysis. (a–c) UpSet plots of shared and unique true positive (TP) calls between different assembly-based tools. (d–f) F1 accuracy of SV detection at different size ranges; negative size ranges represent deletions and positive size ranges represent insertions. The bar plot shows the benchmark SV distribution at different size ranges, and the line plot shows the F1 scores of the four methods. (g) F1 and GT_F1 heatmap for complex SV detection on simulated data. (h) Recall heatmap for complex somatic SV detection on real data. Source data are provided as a Source Data file.

To assess the influence of SV size on detection accuracy, we computed F1 scores for SV detection across different SV size ranges (Fig. 4d–f; a sketch of this binning follows below). In general, VolcanoSV demonstrated top-tier accuracy across most size ranges, except for insertions within the 10–50 kb range. Specifically, in Hifi_L2, VolcanoSV consistently excelled in medium and large deletions (50 bp–6 kb) and insertions (50 bp–4 kb). For very large insertions (6–50 kb), VolcanoSV's performance exhibited some variability; for very large deletions (6–50 kb), however, VolcanoSV consistently maintained top performance, with the exception of deletions within the 6–7 kb range. Notably, for extremely large deletions (10–50 kb), VolcanoSV significantly outperformed the other tools in terms of F1 scores. The performance of VolcanoSV was even more remarkable on CLR and ONT data. On CLR_L1 and ONT_L1, VolcanoSV achieved the highest F1 scores for deletions across all size ranges and for insertions within the 50 bp–5 kb range. For very large insertions (5–50 kb), VolcanoSV's performance showed some variability. It is worth highlighting that on CLR data, VolcanoSV displayed notably superior performance in detecting very large deletions (7–50 kb) compared to the other tools. Upon further analysis of the remaining 11 datasets, VolcanoSV demonstrated consistent and superior performance across different size ranges compared to the other three assembly-based tools (Supplementary Fig. 3).
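The per-size-range F1 computation behind Fig. 4d–f can be sketched as below. The bin boundaries are illustrative, and the inputs are assumed to be (SVTYPE, SVLEN) pairs extracted from the TP, FP, and FN sets of a Truvari comparison.

```python
from collections import Counter

# Illustrative size bins (bp); the sign of SVLEN distinguishes deletions
# (negative) from insertions (positive), so bins use absolute length.
BINS = [(50, 100), (100, 500), (500, 1000), (1000, 2000), (2000, 4000),
        (4000, 6000), (6000, 10000), (10000, 50000)]

def size_bin(svlen: int):
    for lo, hi in BINS:
        if lo <= abs(svlen) < hi:
            return (lo, hi)
    return None

def f1_by_size(tp, fp, fn):
    """Per-bin F1; deletions and insertions stay separate via SVTYPE."""
    counts = Counter()
    for label, records in (("tp", tp), ("fp", fp), ("fn", fn)):
        for svtype, svlen in records:
            b = size_bin(svlen)
            if b is not None:
                counts[(svtype, b, label)] += 1
    f1 = {}
    for svtype, b in {(st, bb) for st, bb, _ in counts}:
        t = counts[(svtype, b, "tp")]
        p = t / (t + counts[(svtype, b, "fp")]) if t else 0.0
        r = t / (t + counts[(svtype, b, "fn")]) if t else 0.0
        f1[(svtype, b)] = 2 * p * r / (p + r) if p + r else 0.0
    return f1
```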
VolcanoSV demonstrates proficiency in identifying complex SVs in simulated and real cancer datasets

Up to this point, we have assessed deletion and insertion SVs, which account for most SVs; however, other SV classes, such as translocations (TRA), inversions (INV), and duplications (DUP), describe different combinations of DNA rearrangements. The lack of benchmarking data for such complex SVs makes it difficult to evaluate tools. To extend the evaluation to complex SV detection, we first applied VolcanoSV and the other tools to simulated data. Dipcall was not included in this analysis since it was not designed to detect complex SVs.

We used VISOR (v1.1.2) [33] to insert SNPs and complex SVs into the hg19 human reference, and then used PBSIM (v3.0.0) [34] to simulate reads for different libraries (HiFi, CLR, and ONT). We compared the SV calls with the inserted SVs to calculate F1 and genotyping-accuracy F1 (GT_F1) scores; simulation and evaluation details are described in the Methods section. Since every reciprocal TRA has four breakends, it is hard to decide whether its genotype is correct, so we calculated GT_F1 only for INVs and DUPs. As shown in Fig. 4g, VolcanoSV demonstrated the highest F1 across all simulated libraries and complex SV types, except for TRAs in HiFi data (Hifi_TRA). Specifically, on CLR and ONT data, VolcanoSV surpassed SVIM-asm, achieving F1 scores that were on average 18%, 23%, and 63% higher for TRAs, INVs, and DUPs, respectively. Moreover, VolcanoSV's GT_F1 scores were on average 33% and 57% higher than SVIM-asm's for INVs and DUPs, respectively. For TRAs in HiFi data, VolcanoSV's F1 was only 1% lower than the best F1, achieved by SVIM-asm. PAV was not designed to detect TRAs and DUPs, and it also had the worst performance on INVs. Notably, VolcanoSV achieved much higher F1 and GT_F1 scores than SVIM-asm for DUPs in all simulated data, primarily attributable to the effective duplication recovery process implemented in VolcanoSV. With respect to GT_F1, VolcanoSV achieved the best performance on all simulated libraries and complex SV types, except for INVs in HiFi data (Hifi_INV).

We next applied VolcanoSV and SVIM-asm to two publicly available sets of tumor-normal paired libraries (PacBio CLR and ONT) provided by Talsania et al. [35], using the high-confidence HCC1395 somatic SV callset they provided as the benchmark, to further evaluate three classes of somatic complex SVs. To detect somatic SVs, we first applied each assembly-based tool to every library to generate VCF files independently, then used SURVIVOR (v1.0.6) [36] to call somatic variants from the paired tumor-normal VCFs (a sketch of this step follows below), and finally compared the somatic SV results to the provided benchmark callset. The high-confidence benchmark callset contains a total of 1777 SVs, including 551 insertions, 717 deletions, 146 translocations, 133 inversions, and 230 duplications.
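The SURVIVOR-based somatic selection step can be sketched as follows. The merge parameter values (1 kb breakpoint distance, type and strand agreement, 30 bp minimum size) are illustrative assumptions, and the SUPP_VEC parsing assumes the normal VCF is listed first.

```python
import subprocess

# Order matters: with the normal VCF listed first, SUPP_VEC == "01" below
# marks SVs supported by the tumor sample only.
with open("vcf_list.txt", "w") as fh:
    fh.write("normal.vcf\ntumor.vcf\n")  # placeholder file names

# SURVIVOR merge <list> <max dist> <min callers> <type> <strand> <dup> <min size> <out>
subprocess.run(["SURVIVOR", "merge", "vcf_list.txt", "1000", "1",
                "1", "1", "0", "30", "merged.vcf"], check=True)

# Keep candidate somatic SVs: records present in the tumor VCF only.
with open("merged.vcf") as fin, open("somatic.vcf", "w") as fout:
    for line in fin:
        if line.startswith("#") or "SUPP_VEC=01" in line:
            fout.write(line)
```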
Since this high-confidence callset is incomplete, we only report recall. VolcanoSV outperformed SVIM-asm in terms of recall across all libraries and all complex SV types, especially for DUPs (Fig. 4h). Specifically, on PacBio CLR data, VolcanoSV's recall was 23%, 18%, and 12% higher than SVIM-asm's for TRAs, INVs, and DUPs, respectively. On ONT data, VolcanoSV's recall was 13%, 9%, and 17% higher than SVIM-asm's for TRAs, INVs, and DUPs, respectively. Although VolcanoSV outperformed SVIM-asm in detecting complex SVs, its recall was relatively low compared to its performance on simulated data or on indel SVs. To understand the reason behind this low recall, and whether VolcanoSV predominantly detects common complex SVs identifiable in other long-read datasets, we analyzed the overlapping VolcanoSV calls between the paired normal and tumor HCC1395 samples and HG002. Detailed results and discussion are provided in Supplementary Fig. 4 and Supplementary Notes 1.1. Our analysis revealed that (1) VolcanoSV failed to detect a sufficient number of unique somatic complex SVs in the tumor sample beyond the germline SVs, and (2) the high-confidence benchmark callset included SV calls only from alignment-based tools, so using it to evaluate VolcanoSV might introduce bias. Overall, it remains challenging to detect complex SVs solely from assembled contigs, owing to the limitations of assembly algorithms and the complexity of graph construction. Many genome assembly algorithms build contigs by following the simplest paths through overlapping reads, which may miss complex SVs: these variants create irregular patterns that do not fit the straightforward paths the algorithms prefer, or that disrupt the continuity of the graph.

VolcanoSV maintained superior performance on low coverage datasets

To further benchmark VolcanoSV's robustness to different sequencing coverages against the three assembly-based SV callers (PAV, SVIM-asm, and Dipcall), we evaluated SV calling results on subsampled Hifi_L1, CLR_L1, and ONT_L1. All three datasets were subsampled to 40×, 30×, 20×, 10×, and 5× coverage using rasusa (v0.6.0) [37]; Hifi_L1 and CLR_L1 were additionally subsampled to 50× coverage due to their higher original coverage.

The four tools exhibited similar subsampling effects on Hifi_L1 (Fig. 5a and Supplementary Table 6): noticeable changes in recall, precision, and F1 were observed only after the coverage dropped to 5×. For both insertions and deletions, VolcanoSV showed greater robustness to subsampling on Hifi_L1 than the other three tools, as it still preserved high F1 scores at 10× coverage. At 5× coverage, SVIM-asm retained the highest F1, followed by VolcanoSV.

Fig. 5: Subsampling effect for different methods. (a) Recall-precision-F1 curves show the subsampling effect on deletions and insertions by different tools on Hifi_L1. (b) Recall-precision-F1 curves show the subsampling effect on deletions and insertions by different tools on CLR_L1. (c) Recall-precision-F1 curves show the subsampling effect on deletions and insertions by different tools on ONT_L1. The coverage depth varies from 5×, 10×, 20×, 30×, and 40× to 50×. Solid lines with markers correspond to different coverage depths, and the corresponding dashed lines show genotyping (gt) accuracy. For both insertions and deletions, we zoom in on the top-right part of the plot to show the curves more clearly. Source data are provided as a Source Data file.
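The subsampling step can be sketched with rasusa as follows; the option names (-i/-c/-g/-s/-o) reflect rasusa 0.6-era usage and the input path and 3.1 gb genome size are placeholders.

```python
import subprocess

# Coverage ladder from the experiment; 50x applies to Hifi_L1 and CLR_L1 only.
for cov in (50, 40, 30, 20, 10, 5):
    subprocess.run([
        "rasusa",
        "-i", "Hifi_L1.fastq.gz",      # original reads (placeholder path)
        "-c", str(cov),                # target coverage
        "-g", "3.1gb",                 # approximate human genome size
        "-s", "42",                    # fixed seed for reproducibility
        "-o", f"Hifi_L1.{cov}x.fastq.gz",
    ], check=True)
```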
Two main patterns were observed in the subsampling effects on CLR_L1 and ONT_L1 (Fig. 5b, c and Supplementary Tables 7 and 8). VolcanoSV and Dipcall showed relatively stable precision across coverages but lower recall at low coverages (5–10×). PAV and SVIM-asm, on the other hand, maintained relatively better recall as coverage decreased; however, their precision declined quickly at lower coverages. Considering F1 as the metric, VolcanoSV still demonstrated greater robustness to subsampling on CLR_L1 and ONT_L1, exhibiting better F1 scores than the other three tools when coverage dropped to 10× and 5×. We also examined the subsampling effects on genotyping accuracy: genotyping performance followed a trend similar to that of the overall accuracy, and across all coverages (5–50×), VolcanoSV maintained the best genotype accuracy of the four tools.

VolcanoSV is robust to SV evaluation parameters

Due to the complex nature of structural variants, SV benchmarking tools such as Truvari usually do not require a detected SV to have exactly the same breakpoints and sequence as the true SV to be considered correct, but instead use a set of evaluation parameters to control the matching tolerance. While this matching tolerance can facilitate fair SV comparisons, the choice of evaluation parameters is usually empirical and subjective, which increases the uncertainty of the comparison outcomes and can limit our understanding of SV calling performance. A lenient threshold might understate distinctions among SV callers, while a stringent threshold could exaggerate the superiority of a particular caller. Therefore, we used a grid search experiment over Truvari's parameters to thoroughly investigate the robustness of VolcanoSV and the other assembly-based tools [38]. As described above, the evaluation parameters include pctsim (p), pctsize (P), pctovl (O), and refdist (r). In general, higher values of p, P, and O, along with lower values of r, establish more stringent comparison criteria: the SVs being compared must exhibit greater sequence similarity, allele size similarity, reciprocal overlap, or closer proximity on the reference to be classified as the same SV. More descriptions of Truvari and its parameters are provided in the Methods section. Specifically, in our grid search experiments (sketched below), we varied p, P, and O from 0 to 1 in increments of 0.1, and r from 0 to 1 kb in increments of 100 bp.

We first evaluated deletions on Hifi_L1; parameters p and O had the most significant effects. As illustrated in Fig. 6a, when p and O increased (i.e., a more stringent correspondence was required between the call and the gold standard for a true positive), all four assembly-based tools (Dipcall, SVIM-asm, PAV, and VolcanoSV) demonstrated stable and high performance across the grid search and maintained reasonable F1 scores even under the most stringent parameter settings. Among them, VolcanoSV showed the most robust performance as the evaluation parameters changed and maintained the highest F1 score across all entries. Parameters P and r had little effect on deletion evaluation for all methods unless they were set to 1.0 or 0 bp, respectively.
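A sketch of one slice of this grid search follows, sweeping the (p, O) pair used for the deletion heatmaps in Fig. 6a while leaving the remaining parameters at their defaults (Truvari 4.x flag names; paths are placeholders).

```python
import itertools
import subprocess

# Sweep p and O from 0 to 1 in steps of 0.1; the same pattern applies to
# P, and to r swept from 0 to 1000 bp in steps of 100 bp.
steps = [round(0.1 * i, 1) for i in range(11)]
for p, O in itertools.product(steps, steps):
    subprocess.run([
        "truvari", "bench",
        "-b", "giab_hg002_sv_benchmark.vcf.gz",
        "-c", "caller_output.vcf.gz",
        "-o", f"grid/del_p{p}_O{O}",        # one output directory per cell
        "--pctseq", str(p), "--pctovl", str(O),
    ], check=True)
```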
Fig. 6: F1 accuracy by tuning different evaluation parameters, and distributions of breakpoint shift and alternate allele sequence similarity for SVs on Hifi_L1. (a) Grid search heatmap of F1 values for deletions by different assembly-based tools. (b) Distribution of breakpoint shift for deletions by assembly-based tools. (c) Distribution of alternate sequence similarity for deletions by assembly-based tools. (d–f) Equivalent visual representations to (a–c) for insertions. Source data are provided as a Source Data file.

For evaluating insertions on Hifi_L1, parameters p and r were chosen as the representative pair for comparing performance across methods. Compared to deletions, insertions called by assembly-based tools were slightly more robust to changes in the evaluation parameters (Fig. 6d), except under the most stringent conditions (p = 1.0 or r = 0). SVIM-asm, PAV, and VolcanoSV maintained relatively higher performance across the grid search than Dipcall, and VolcanoSV achieved a higher F1 than SVIM-asm and PAV in every grid cell. Parameter O was not used, since it is not applicable to insertion evaluation. Parameter P had a significant effect on insertion evaluation only when set to 1.0, consistent with its effect on deletion evaluation. We also observed similar patterns in the grid searches for all methods on CLR_L1 and ONT_L1 (Supplementary Figs. 5 and 6).

To reveal why assembly-based methods like VolcanoSV are robust to changes in evaluation parameters, we further analyzed the distributions of SV breakpoint shift and alternate allele sequence similarity for several representative tools. The breakpoint shift was determined as the maximum difference in reference locations between the two compared SV calls, and the SV sequence similarity was computed from the edit distance between the two compared SV alleles. The results showed that all four tools produced near-zero breakpoint shifts and near-100% SV sequence similarity relative to the benchmark callset (Fig. 6b, c, e, f). This underscores that the ability to accurately capture SV breakpoints and alternate allele sequences, a capability shared by all assembly-based methods, contributes to a tool's resilience under rigorous SV evaluation.
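The two summary statistics underlying Fig. 6b, c, e, f can be computed as sketched below; this is a minimal re-implementation of the definitions above, not VolcanoSV's or Truvari's internal code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (adequate for SV alleles)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

def breakpoint_shift(call_start, call_end, base_start, base_end) -> int:
    """Maximum difference in reference coordinates between matched SVs."""
    return max(abs(call_start - base_start), abs(call_end - base_end))

def seq_similarity(call_alt: str, base_alt: str) -> float:
    """Edit-distance-based similarity between alternate allele sequences."""
    d = levenshtein(call_alt, base_alt)
    return 1.0 - d / max(len(call_alt), len(base_alt), 1)
```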
VolcanoSV-vc attains low false discovery rate

We employed the complete VolcanoSV pipeline to produce a diploid assembly, ultimately facilitating the detection of all types of structural variants. VolcanoSV-vc, the assembly-based SV calling component of VolcanoSV, is versatile enough to function as a standalone tool, accepting assembly inputs from other assemblers. We therefore investigated the SV calling performance of VolcanoSV-vc by taking hifiasm's assembly as input for HiFi data, and Flye + HapDup's assembly as input for CLR and ONT data. In comparison to the three other assembly-based methods (PAV, SVIM-asm, and Dipcall), VolcanoSV-vc achieved the best F1 across all 14 PacBio HiFi, CLR, and ONT datasets (Supplementary Tables 9–11 and Supplementary Fig. 7).

Although benchmarking against the GIAB SV gold standard is an efficient and precise procedure to evaluate and compare the SV calling performance of different tools, the GIAB gold standard SV callset is not complete and may itself contain false positives. We therefore used T2T-CHM13 (v2.0) as an additional reference genome and the long reads from the CHM13 sample to call SVs and evaluate the false discovery rate. Specifically, we used hifiasm (v0.16) to assemble the long reads into dual contigs and applied the different assembly-based SV callers to the contigs against both the T2T-CHM13 and GRCh38 reference genomes (v2.1.0). We did not use the VolcanoSV-asm component for the diploid assembly, since the CHM13 sample is derived from a single homozygous complete hydatidiform mole and is therefore not suitable for the haplotype-phasing module used to achieve diploid assembly. The false discovery rate was computed as the ratio of the number of SVs detected against CHM13 to the number of SVs detected against GRCh38 (see the sketch below).

Given the remarkably precise T2T-CHM13 assembly generated with multiple complementary technologies, ideally no SV should be discovered when contigs assembled from CHM13 long reads are aligned to the T2T-CHM13 reference. However, we did observe a few SV calls during variant calling, due to imperfections in the assembly and biases introduced by the aligner and variant caller. The number of false discoveries can therefore serve as an objective measure of how robust a variant caller is. We performed this experiment on two HiFi datasets from CHM13 (denoted Hifi_L6 and Hifi_L7) for the different assembly-based SV callers, and the results are summarized in Table 3. Overall, VolcanoSV achieved the lowest false discovery rate (FDR) of the four tools. Specifically, on Hifi_L6, VolcanoSV achieved the lowest baseline FDR at 3.03%, followed by Dipcall (3.05%), SVIM-asm (4.33%), and PAV (9.21%).

Table 3. Comparison of false discovery rates.

In terms of total calls against the T2T-CHM13 reference (relative to which the false discovery rate is calculated), VolcanoSV ranked second with 902 calls, behind Dipcall with 737 calls. In terms of total calls against the GRCh38 reference, VolcanoSV had the highest number (29,748), followed by SVIM-asm (26,384), Dipcall (24,187), and PAV (17,540). On Hifi_L7, VolcanoSV maintained its lead with the lowest baseline FDR at 2.22%, followed by Dipcall (2.94%), SVIM-asm (3.36%), and PAV (4.52%). For total calls against the T2T-CHM13 reference, VolcanoSV had the fewest false calls (593), followed by Dipcall (705), PAV (794), and SVIM-asm (857). In terms of total calls against the GRCh38 reference, VolcanoSV had the highest number (26,737), followed by SVIM-asm (25,480), Dipcall (23,976), and PAV (17,563).

We conducted additional analysis of the falsely discovered SVs based on the T2T-CHM13 reference. The results, illustrated in the UCSC Genome Browser (Supplementary Figs. 8 and 9) and Table 3, showed that 92.6%, 96.4%, 94.2%, and 92.7% of the false discoveries (FDs) on Hifi_L6 by VolcanoSV, PAV, SVIM-asm, and Dipcall, respectively, were located in heterochromatic regions such as centromeres and telomeres. For Hifi_L7, the percentages were lower, with 79.9%, 85.8%, 81.8%, and 82.4% of the FDs identified by VolcanoSV, PAV, SVIM-asm, and Dipcall, respectively, located in these complex regions. Notably, the majority of FDs were concentrated on chromosomes 1, 7, 9, 15, and 16. Heterochromatic regions pose challenges for assembly, often requiring specialized graph-based methods for accurate resolution.
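The FDR computation itself is a ratio of record counts, as sketched here (file names are placeholders):

```python
import gzip

def count_svs(vcf_path: str) -> int:
    """Count non-header records in a plain or bgzipped VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as fh:
        return sum(1 for line in fh if not line.startswith("#"))

# FDR = SV calls against T2T-CHM13 (expected to be SV-free)
#       / SV calls against GRCh38.
fdr = count_svs("chm13_vs_t2t.vcf.gz") / count_svs("chm13_vs_grch38.vcf.gz")
print(f"false discovery rate: {fdr:.2%}")
```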
Phased SNPs, small indels and SVs

While VolcanoSV was originally designed with a focus on detecting structural variants, its diploid assembly also provides the potential for uncovering small variants (≤50 bp) across the entire genome. We benchmarked the SNPs and small indels called by VolcanoSV against PAV and Dipcall, since both also detect small variants from the assembly. On HiFi data (Supplementary Table 12), VolcanoSV demonstrated the highest F1 scores for SNPs in four of the five datasets and excelled in small indels on all five datasets, achieving peak F1 scores of 99.42% for SNPs and 98.34% for small indels. On CLR data (Supplementary Table 13), VolcanoSV outperformed PAV and Dipcall, achieving the best F1 scores for both SNPs and small indels across all three datasets. Similarly, on ONT data (Supplementary Table 14), VolcanoSV exhibited superior performance for SNPs and small indels on all three datasets, although all three tools displayed low precision in small indel calls; it is noteworthy that assembly-based small indel detection proved ineffective for ONT data. We note that VolcanoSV includes a haplotype phasing module, enabling not only detection but also phasing of all types of variants, including SNPs, small indels, and SVs; the end result is a phased VCF file.

Computational costs

To assess the runtime and memory usage of all tools, we used three long-read datasets as representatives. All tools were tested on AMD EPYC 7452 CPUs with 32 cores and 1 TB of memory. We first investigated the computational cost of the assembly phase. For Hifi_L1 (Supplementary Table 15), VolcanoSV-asm and hifiasm finished within 1474 and 440 CPU hours, with peak memory usage reaching around 168 GB and 214 GB, respectively. For the CLR_L1 library, VolcanoSV-asm and Flye + HapDup finished within 2582 and 10,673 CPU hours, with peak memory usage of 259 GB and 691 GB, respectively. Finally, for ONT_L1, VolcanoSV-asm and Flye + HapDup finished in 4397 and 7886 CPU hours, with peak memory usage of 85 GB and 337 GB, respectively. In summary, owing to its phasing and local assembly modules, VolcanoSV-asm demonstrated lower memory consumption and runtime than Flye + HapDup, and required more runtime but less memory than hifiasm. However, assembly procedures for assembly-based tools are often far more computationally expensive than the alignment procedures used by alignment-based tools: for the same three libraries, alignment with minimap2 (v2.24-r1122) [39] consumed only 189, 118, and 116 CPU hours, with peak memory usage of approximately 31 GB, 29 GB, and 31 GB, respectively (see the sketch below).

Second, we benchmarked the computational costs of the assembly-based variant calling phase. For Hifi_L1 (Supplementary Table 16), VolcanoSV-vc, PAV, SVIM-asm, and Dipcall finished within 21, 97, 17, and 11 CPU hours, with peak memory of 21 GB, 101 GB, 37 GB, and 48 GB, respectively. For CLR_L1, VolcanoSV-vc, PAV, SVIM-asm, and Dipcall finished within 123, 1098, 7, and 10 CPU hours, with peak memory of 215 GB, 67 GB, 26 GB, and 42 GB, respectively. For ONT_L1, VolcanoSV-vc, PAV, SVIM-asm, and Dipcall finished within 32, 133, 8, and 8 CPU hours, with peak memory of 34 GB, 53 GB, 34 GB, and 47 GB, respectively. In summary, due to its variant filtering and refinement module, VolcanoSV-vc consumed less memory but more runtime than SVIM-asm and Dipcall on HiFi and ONT data; on CLR data, however, VolcanoSV-vc exhibited substantial memory consumption. PAV consistently required significantly more runtime than the other assembly-based tools. Compared to state-of-the-art alignment-based tools [17,18,19,38], VolcanoSV-vc shows no significant difference in computational cost.
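For reference, the alignment baseline used in the cost comparison above can be sketched as a standard minimap2-plus-samtools pipeline; the preset depends on the read type (map-hifi for HiFi, map-pb for CLR, map-ont for ONT), and paths and thread counts are placeholders.

```python
import subprocess

# Align reads and sort the output into a coordinate-sorted BAM.
mm2 = subprocess.Popen(
    ["minimap2", "-t", "32", "-ax", "map-hifi",
     "GRCh38.fa", "Hifi_L1.fastq.gz"],
    stdout=subprocess.PIPE,
)
subprocess.run(["samtools", "sort", "-@", "8", "-o", "Hifi_L1.bam"],
               stdin=mm2.stdout, check=True)
mm2.stdout.close()
mm2.wait()
```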
Selecting assemblers for regions enriched in segmental duplications

We selected hifiasm and Flye as the default assemblers for the various long-read datasets after evaluating several upstream long-read assemblers, including hifiasm [26], Flye [27], Peregrine [40], wtdbg2 [41], IPA [42], HiCanu [43], and Shasta [44]. Our assessment considered their impact on VolcanoSV's downstream large indel SV calling performance. These two assemblers consistently demonstrated superior performance on PacBio HiFi data (hifiasm) and on CLR and ONT datasets (Flye), respectively. However, the assembly module of VolcanoSV gives users the flexibility to choose the most appropriate or newer assemblers for their specific requirements. For example, when assembling regions enriched in segmental duplications (SDs), these two assemblers may not be the most suitable choice. To investigate the performance of different assemblers in SD-enriched regions, we employed Flagger (v0.3.3) [45] to detect SD-enriched misassemblies in three representative libraries. Flagger is a read-based pipeline that maps long reads to the phased diploid assembly in a haplotype-aware manner and identifies coverage inconsistencies within these read mappings that are likely due to assembly errors. Details are provided in the Methods section.

For ONT_L1, we additionally employed wtdbg2 (v2.2.5), Shasta (v0.10.0), and NextDenovo (v2.5.2) [46] in VolcanoSV to perform local assembly for all phase blocks. Using Flagger, we first identified collapsed components in the diploid assemblies produced by VolcanoSV with the different assemblers. These collapsed components represent regions with additional copies in the underlying genome that have been collapsed into a single copy, indicating potential SD misassemblies. We then further assessed the SD relevance of the Flagger-annotated collapsed components: we aligned the collapsed components from all diploid assemblies to the GRCh38 reference genome, intersected them with the SD annotations for HG002 based on GRCh38, and calculated and compared the total length of SD annotation regions overlapping collapsed regions across all assemblies (a sketch of this intersection follows below). The results showed that VolcanoSV assemblies using Flye, wtdbg2, Shasta, and NextDenovo generated 34.9 Mb, 47.6 Mb, 34.2 Mb, and 22.7 Mb of collapsed components, respectively, suggesting collapsed SDs.

For CLR_L1, we additionally employed wtdbg2 and NextDenovo in VolcanoSV to perform local assembly for all phase blocks; the VolcanoSV assemblies using Flye, wtdbg2, and NextDenovo generated 28.8 Mb, 32 Mb, and 30.2 Mb of collapsed components, respectively. For Hifi_L1, we additionally employed HiCanu (v2.1.1) in VolcanoSV to perform local assembly for all phase blocks; the VolcanoSV assemblies using hifiasm and HiCanu generated 17.4 Mb and 15.7 Mb of collapsed components, respectively.

These results suggest that NextDenovo in VolcanoSV assembled fewer collapsed regions than Flye on ONT_L1, indicating its potential superiority for regions enriched in SDs. Similarly, HiCanu in VolcanoSV outperformed hifiasm on Hifi_L1. Notably, Flye in VolcanoSV assembled the fewest collapsed regions on CLR_L1 compared to the other assemblers.
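The SD-overlap bookkeeping can be sketched as a simple interval intersection; this is a minimal sketch assuming per-chromosome, non-overlapping (start, end) intervals in GRCh38 coordinates (as from a BED export), and in practice a tool such as bedtools intersect would serve the same purpose.

```python
def total_overlap(collapsed: dict, sd_regions: dict) -> int:
    """Total length (bp) of SD annotation covered by Flagger 'collapsed'
    components, with both interval sets projected onto GRCh38.
    Inputs map chromosome -> list of (start, end) half-open intervals."""
    total = 0
    for chrom, regions in collapsed.items():
        sds = sorted(sd_regions.get(chrom, []))
        for start, end in sorted(regions):
            for s, e in sds:
                if s >= end:          # SDs are sorted; nothing further overlaps
                    break
                total += max(0, min(end, e) - max(start, s))
    return total

# Example with hypothetical coordinates:
collapsed = {"chr1": [(100, 500)]}
sds = {"chr1": [(300, 1000)]}
print(total_overlap(collapsed, sds))  # 200
```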
