ProcaryaSV: structural variation detection pipeline for bacterial genomes using short-read sequencing

Datasets

We created four artificial datasets to benchmark our pipeline. The first dataset is used to establish the optimal minCallers threshold values (minCallers dataset). The second is used to validate these values and to benchmark performance against other tools and pipelines (SV dataset). The third has already been used in the past to benchmark CNV detection in bacterial genomes (CNV dataset) [12, 13]. Lastly, we evaluated the impact of GC content on detection. For an overview, see the Supplementary Table.

The first two datasets benchmark each SV type separately. We randomly defined 100 SVs of each type: deletion, duplication, inversion, and insertion (400 SVs combined). The lengths of the SVs ranged from 50 to 10,000 bp. We used SVsim to generate these SVs [34]. Because SVsim simulates insertions by taking them from the other chromosomes in the input FASTA file, we had to use an in-house script to create the artificial insertions. We benchmarked the following coverage values: 5×, 10×, 20×, 50×, and 100×.

The minCallers dataset uses Escherichia coli str. K-12 substr. MDS42 (GenBank: AP012306.1) as the source of sequencing reads. We benchmarked insertions taken from a different bacterial species and from another strain of E. coli to see how detection performance depends on the origin of the inserted sequence (see the Supplementary Table for sequence details). The read length was 150 bp.

The SV dataset was created in the K. pneumoniae subsp. pneumoniae NTUH-K2044 genome (GenBank: NC_012731), and the artificial insertions were taken from the S. aureus subsp. aureus USA300_FPR3757 genome (GenBank: NC_007793.1). The read length was 75 bp.

The third, CNV dataset consists of 30 artificial deletions and duplications of various short lengths and copy numbers introduced into the genome of Staphylococcus aureus subsp. aureus TW20 (GenBank: NC_017331). We benchmarked five coverage values: 5×, 10×, 20×, 100×, and 200×. The read length was 75 bp. Artificial reads for all datasets were generated with art-sim [35].

The real-samples dataset contains sequencing data for bacterial isolates from three projects, which are listed in the Supplementary Table [36,37,38]. Together, these projects comprise 190 bacterial isolate samples sequenced with short reads at average coverages of 60× to 350×.

Benchmarking on artificial data addresses several questions. First, effective minCallers values are established based on the performance metrics. Second, we measured the performance of ProcaryaSV's merging algorithm against the SURVIVOR merging algorithm. Third, the performance of the whole pipeline was compared to that of the Parliament2 pipeline. Lastly, we assessed how merging improves detection compared to the individual callers.

Defining optimal minCallers values

We employed the first dataset to obtain ideal values of the minimal consensus threshold, called minCallers, in our pipeline. The minCallers values are defined independently for CNVs, inversions, and insertions.

We assessed the optimal value of the minCallers parameter using precision and recall (see the Supplementary Table). We ran the pipeline multiple times with different minCallers values and then calculated performance scores for all minCallers and coverage levels.

For CNVs, defining an exact minCallers value is not straightforward. Values from 2 to 4 appear optimal for coverage above 20×; for lower coverage, setting minCallers to 2 or 3 is optimal.
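To make the evaluation step concrete, the following is a minimal, self-contained sketch of how precision, recall, and the F1 score can be computed for each minCallers setting. The truth set, the merged calls per threshold, and the 50% reciprocal-overlap matching rule are illustrative assumptions, not our exact evaluation code.

```python
# A minimal sketch of scoring merged SV calls against ground truth for a sweep
# of minCallers values. All inputs (truth, calls_by_threshold) are hypothetical.

def overlaps(a, b, min_frac=0.5):
    """True if intervals a and b reciprocally overlap by at least min_frac."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    inter = max(0, end - start)
    return inter >= min_frac * (a[1] - a[0]) and inter >= min_frac * (b[1] - b[0])

def score(truth, calls):
    """Compute precision, recall, and F1 for one set of merged calls."""
    tp = sum(any(overlaps(t, c) for c in calls) for t in truth)
    fp = sum(not any(overlaps(c, t) for t in truth) for c in calls)
    fn = len(truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: truth deletions and merged calls per minCallers value.
truth = [(1000, 2500), (40000, 41200), (90000, 99000)]
calls_by_threshold = {
    1: [(990, 2510), (40010, 41150), (90100, 98800), (120000, 120400)],
    2: [(990, 2510), (40010, 41150), (90100, 98800)],
    3: [(990, 2510), (90100, 98800)],
}

for min_callers, calls in calls_by_threshold.items():
    p, r, f1 = score(truth, calls)
    print(f"minCallers={min_callers}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```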
We repeated the analysis of optimal minCallers on the CNV dataset, which includes very small CNVs. Here, we discovered that minCallers set to 2 achieved the highest accuracy and F1 scores across all coverage levels. Many false positives were produced with minCallers set to 1, caused by Pindel and CNproScan, and were eliminated by increasing the value to 2. On the other hand, Pindel and CNproScan were able to detect the shortest CNVs. Observing the precision-recall curves (see the Supplementary Table), we conclude that the optimal value of minCallers is 2 for a broad range of CNV lengths, including small ones. For generally longer CNVs, which are easier to detect, minCallers can be set to 3 or 4.

For inversions, the optimal minCallers value is 2. As inversions can be detected by at most three of the tools, the choice of 2 was straightforward. Inversion detection is also not coverage-dependent.

Insertions are called by INSurVeyor and three other tools, with INSurVeyor being the most effective. Because of the elevated weight assigned to INSurVeyor, the values 1 and 2 perform similarly. Furthermore, we compared insertion detection performance for distant and more similar inserted sequences: insertions of taxonomically close origin (a different strain of the same species in this case) are more challenging to detect.

Artificial CNV dataset

We used the CNV dataset to evaluate the performance of ProcaryaSV in detecting small CNVs and to estimate the optimal value of the minCallers parameter for small CNV detection (see the previous section). The complete results and plots are provided in the Supplementary Table.

Second, we compared ProcaryaSV's merging algorithm with the SURVIVOR merging algorithm. SURVIVOR was chosen for its easy integration and usability with the selected callers. We additionally tested the SVDB merging tool [39]; however, its output was not reliable for our use because it inserts modified sequences into the VCF file. The SURVIVOR merge settings were a minimum of 2 supporting callers, a maximum allowed breakpoint distance of 1000 bp, and a minimum considered SV length of 1 bp; the minCallers parameter of ProcaryaSV was likewise set to 2 (a sketch of the corresponding invocation is shown below). The two methods achieved similar results, as shown in Table 2, where bold values mark the best scores. ProcaryaSV had higher accuracy and F1 scores by a few percent at 20× and higher coverage. Overall, the results are comparable, as expected, and reflect the congruency between the two merging methods.

Table 2 Results of the artificial CNV dataset (ProcaryaSV minCallers 2)
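For reference, the following is a minimal sketch of a SURVIVOR merge invocation matching the settings quoted above (maximum breakpoint distance 1000 bp, at least 2 supporting callers, minimum SV length 1 bp). The file names and the type/strand/distance-estimation flags are assumptions, not the exact command used in our benchmark.

```python
# A sketch of merging per-caller VCFs with SURVIVOR using the settings quoted
# above. File names and the type/strand/estimation flags are assumptions.
import subprocess

# vcf_list.txt (hypothetical) contains one per-caller VCF path per line.
subprocess.run(
    [
        "SURVIVOR", "merge",
        "vcf_list.txt",  # list of input VCF files, one path per line
        "1000",          # maximum allowed distance between breakpoints (bp)
        "2",             # minimum number of supporting callers
        "1",             # take SV type into account (assumed)
        "1",             # take strand into account (assumed)
        "0",             # do not estimate distance from SV size (assumed)
        "1",             # minimum SV length considered (bp)
        "merged.vcf",    # output VCF
    ],
    check=True,
)
```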
Third, we compared the performance against the Parliament2 pipeline. We used Parliament2 with Breakdancer [40], CNVnator [19], DELLY2 [21], Manta [41], and LUMPY [20]; merging in Parliament2 is performed natively with SURVIVOR [30]. The results are also listed in Table 2. The first observation is that Parliament2's results are more coverage-dependent. Nevertheless, we achieved higher scores except for precision and specificity at 100× and 200× coverage.

Finally, we analyzed tool redundancy via UpSet plots [42]; these plots are in the Supplementary Table. Considering only CNVs, CNVnator detected the fewest CNVs, and none were detected uniquely by it. On the other hand, it participated in the detection of some low-coverage events.

We also evaluated the individual callers separately to see how SV merging improved overall detection (see the Supplementary Table). Regarding F1 scores, DELLY2 and LUMPY are the best-performing tools across all coverage levels. In terms of sensitivity, CNproScan detected the highest number of true positives. Generally, merging brought performance benefits for small-sized CNVs.

Artificial SV dataset

We also benchmarked performance on the validation SV dataset and again compared it with SURVIVOR, Parliament2, and the individual tools. We set the minCallers threshold for all SV types to 2 but also included a value of 3 for CNVs. The F1 scores of the competing methods are shown in Table 3. For the remaining metrics and plots, see the Supplementary Table.

Table 3 F1 scores of the artificial SV dataset (ProcaryaSV minCallers_CNV = 2, minCallers_INV = 2, minCallers_INS = 2)

The performance for deletions and duplications was stable across different coverage values; the detection of large CNVs is not as coverage-dependent as that of short CNVs. The inversion results were also stable, with a small decrease toward high coverage caused by five inversions misclassified as duplications.

The most challenging part was the detection of insertions, which is attributable to the nature of short-read sequencing. In the evaluation, we extended the exact breakpoint boundaries by 50 bp so that we could match them with the detected insertions. Of the four tools used to detect insertions, three (Pindel, LUMPY, DELLY2) can detect only short insertions via split-read alignment, which requires that the insertion fit within a single read length [20,21,22]. Furthermore, detection is difficult when the inserted sequence is similar to regions in the reference genome. INSurVeyor uses read-pair and de novo assembly methods to detect a larger scope of insertions [23] and is responsible for a major boost in insertion detection over the competitors.

As previously described, the merging results are comparable between the two implemented algorithms, ProcaryaSV's and SURVIVOR's. The differences are small, favoring ProcaryaSV's method by a few points.

ProcaryaSV slightly outperformed Parliament2 for deletions and substantially for insertions. Parliament2 was slightly better for inversions, and both pipelines performed similarly for duplications. The performance of Parliament2 matched that of ProcaryaSV when minCallers was set to 2.

As in the CNV dataset, DELLY2 and LUMPY performed very well in detecting deletions, duplications, and inversions. Unlike in the CNV dataset with its short CNVs and indels, the performance of these two individual tools is comparable to the merging approach. However, inspecting the results, we noticed that individual tools tend to call multiple shorter events along the original long SV. All of these count as true positives in our evaluation (as they overlap the defined intervals), but the merging method overcomes this drawback of individual callers by combining them into one continuous event, as sketched below.
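The following is a minimal sketch of this merging idea: fragmented calls that overlap, or lie within a small gap of one another, are combined into a single continuous event. The 100 bp gap tolerance and the input data are illustrative; ProcaryaSV's actual algorithm additionally weighs caller support and SV types.

```python
# A minimal sketch of combining fragmented, overlapping calls into one
# continuous event. The 100 bp gap tolerance and input data are illustrative.

def merge_fragments(calls, max_gap=100):
    """Merge (start, end) intervals that overlap or lie within max_gap bp."""
    merged = []
    for start, end in sorted(calls):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the previous event instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical example: one long deletion reported as three shorter pieces.
fragments = [(10000, 13200), (13250, 16900), (16950, 20000)]
print(merge_fragments(fragments))  # -> [(10000, 20000)]
```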
Finally, we examined the contribution of each tool via UpSet plots of the true-positive SVs (see the Supplementary Table). Unlike for short CNVs, the least well-performing tool for CNV detection here was Pindel; the majority of events were detected by the other tools. Pindel was more useful for detecting inversions, although a large share of the detected inversions were also called by other tools. In contrast, Pindel is indispensable for insertion detection: most insertions were detected by INSurVeyor, seconded by Pindel.

GC content impact on detection

Lastly, we evaluated the impact of GC content on detection. We benchmarked simulated CNV datasets for three different bacteria representing low, middle, and high GC content: Staphylococcus aureus (GC 33%), Klebsiella pneumoniae (GC 57%), and Anaeromyxobacter dehalogenans (GC 74%). Details about the sequences and dataset creation are in the Supplementary Table; the same recipe as for the SV dataset was used.

The detection results correspond to the previous ones: the lowest and highest coverage levels were slightly more challenging. However, we found no performance impact associated with different GC content, as can be verified in the Supplementary Table.

Real dataset

Benchmarking on real data was performed to assess the usability of the pipeline in practice. Since no a priori defined SVs are known, the space for evaluation is limited: we can assess the overlap between tools and the features of the detected SVs depending on the SV class and the caller. The results are provided in the Supplementary Table. All samples of the same species were pooled together in the final analysis.

All SV types were detected in the K. pneumoniae samples. The majority of inversions were called by DELLY2 and Pindel, unlike the combination of DELLY2 and LUMPY in the artificial dataset. The insertion results mirrored the artificial ones, with DELLY2 detecting a significant portion of the insertions. Interestingly, only two duplications were detected by all five tools, whereas the number of detected deletions was much greater. The L. casei samples had the lowest number of SVs, with no insertions detected. In contrast, S. aureus had high numbers of SVs called by multiple tools: 60 insertions were called by both Pindel and INSurVeyor, and the number of inversions called by three callers was also high.

Although we cannot assess the accuracy of detection on the real dataset, it is interesting to see the differences compared to the artificial datasets. While DELLY2 and LUMPY performed well as individual CNV and inversion callers on the artificial SV dataset, here they each detected a distinct set of SVs. This is where the benefit of the merging approach becomes fully apparent.

Computational performance of ProcaryaSV

Analyzing bacterial genomes, which are only megabases in length, is not a computationally demanding task on a modern PC. The pipeline was tested on a 12-thread CPU with 64 GB of RAM. The run times reported by the Snakemake workflow are in the Supplementary Table. RAM usage generally did not exceed 10 GB when 12 threads were used.
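For completeness, here is a minimal sketch of launching a Snakemake workflow with the thread count used in our tests. The exact ProcaryaSV entry point and flags may differ, so this is an assumption; consult the repository documentation for the actual invocation.

```python
# A sketch (hypothetical flags; see the ProcaryaSV repository for the actual
# invocation) of running the Snakemake workflow with 12 threads.
import subprocess

subprocess.run(
    ["snakemake", "--cores", "12", "--use-conda"],  # 12 threads, conda envs
    check=True,  # raise CalledProcessError if the workflow fails
)
```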
