HyLight: Strain aware assembly of low coverage metagenomes

In the following, we first briefly discuss the workflow. Details that are necessary to fully reproduce the workflow from a conceptual point of view are provided in Methods. Subsequently, we present experiments on both simulated and real data, which provide evidence for the benefits of our approach, as listed towards the end of the Introduction.

HyLight’s major innovation lies in the construction of a strain-resolved overlap graph (OG) as input for assembling long reads, correcting contigs, and clustering and assembling short reads, ultimately achieving strain-aware assembly.

Workflow
Please see Fig. 1 for a schematic of the workflow. The workflow proceeds along two axes: one for assembling the long reads (see left branch in Fig. 1) and one for assembling the short reads (right branch in Fig. 1). Assemblies of long and short reads are merged in a final step (see bottom of Fig. 1).

Fig. 1: Workflow of HyLight.
The input data consists of two fastq files, one containing long reads and one containing short reads. The output is a fasta file containing the assembled contigs. The overall procedure can be divided into three primary steps. First, a strain-resolved OG is constructed to assemble the long reads. Subsequently, another OG is established to assemble the short reads. Lastly, a contig OG is established to extend the contigs obtained from the long and short read assemblies, culminating in the final master contigs.

In brief, HyLight proceeds as follows. First, it corrects long reads using short reads, which turns the raw TGS reads into polished, error-free long reads. The resulting polished long reads are then the basis for constructing a strain-resolved overlap graph that is further polished by removing remaining errors. This section offers only a high-level description; for detailed descriptions of all methodological steps involved, please refer to the “Methods” section.
As outlined previously, HyLight comprises three main modules.

First module: long read assembly
The main purpose of this axis is to compute strain-aware, error-free contigs from the long reads.

1. Long reads are corrected using short reads using FM-index and de Bruijn graph based techniques, as implemented in FMLRC249, which has been shown to outperform other methods in recent benchmark studies50,51.

2. An overlap graph is constructed from the corrected long reads. We make use of the (widely popular) Minimap252 to compute the necessary overlaps.

3. We identify overlaps that connect long reads from different strains by inspecting SNP patterns, and we remove the corresponding edges from the overlap graph. See Fig. 2 for an illustration. The result is an overlap graph of the long reads that consists of connected components, each of which contains long reads from only one particular strain. So, each connected component in the overlap graph now reflects a collection of reads drawn from one haploid genome.

Fig. 2: Assembly of long reads.
Long reads are pre-corrected using short reads. The distinct colors of the reads indicate their respective strain origins. The objective of this workflow is to leverage SNP information to filter out incorrect overlaps, selectively retaining overlaps between reads originating from the same strain. This enables strain-aware assembly to be performed effectively.

4. We assemble the long reads based on the resulting strain-aware overlap graph using Miniasm53, a long read assembler designed to assemble haploid genomes from long reads. The result is a set of contigs, each of which stems from one particular strain.

5. We re-align the long reads against the resulting strain-aware contigs.

6. Based on the re-alignment, we establish a second, improved version of an overlap graph for the long reads, which now reflects a strain-aware overlap graph of the long reads.

7. Using this improved, strain-aware overlap graph, we remove errors that have remained in the long reads using Racon54.
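Steps 2 and 3 of this module can be illustrated with a minimal Python sketch. It is not HyLight's implementation: the reads are toy strings that are assumed to be already error-corrected, overlaps are supplied as hypothetical tuples `(read_a, read_b, start_a, start_b, length)` instead of being computed by Minimap2, and the rule "drop an overlap that shows more than `max_conflicts` mismatches" is an assumed stand-in for the SNP-pattern criterion detailed in Methods.

```python
from collections import defaultdict

def snp_conflicts(read_a, read_b, start_a, start_b, length):
    """Count mismatching positions inside the overlapping window of two reads."""
    window_a = read_a[start_a:start_a + length]
    window_b = read_b[start_b:start_b + length]
    return sum(1 for x, y in zip(window_a, window_b) if x != y)

def strain_components(reads, overlaps, max_conflicts=0):
    """Keep only overlaps without conflicts, then return connected components."""
    adjacency = defaultdict(set)
    for a, b, start_a, start_b, length in overlaps:
        if snp_conflicts(reads[a], reads[b], start_a, start_b, length) <= max_conflicts:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for read_id in reads:
        if read_id in seen:
            continue
        stack, component = [read_id], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components

# Two hypothetical strains that differ by a single SNP at position 4.
strain_a = "ACGTACGTACGG"
strain_b = "ACGTTCGTACGG"
reads = {
    "a1": strain_a[0:10], "a2": strain_a[2:12],
    "b1": strain_b[0:10], "b2": strain_b[2:12],
}
# Hypothetical overlaps: (read_a, read_b, start_a, start_b, overlap_length).
overlaps = [
    ("a1", "a2", 2, 0, 8), ("b1", "b2", 2, 0, 8),    # same strain
    ("a1", "b1", 0, 0, 10), ("a1", "b2", 2, 0, 8),   # cross strain
    ("a2", "b1", 0, 2, 8), ("a2", "b2", 0, 0, 10),   # cross strain
]
components = strain_components(reads, overlaps)
```

In this toy example the cross-strain overlaps all span the SNP and are removed, so the four reads fall into two components, one per strain. An overlap that does not span any distinguishing SNP would be kept by this simplified rule.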

As for step 7, note that Racon has not been designed to operate in a strain-aware manner. If one fed the original, strain-unaware overlap graph to Racon, it would “overcorrect” contigs by mistaking true, strain-specific variation for errors and eliminating it. This would mask strain-specific variation and hence prevent the reconstruction of genuine strain-specific sequence. Applying Racon only to strain-aware overlap graphs can therefore be considered an insight that is crucial for computing long read based assemblies that are both error-free and strain-aware.

Second module: short read assembly
We recall that it is a general objective to establish a workflow that caters to low coverage of long reads to avoid unnecessary and possibly unaffordable costs. Therefore, the main purpose of this axis is to assemble the (likely high coverage, because cheap) short reads in their own right, and to use the assemblies to fill gaps in the long read based assembly, or even to identify additional strains from the resulting contigs.

1. We align the short reads against the strain-aware, error-free contigs, as the output of the long read axis (first module), using Minimap252.

2. The alignment of short reads with long read contigs gives rise to an overlap graph of the short reads.

3. Analogously to the long read axis, we inspect SNP patterns in the overlaps of the short reads. Based on the SNP patterns, we identify overlaps that connect two short reads from different strains. See Fig. 3 for an illustration.

Fig. 3: Assembly of short reads.
The distinct colors of the reads indicate their respective strain origins. The primary procedure consists of aligning the short reads to the contigs, establishing a strain-resolved OG, and then excluding short reads that align to regions already assembled into contigs. Subsequently, an OG is constructed to assemble the remaining short reads and reconstruct strains or regions that were not initially assembled.

4. As a result, we are now able to identify short reads whose SNP patterns contradict their initial alignment with the long read contigs. The insight is that breaking up overlaps between short reads, all of which align with the same long read contig, leads to several classes of short reads. Only one of these classes truly agrees with the respective long read contig (yellow short reads in Fig. 3).

5. We conclude that short reads that no longer overlap with short reads whose SNP patterns truly match those of their long read contig do not stem from the same strain as the long read contig against which they initially aligned (blue short reads in Fig. 3).

6. Further, we collect all short reads that did not align with any of the long read contigs (gray short reads in Fig. 3).

7. We discard all short reads whose alignments indicated full agreement with a long read contig (yellow in Fig. 3).

8. Subsequently, using StrainXpress12 (which, as discussed in the Introduction, specializes in the strain-aware assembly of short reads using OGs), we assemble all short reads whose alignments were not in full agreement with their long read contigs (blue in Fig. 3) or which had remained entirely unaligned with long read contigs (gray in Fig. 3). See the bottom of Fig. 3 for an illustration of the resulting assemblies.
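The three read classes used in steps 4 to 8 (yellow, blue and gray in Fig. 3) can be sketched as follows. This is a deliberately simplified illustration, not the procedure from Methods: instead of analysing connectivity in the short read overlap graph, it compares the alleles a read shows at SNP sites directly with those of the contig it aligned to. The input layout, the function names and the toy data are assumptions made for this example.

```python
def classify_short_reads(alignments, contig_alleles):
    """Split short reads into matching, conflicting and unaligned classes.

    alignments: read_id -> (contig_id or None, {snp_position: observed_base})
    contig_alleles: contig_id -> {snp_position: contig_base}
    """
    classes = {"matching": [], "conflicting": [], "unaligned": []}
    for read_id, (contig_id, observed) in alignments.items():
        if contig_id is None:
            classes["unaligned"].append(read_id)      # gray in Fig. 3
            continue
        expected = contig_alleles[contig_id]
        # Sites unknown for the contig are ignored (treated as agreeing).
        agrees = all(expected.get(pos, base) == base for pos, base in observed.items())
        classes["matching" if agrees else "conflicting"].append(read_id)
    return classes

def reads_for_reassembly(classes):
    """Reads passed on to the short read assembler: conflicting plus unaligned."""
    return classes["conflicting"] + classes["unaligned"]

# Hypothetical toy input: one contig with two SNP sites, four short reads.
contig_alleles = {"c1": {10: "A", 25: "G"}}
alignments = {
    "s1": ("c1", {10: "A"}),            # agrees with the contig
    "s2": ("c1", {10: "T", 25: "G"}),   # contradicts the contig at site 10
    "s3": (None, {}),                   # did not align to any contig
    "s4": ("c1", {25: "G"}),            # agrees with the contig
}
classes = classify_short_reads(alignments, contig_alleles)
remaining = reads_for_reassembly(classes)
```

Here `s1` and `s4` would be discarded as already represented by the long read contig, while `s2` and `s3` are kept for the subsequent short read assembly.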

As per the properties of StrainXpress, the results are strain-aware short read based contigs. As per the protocol we follow, all of these contigs refer to strain-aware metagenomic sequence not captured / spanned by any of the long read contigs from the first module.

Third module: merging long and short read assemblies
The purpose of this final module is to compute a unifying assembly that is as comprehensive as possible, as the final output of our approach.

1. One collects both long and short read contigs, as the output of the first and second module, and computes overlaps between them to establish an encompassing strain-aware overlap graph.

2. One identifies nodes in the OG through which only one particular path passes (“simple path”).

3. One extends contigs along the identified “simple paths”. The resulting extended contigs are the final output.
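The notion of a “simple path” in steps 2 and 3 can be made concrete with a short Python sketch. It interprets a simple path as a maximal non-branching chain in a directed contig overlap graph, which is an assumption about the definition given in Methods; the edge list and the function name are hypothetical, and the actual merging of contig sequences along a path is left out.

```python
def simple_paths(edges):
    """Return maximal non-branching paths of a directed overlap graph.

    edges: iterable of (source_contig, target_contig) pairs.
    Cycles that consist only of non-branching nodes are not reported.
    """
    successors, predecessors, nodes = {}, {}, set()
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        predecessors.setdefault(dst, []).append(src)
        nodes.update((src, dst))

    def unique_next(node):
        nxt = successors.get(node, [])
        if len(nxt) == 1 and len(predecessors.get(nxt[0], [])) == 1:
            return nxt[0]
        return None

    def unique_prev(node):
        prev = predecessors.get(node, [])
        if len(prev) == 1 and len(successors.get(prev[0], [])) == 1:
            return prev[0]
        return None

    paths, visited = [], set()
    for node in sorted(nodes):
        # Start only at nodes that cannot be extended backwards unambiguously.
        if node in visited or unique_prev(node) is not None:
            continue
        path = [node]
        visited.add(node)
        while (nxt := unique_next(path[-1])) is not None and nxt not in visited:
            path.append(nxt)
            visited.add(nxt)
        paths.append(path)
    return paths

# Hypothetical contig overlap graph: c1 -> c2 -> c3, then c3 branches to c4 and c5.
edges = [("c1", "c2"), ("c2", "c3"), ("c3", "c4"), ("c3", "c5")]
paths = simple_paths(edges)
```

In this toy graph, `c1`, `c2` and `c3` form one path that could be merged into a single extended contig, whereas the branch at `c3` stops the extension, so `c4` and `c5` remain separate.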

Data & experimental setup
The experiments we discuss in the following refer to both simulated and real data.

The synthetic data sets reflect scenarios of different levels of complexity in terms of strain content. In particular, we deal with data sets reflecting 3 Salmonella strains, and further 20 (low complexity), 100 (medium complexity) and 210 (high complexity) strains from various bacterial species. Strains were retrieved from DESMAN55; see Supplementary Data for full information on the strain composition of the data sets. All data sets were simulated using CAMISIM56, which reflects a state-of-the-art and widely popular choice for generating metagenome sequencing data sets. Further, we also consider 6 “strain-mixing spike-in” data sets, which reflect spiking simulated reads from Salmonella strains into real data. This creates a real data scenario for which ground truth (in the form of simulated reads) is available. See the “Methods” section for full technical details.

The real data comprise two microbial communities that reflect the current standard in terms of available real data, with both TGS and NGS reads and known ground truth. Both Bmock12 (a bacterial mock community) and NWC (a natural whey culture data set) have already been widely used in the evaluation of metagenome assembly approaches12,13,51,57,58. For both data sets, reference genomes, Illumina, PacBio CLR and ONT reads are readily available. See again the “Methods” section for full technical details.

To compare the hybrid strategy with assemblies generated solely from high-quality HiFi reads, we created a mock community by mixing real sequencing data from three yeast strains. The three sequencing data sets were originally intended for evaluating different sequencing data platforms and assembly methods59. Consequently, these data sets contain PacBio HiFi reads, ONT reads, and NGS reads.
In this study, they serve as an excellent basis for assessing the performance of HyLight, which runs on NGS and ONT reads, in comparison to Hifiasm-meta60 and metaMDBG61, which solely utilize PacBio HiFi reads for assembly.

Benchmarked approaches
As discussed in the Introduction, the state of the art in strain-aware metagenome assembly is represented by Strainberry, as the leading approach that deals with only TGS data, and StrainXpress, as the leading approach that deals with only NGS data. Hybrid metagenome assembly approaches that address strain awareness have not been presented before; we therefore consider all state-of-the-art hybrid approaches to metagenome assembly that operate at the species level. These are HybridSPAdes25, MetaPlatanus28, Unicycler26 and Opera-MS27. Last, we also compare HyLight with Hifiasm-meta and metaMDBG to assess differences in quality between HyLight’s hybrid assemblies and PacBio HiFi only based assemblies, as generated by the current state-of-the-art assemblers.

Note on metrics
In the following, we evaluate the performance in terms of metrics that are routinely computed by MetaQUAST V5.1.0rc162 and Merqury63. For MetaQUAST, we particularly focus on “Genome Fraction” (GF), as a metric that refers to strain awareness (GF = 100.0 translates into full strain awareness), NGA50, as a metric deemed sufficiently reliable to measure contig contiguity, and (mismatch / indel) error rates as well as misassembled contig fraction (MC) to evaluate the quality of the contigs. Among the evaluation metrics of Merqury, the focus is primarily on “Completeness”, which, similar to MetaQUAST’s Genome Fraction, reflects the proportion of the genome covered by the assembled contigs. Additionally, we pay attention to the error rates reported by Merqury. Importantly, note that Merqury, as a k-mer based tool, introduces particular biases in its evaluation; it was noted earlier that it favors k-mer based tools64.
Also, the corresponding statistics appear statistically uncertain here; evaluating experiments without a ground truth comes at a price, which is not surprising. See “Methods” for full details on MetaQUAST and Merqury.

Note on classification of approaches
We recall that the state-of-the-art hybrid assembly approaches primarily target the accuracy and the length of the assemblies, but do not address strain awareness. Strainberry and StrainXpress primarily target strain awareness. As a trade-off, they suffer from more erroneous (Strainberry) or shorter (StrainXpress) assemblies due to the type of data they use as their input: Strainberry and StrainXpress only use TGS or NGS data, respectively. HyLight is the sole approach that addresses accuracy, contiguity, and strain awareness at the same time.

Because of the different primary goals of the approaches, we would like to avoid comparing prior hybrid assembly approaches with prior non-hybrid approaches that focus on strain awareness. Therefore, in the following, we first compare HyLight with the prior hybrid assembly approaches, and, subsequently, in separate paragraphs, present a comparison of HyLight with Strainberry and StrainXpress.

Misassembled contig rate of strain aware assemblers
The results of HyLight, Strainberry and StrainXpress with respect to misassembled contig rate (MC in Tables 1 & 2) remain very consistent across all data sets. To avoid redundancies when comparing HyLight with Strainberry and StrainXpress in terms of misassembled contig rate, we will not go into detail with respect to each of the data sets we run experiments on. As a general trend, which applies without exception on all data sets, HyLight and StrainXpress considerably outperform Strainberry. For (the solely NGS based) StrainXpress, this can certainly be attributed to the reduced length of the contigs.
For Strainberry, this can be attributed to being based solely on TGS data, which rules out detecting misassemblies with the help of accurate auxiliary NGS data.

Table 1 Benchmark results for assembling simulated PacBio CLR reads

Table 2 Benchmark results for assembling real reads

Experiments: synthetic data sets

3 Salmonella
This data set contains simulated reads from three distinct strains of Salmonella. The average coverage for Illumina (NGS) and PacBio (TGS) reads is 20X and 10X, respectively, reflecting a low-coverage TGS data scenario in particular, as intended. Despite the low number of strains (3), the high degree of similarity between them ensures that only approaches that are sufficiently strain aware are able to assemble them without confounding them. The data set serves as a test bed for evaluating basic properties of the benchmarked approaches. See “Methods” for full details.
Hybrid assembly approaches
See Table 1 for corresponding results. HyLight outperforms all other hybrid assembly approaches in terms of all relevant metrics. It covers 23.78 percentage points more strain sequence than the second best hybrid assembly approach (HyLight: 96.03; MetaPlatanus: 72.25), missing out on only 4% of strain-specific sequence. HyLight also dominates the other approaches in the other relevant categories, where improvements reach up to more than one order of magnitude. For example, over the second best approach in each category, it improves NGA50 by a factor of 5 (HyLight: 351 848; MetaPlatanus: 68 613), indel error rate by a factor of 24 (HyLight: 0.85/100 kbp versus 20.56/100 kbp), mismatch error rate by a factor of about 13.8 (HyLight: 23.56/100 kbp; MetaPlatanus: 324.99/100 kbp) and misassembled contig rate (MC) by a factor of 8.4 (HyLight: 0.19%; MetaPlatanus: 1.6%).
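The improvement factors quoted for the 3 Salmonella data set follow directly from the values given in the text, as the following small Python check illustrates. The helper name `fold_improvement` is hypothetical, and the only inputs are the numbers quoted above (HyLight versus the second best hybrid approach in each category); results are rounded to one decimal, so the last digit may differ slightly from the rounded factors in the prose.

```python
def fold_improvement(runner_up, hylight, higher_is_better):
    """Ratio by which HyLight improves over the runner-up for one metric."""
    return hylight / runner_up if higher_is_better else runner_up / hylight

# (runner-up value, HyLight value, whether larger values are better)
reported = {
    "NGA50": (68613, 351848, True),
    "indel errors per 100 kbp": (20.56, 0.85, False),
    "mismatch errors per 100 kbp": (324.99, 23.56, False),
    "misassembled contigs (%)": (1.6, 0.19, False),
}
folds = {
    name: round(fold_improvement(runner_up, hylight, better), 1)
    for name, (runner_up, hylight, better) in reported.items()
}
# Genome Fraction is compared as an absolute difference in percentage points.
genome_fraction_gain = round(96.03 - 72.25, 2)
```

Ratios are used for NGA50 and the error rates because these metrics span different scales, whereas Genome Fraction is bounded by 100 and is therefore compared as a difference.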

Strainberry / StrainXpress
See Table 1. Strainberry encountered difficulties in the phasing step due to the high similarity among the three Salmonella strains (ANI > 99%). As a result, Strainberry could not identify sufficiently many SNPs to separate reads from contigs, as assembled by Metaflye. Consequently, Strainberry was unable to assemble this data set. StrainXpress achieves a Genome Fraction (90.99%) that is superior to that of all prior (including hybrid) approaches, but is outperformed by HyLight, and achieves excellent results in categories relating to contig quality (errors and misassemblies). However, in terms of contiguity, StrainXpress lags behind all other approaches by large margins. This is no surprise, since StrainXpress is the only approach that does not make use of long reads, which limits its potential to output longer contigs.
20 bacterial strainsThis data set consists of 20 strains from 10 different species, resulting in an average of two strains per species. The average coverage for Illumina (NGS) and PacBio (TGS) reads is 20X and 10X, respectively, again reflecting a TGS low coverage scenario. For further details, please refer to the “Methods” section.
Hybrid assembly approaches
See again Table 1. Also on this data set, HyLight outperforms the other four methods across all categories. HyLight achieves a Genome Fraction of 91.76%, surpassing the current best method by more than 21 percentage points (MetaPlatanus: 70.33%). The NGA50 of HyLight is 139 730, which exceeds the second best NGA50 by a factor of 3 (Unicycler: 46 894). Regarding errors, HyLight’s contigs mark a threefold improvement in terms of indel error rates (HyLight: 3.97/100 kbp; HybridSPAdes: 11.83/100 kbp), and the mismatch error rate is four times lower than that of the toughest competitor (HyLight: 59.96/100 kbp; MetaPlatanus: 238.79/100 kbp). Finally, there are only half as many misassembled contigs relative to the second best competing method (HyLight: 0.20%; HybridSPAdes: 0.44%).
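NGA50 serves as the contiguity metric in these comparisons. MetaQUAST derives it from contig blocks that have been aligned to the reference and broken at misassemblies; the Python sketch below covers only the final step of such a computation and assumes the aligned block lengths are already available. The function name `ng50_from_blocks` and the toy lengths are hypothetical, and the sketch is not MetaQUAST's implementation.

```python
def ng50_from_blocks(block_lengths, reference_length):
    """Largest length L such that blocks of length >= L cover half the reference.

    Returns 0 if all blocks together cover less than half of the reference.
    """
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered * 2 >= reference_length:
            return length
    return 0

# Hypothetical aligned block lengths (bp) for a reference of 1500 bp.
blocks = [400, 300, 200, 100]
value = ng50_from_blocks(blocks, reference_length=1500)
# Same blocks against a larger reference: less than half of it is covered.
value_large_reference = ng50_from_blocks(blocks, reference_length=2500)
```

Because the threshold is defined relative to the reference length rather than the assembly length, an assembly that misses much of a genome cannot reach a high value merely by containing a few long contigs.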

Strainberry / StrainXpress
As was expected, both Strainberry and StrainXpress are competitive with respect to Genome Fraction. However, while StrainXpress (93.45%) even outperforms HyLight (91.76%), Strainberry achieves only 78.43%, which from an overall perspective (including hybrid approaches) still is remarkable. In terms of contiguity, unlike Strainberry’s assembly, whose contiguity is worse than that of HyLight, but still competitive, StrainXpress’ NGA50 is smaller by a factor of more than 40 in comparison with HyLight. Further, Strainberry’s error rates are largely on par with the low error rates of HyLight and StrainXpress, which is somewhat surprising in particular for indel errors. While the good error rates of Strainberry can be attributed to the low complexity of the data set, results in the other categories reflect expected outcomes when working with low coverage TGS and/or (medium coverage) NGS data. Here, just as much as on all other data sets, StrainXpress has the lowest misassembled contig rate (0.04% vs. 0.20% by HyLight).
100 bacterial strainsThis data set consists of 100 strains from 30 species, at an average coverage of 20X per strain for NGS (Illumina) and the (as usual low) 10X per strain for TGS (PacBio CLR) reads. The data set is designed to reflect a more complex scenario. The idea is to evaluate which of the available approaches potentially become confused if the mix of strains becomes more complex and more diverse.
Hybrid assembly approaches
See Table 1. In fact, despite a slight increase in terms of errors, HyLight remains largely unaffected by the elevated complexity and continues to outperform the other methods. Note first that neither HybridSPAdes nor Unicycler was able to complete the assembly within a month, so we terminated the corresponding runs (on 32 CPUs and 500 GB RAM; the runs did not finish once the strain number reached 100, the short read volume reached 16G, and the long read volume reached 10G). As for Genome Fraction, HyLight outperforms the other methods by at least 17 percentage points, where GF even exceeds the GF achieved on the low complexity data set (HyLight: 93.99%; MetaPlatanus: 76.6%). The NGA50 exceeds the second best one by a factor of 3.5 (HyLight: 163 296; MetaPlatanus: 46 937). Indel error rates are still lower by a factor of more than 2 (HyLight: 16.87/100 kbp; MetaPlatanus: 36.8/100 kbp) and mismatch error rates are smaller by a factor of more than 5 (HyLight: 75.95/100 kbp; MetaPlatanus: 407.42/100 kbp). Misassembled contig rate is smaller by a factor of more than 2 (HyLight: 0.93; MetaPlatanus: 2.05).
Thanks to variations in the average coverage of the TGS data of the 100 strains (average coverage follows a log-normal distribution by the design of CAMI), one can analyze the influence of coverage on the quality of the assemblies of the different strains. See Fig. 4 for the corresponding results. In comparison to the other two methods whose runs terminated successfully, HyLight generally achieves greater Genome Fraction across all strains. HyLight’s advantages become particularly noticeable at coverage rates below 20X, where HyLight outperforms the other methods by large margins with respect to all categories that refer to strain awareness and error content.

Fig. 4: Assembly of 100 strains.
The average coverages of these 100 strains follow a log-normal distribution, resulting in variations in the average coverage of each individual strain. Here, we assess the impact of different coverages on the assemblies of HyLight, MetaPlatanus, and OPERA-MS. Different colors represent different assembly approaches. (a) Genome fraction of the different approaches as coverage increases. (b) Misassembled contig length of the different approaches as coverage increases. (c, d) Mismatch and indel error rates of the different approaches as coverage increases. Box plots represent the median (center line), the 25th and 75th percentiles (bounds of the box), and the minimum and maximum values within 1.5 times the interquartile range (whiskers).
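The coverage-stratified analysis described above can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the paper: the log-normal parameters (`mean_coverage`, `sigma`), the bin edges, and the toy (coverage, Genome Fraction) records are all hypothetical, and the simulation uses Python's standard library rather than CAMISIM.

```python
import math
import random
from bisect import bisect_right
from statistics import median

def simulate_coverages(n_strains, mean_coverage, sigma, seed=0):
    """Draw per-strain coverages from a log-normal with the given mean."""
    rng = random.Random(seed)
    mu = math.log(mean_coverage) - sigma ** 2 / 2  # so that E[X] = mean_coverage
    return [rng.lognormvariate(mu, sigma) for _ in range(n_strains)]

def bin_label(coverage, edges):
    """Name the coverage bin (edges in ascending order) a value falls into."""
    i = bisect_right(edges, coverage)
    if i == 0:
        return f"<{edges[0]}X"
    if i == len(edges):
        return f">={edges[-1]}X"
    return f"{edges[i - 1]}-{edges[i]}X"

def median_per_bin(records, edges):
    """records: (coverage, metric_value) pairs; returns the median per bin."""
    bins = {}
    for coverage, value in records:
        bins.setdefault(bin_label(coverage, edges), []).append(value)
    return {label: median(values) for label, values in bins.items()}

# Assumed parameters: 100 strains, mean coverage 10X, sigma 0.5.
coverages = simulate_coverages(n_strains=100, mean_coverage=10.0, sigma=0.5)

# Hypothetical per-strain results: (coverage, Genome Fraction in %).
toy_records = [(4.0, 80.0), (6.0, 84.0), (8.0, 90.0), (15.0, 95.0), (30.0, 99.0)]
summary = median_per_bin(toy_records, edges=[10, 20])
```

Grouping strains into coverage bins in this way is what allows per-bin summaries such as the box plots in Fig. 4 to be drawn for each assembler.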

Strainberry / StrainXpress
Again, StrainXpress excels in terms of strain awareness (Genome Fraction), closely followed by HyLight. Although the margin between Strainberry and HyLight is considerable, Strainberry still achieves remarkable strain awareness from an overall perspective that takes the hybrid approaches into account. The increased complexity of the data has an impact on contiguity and error rates of the assemblies. Strainberry now has considerable disadvantages in terms of error rates, in particular with respect to indel errors, while HyLight and StrainXpress preserve excellent error rates. StrainXpress has considerable disadvantages in terms of contiguity, and HyLight exceeds even the NGA50 of Strainberry by a factor of more than 2.
210 bacterial strainsThis data set consists of 210 strains from 100 species as provided by55, for which reads were simulated using CAMISIM56. As usual, depth of coverage is 10X for TGS (so low coverage), and 20X for NGS reads. The data set is supposed to reflect a scenario of utmost complexity with respect to numbers of species and their strains.
Hybrid assembly approaches
See Table 1 for results. By and large, HyLight and Opera-MS, as the only methods whose runs terminated within a month, approximately mirror the results achieved on the data set containing 100 strains. HyLight outperforms Opera-MS by 19 percentage points in terms of Genome Fraction (HyLight: 90.49%; Opera-MS: 71.15%), has three times longer contigs (NGA50: HyLight: 128 015; Opera-MS: 43 045), has more than 7 times lower indel error rates (HyLight: 24.59/100 kbp; Opera-MS: 185.34/100 kbp), 7 times lower mismatch error rates (HyLight: 66.41/100 kbp; Opera-MS: 483.57/100 kbp) and more than 4 times fewer misassembled contigs (MC: HyLight: 1.63%; Opera-MS: 7.15%).

Strainberry / StrainXpress
Results largely repeat the achievements from the data set on 100 strains just discussed, but become even more distinct in terms of the expected advantages and disadvantages of the approaches. Both approaches outperform other approaches in terms of strain awareness. On this most complex data set, HyLight finally also clearly outperforms StrainXpress. While Strainberry has drawbacks with respect to error rates, StrainXpress considerably trails in terms of contiguity, with HyLight clearly outperforming both approaches in these categories.
Strain-mixing spike-in datasets
By design, these data sets can be used to investigate the influence of the coverage of the NGS reads in hybrid assembly. To enable such experiments, we made use of 10 highly similar Salmonella strains, which we spiked into real metagenome samples. While the coverage of spiked-in long reads was fixed to 10X, the coverage of spiked-in NGS reads varied from 5X to 30X, in steps of 5X, resulting in 6 different levels of coverage. These 6 different NGS read sets were spiked into 6 different real metagenome sequencing data sets, amounting to 36 different data sets overall. For each of these 36 data sets, the task is to assemble the genomes of the 10 Salmonella strains in a strain-aware manner.

For the evaluation, note that we do not make use of MetaQUAST, because MetaQUAST was not able to align the contigs against the reference effectively due to the high identity of many strains, often amounting to an average nucleotide identity (ANI) of more than 99%. As a consequence, indel and mismatch error rates were evaluated as excessive for all methods apart from HyLight (for example, the mismatch error rate was evaluated as 6859.95/100 kbp for Opera-MS, which cannot be correct, see Supplementary Data). For reasons of fairness, we therefore resorted to using QUAST65 with the same parameters, because we found that QUAST aligned contigs with the reference genomes more accurately. As an immediate effect, the error rates of the competitors dropped substantially (e.g., Opera-MS now at 1753.41/100 kbp). See Supplementary Data for both QUAST and MetaQUAST evaluated results.

An additional challenge was that MetaPlatanus consistently threw errors when dealing with data sets of only 5X or 10X simulated NGS coverage. Despite reaching out to the authors (via GitHub), the issue could not be resolved. Therefore, for MetaPlatanus, we only display results for data sets of simulated NGS coverage 15X and greater.
Hybrid assembly approaches
See Fig. 5 for results. HyLight outperforms the other methods across all coverage levels. For example, HyLight achieves an average Genome Fraction that exceeds those of the other approaches by about 25 to 27 percentage points (24.65% ~ 26.93%). Note that at coverage 5X, the Genome Fraction of HyLight drops to 77.45%. It already increases to 85.03% when raising coverage to 10X, and reaches nearly 90% from 15X onwards; increasing coverage further does not lead to any more significant changes.

Fig. 5: 10 Salmonella strains spike-in.
Simulated reads of 10 Salmonella strains were mixed with real sequencing data, and the combined data sets were then assembled. Subsequently, the quality of the assembly results for these spike-in strains in a complex environment was evaluated. The panels show how genome fraction (a), indel error rate (b), mismatch error rate (c), and misassembled contig rate (d) vary for the five assembly methods as coverage increases from 5X to 30X.
Among the prior hybrid assemblers, Unicycler achieves the lowest indel error rate (17.97 ~ 37.98/100 kbp), whereas HyLight (2.73 ~ 5.47/100 kbp) achieves an error rate of only 17.5% of that of Unicycler (Fig. 5b). Improvements of HyLight over prior hybrid assemblers in terms of mismatch errors are even more distinct: HyLight reduces the mismatch error rate of the top competitor by about 90%. Finally, Fig. 5d displays the comparison in terms of misassembled contig rate (MC): HyLight’s MC is only 17.5% of that of HybridSPAdes, as the toughest competitor.
In summary, HyLight outperforms the state-of-the-art in hybrid assembly by large margins with respect to the most relevant key assembly metrics, in a scenario that is characterized by near-identical strains embedded into complex real backgrounds.
Strainberry / StrainXpress are not evaluated, because the experiment only makes sense for evaluating particular qualities of hybrid assembly approaches.
Experiments: real data sets
We conducted further evaluations of all approaches using four real data sets: “Bmock 12 PacBio”, “Bmock 12 ONT”, “NWCs PacBio” and “NWCs ONT”. While the TGS coverage of the 4 data sets amounted to 22.11X, 18.1X, 127.2X and 89.01X (in the order in which the data sets are listed above), NGS coverage reached 275X for Bmock12 and 35.62X for NWCs.

The “Bmock12” data set consists of 11 strains from 9 species. Due to the low number of strains per species, strain-unaware approaches can also be expected to achieve fairly high Genome Fraction overall. This also means that a sufficiently thorough analysis of the strain awareness of the approaches requires breaking down results by the strains that are part of the data set. In this vein, one notices that among the species present, only Marinobacter and Halomonas have more than one strain; see Table 3 for summarizing statistics that refer to “Bmock 12 ONT” (statistics for “Bmock 12 PacBio” are similar, see Supplementary Table 1). While the two strains of Marinobacter exhibit 85% average nucleotide identity (ANI), the two Halomonas strains have an ANI of 99%. This indicates that methods should be evaluated with a view to their performance on the Halomonas strains in particular.

Table 3 The genome fraction of each individual strain in the Bmock12 data (Illumina and ONT)

The NWC data set includes 3 species (Streptococcus thermophilus, Lactobacillus delbrueckii, Lactobacillus helveticus), each of which has 2 strains, at ANIs of 99.99%, 99.24%, and 98.03%, respectively. For more detailed information, please see Supplementary Table 3. Despite the limited number of strains and their relatively low complexity, assembly remains challenging due to the high degree of similarity between the two strains of each species.

Bmock12 ONT
Hybrid assembly approaches
See Table 2 for the following results. Due to the reduced level of complexity in terms of the variety of strains, also strain-unaware, species-level metagenome assembly approaches are expected to deliver good performance in reconstructing the individual genomes. Despite the reductions in overall coverage due to the subsampling procedure (inducing a low coverage TGS data scenario), both HybridSPAdes and Unicycler were unable to complete the assembly process within one month runtime.
HyLight outperforms both Opera-MS and MetaPlatanus, the two hybrid assembly approaches whose runs terminated in acceptable time, when examining the relevant criteria and putting them into mutual perspective. The Genome Fraction of HyLight is 4.4 percentage points higher than that of the second-ranked MetaPlatanus (99.77% vs. 95.37%). MetaPlatanus achieves the greatest NGA50 (789 960 vs. 281 944), which, however, can be explained by the unusually large number of Ns in its contigs, whose primary purpose is to link and extend contigs by force, without read evidence for the missing sequence context of the contig links. HyLight improves by more than one order of magnitude over the other methods in terms of indel and mismatch errors. The indel error rate of HyLight is only 6.9% of that of the second-ranked OPERA-MS (1.43 vs. 20.78 per 100 kbp), and the mismatch error rate of HyLight is only 5.4% of that of the second-ranked MetaPlatanus (3.58 vs. 66.81 per 100 kbp).
We further examined the assembly status for each strain that is part of the mock community individually. Genome Fraction for the individual strains is displayed in Table 3. One immediately realizes that all approaches reconstruct at least (about) 99% of the strain-specific sequence for all but the two Halomonas strains, whose ANI amounts to 99%. On Halomonas sp.HL-4 in particular, HyLight achieves a Genome Fraction that is greater by more than 6 percentage points than that of MetaPlatanus and by more than 8 percentage points than that of Opera-MS (HyLight: 99.36; MetaPlatanus: 93.05; Opera-MS: 90.7). This lets one conclude that HyLight is the only hybrid metagenome assembly approach that operates in a strain-aware manner even when the ANI between strains is as great as 99%. As an additional insight gained from stratifying the quality of the assemblies by individual strain, one realizes that all hybrid assembly approaches are able to reconstruct about 99% of the strain-specific sequence even when TGS sequencing coverage is as low as 1.8X and the respective NGS coverage is as low as 14.91X for a particular strain (here: Micromonospora echinaurantiaca). This documents the general value of hybrid metagenome assembly with respect to its favorable behavior in terms of sequencing (in particular TGS) coverage demands.

Strainberry / StrainXpress
While StrainXpress achieves competitive performance with respect to Genome Fraction (99.04%), Strainberry, somewhat unexpectedly, no longer does (67.60%); HyLight outperforms StrainXpress as well (99.77%). While Strainberry has drawbacks with respect to error rates, StrainXpress considerably trails in terms of contiguity. HyLight outperforms both approaches in terms of both indel and mismatch error rates, but is outperformed by Strainberry in terms of contiguity.
Bmock12 PacBio
Note that the NGS reads used here are the same as those from “Bmock12 ONT”, whereas the TGS reads now stem from the PacBio CLR sequencing platform. This means in particular that the results of StrainXpress (see further below) agree with those achieved on “Bmock12 ONT”.
Hybrid assembly approaches
See Table 2 for results. Overall, results here mirror those achieved on “Bmock12 ONT”, although the advantages of HyLight are less pronounced: HyLight maintains a Genome Fraction that exceeds that of the other approaches by more than 4% and, although less distinctly, still exhibits considerably lower error rates. The NGA50 of MetaPlatanus exceeds that of HyLight considerably, which is again put into context by the large number of Ns in the MetaPlatanus contigs. The misassembled contig rate of HyLight is again roughly on a par with that of MetaPlatanus, where MetaPlatanus now has slight advantages (but see below for a more fine-grained analysis that provides explanations). Although results look more favorable for MetaPlatanus from a broader perspective, breaking down results by strain (see Supplementary Table 2) reveals that MetaPlatanus struggles substantially in reconstructing one of the Halomonas strains: while achieving 93.05 Genome Fraction for Halomonas sp. HL-4 on ONT data, MetaPlatanus only achieves 82.19 Genome Fraction on PacBio data, despite even having small advantages over HyLight on the other Halomonas strain. Analogous trends become evident for OPERA-MS. With respect to the misassembled contig rate, a more detailed analysis further demonstrates that the MC of the raw long reads (i.e., evaluating raw long reads as contigs in their own right) for “Bmock12 ONT” and “Bmock12 PacBio” is 2.33% and 7.55%, respectively; see Supplementary Table 1. The MC of the raw long reads is caused by chimera reads. Comparing the MC of the raw long reads with the MC of HyLight shows that HyLight reproduces the MC rates of the raw long reads. The most plausible explanation is that HyLight uses overlap graphs, which cannot identify chimera reads, whereas “short-read-first” approaches, thanks to employing DBGs, can identify artificial links as mistaken.
While there is good hope that overlap graph based approaches are able to identify chimera reads, too, we leave such improvements as promising future work at this point.

Strainberry / StrainXpress
StrainXpress reproduces its results because it makes use of only the NGS portion of the data, which is the same as that of “Bmock12 ONT”. Again, Strainberry, somewhat unexpectedly, does not achieve competitive performance in terms of strain awareness. Similarly, Strainberry again has considerable drawbacks with respect to error rates (indel errors in particular), containing more than 40 times more indel errors than HyLight. Here, the contiguity of Strainberry’s assembly is also outperformed by that of HyLight (NGA50 of HyLight: 123823; Strainberry: 72377).
NWC ONT
Hybrid assembly approaches
Due to the presence of a higher number of highly similar strains in NWC compared to Bmock12, the advantage of HyLight becomes more pronounced. HyLight outperforms the other hybrid assembly methods, both in terms of Genome Fraction (i.e., strain awareness; HyLight: 95.35%; MetaPlatanus, as second best: 68.54%) and NGA50 (i.e., contiguity; HyLight: 62800; MetaPlatanus, as second best: 20673). Further, although not outperforming the other approaches, HyLight achieves decent indel and mismatch error rates in comparison with the other approaches, ranking second and third, respectively, and, just like most other approaches, reducing the error content of the raw long reads by more than two orders of magnitude. Again, the MC, although not bad, is slightly worse than that of MetaPlatanus and HybridSPAdes, again reflecting that HyLight adopts issues introduced by chimera reads, for which there is good hope that this can be successfully addressed in future work.

Strainberry / StrainXpress
All approaches achieve competitive performance with respect to strain awareness (in the case of StrainXpress, as expected, at least in comparison with strain-unaware approaches), while HyLight, followed by Strainberry, excels. Strainberry has drawbacks with respect to indel error rates, containing approximately 25 times more indel errors than HyLight, while being roughly on a par with HyLight in terms of mismatch errors. As usual, StrainXpress considerably trails in terms of contiguity, with Strainberry taking over the lead from HyLight.
NWC PacBio
Unlike NWC ONT, this data set is affected by two strains whose long read coverage is extremely low (1.45X and 2.47X, respectively; see Supplementary Table 5). The reconstruction of those two strains presents a particular challenge, which makes this data set particularly challenging overall. Therefore, all methods produced assembly results that were inferior to those achieved on NWC ONT.
Hybrid assembly approaches
Notwithstanding the level of difficulty of the data set overall, results virtually reproduce the ones achieved on NWC ONT: HyLight outperforms the other approaches both in terms of Genome Fraction (HyLight: 78.94%; MetaPlatanus: 63.86% as second best) and NGA50 (HyLight: 22388; MetaPlatanus: 839 as second best). Error rates are roughly on a par with those of the other approaches, where everyone achieves sufficiently decent results. MC is slightly lower than those of the two best approaches, but considerably better than those of the other two approaches; again, presumably, chimera reads imply that HyLight reproduces the MC rates of the raw reads.

Strainberry / StrainXpress
Just as for Bmock12, StrainXpress reproduces its results because it makes use of only the NGS portion of the NWC data. Again, Strainberry’s performance in terms of strain awareness drops (here: quite substantially), which is somewhat unexpected and may be due to the reduced quality of the TGS portion of the data here; note that even HyLight somewhat struggles. In comparison to “NWC ONT”, Strainberry’s error rates are improved, containing only approximately 3 times more indel errors than HyLight, with StrainXpress taking the lead in terms of indel errors. Mismatch error rates are similar to those of NWC ONT. In terms of contiguity, HyLight clearly outperforms Strainberry, whose contigs align with less than 50% of the true genome, which prevents computation of NGA50.
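To illustrate why NGA50 cannot be computed when less than 50% of the reference is covered, the following is a minimal sketch of the metric, assuming its usual definition: the length of the aligned block at which the cumulative length of all blocks of that size or longer first reaches half of the reference length. The function name and inputs are illustrative and do not reproduce MetaQUAST's implementation.

```python
from typing import Optional, Sequence


def nga50(aligned_block_lengths: Sequence[int],
          reference_length: int) -> Optional[int]:
    """Return NGA50, or None if the aligned blocks cover < 50% of the reference.

    `aligned_block_lengths` are the lengths of contig blocks that align to the
    reference (contigs split at misassembly breakpoints); this input format is
    an assumption made for the sketch.
    """
    half_reference = reference_length / 2
    covered = 0
    # Walk through blocks from longest to shortest, accumulating length.
    for length in sorted(aligned_block_lengths, reverse=True):
        covered += length
        if covered >= half_reference:
            return length
    # The blocks never reach half of the reference: NGA50 is undefined.
    return None


print(nga50([40, 30, 20], reference_length=100))  # blocks cover 90% of the reference
print(nga50([20, 10, 10], reference_length=100))  # blocks cover only 40% of the reference
```

In the second call the aligned blocks sum to less than half of the reference, so the function returns None, which corresponds to the situation described for Strainberry above.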
Hybrid versus HiFi: three yeast strains
The three yeast strains Saccharomyces cerevisiae S288C, CICC-1445, and Saccharomyces pombe FLO-DUT were sequenced with PacBio HiFi, Oxford Nanopore Technologies (ONT), and the short-read sequencing technology BGISEQ (2 × 150 bp paired reads), which allows experiments that compare hybrid approaches (by design relying on ONT and BGISEQ) with HiFi assemblers. To control variables and to make sure that we were dealing with low coverage data sets, we subsampled 10X data from each sequencing technology for the assemblies. It is noteworthy that the ONT reads here contained a large number of shorter reads, resulting in an average length only half of that of the PacBio HiFi reads, which is not the typical case: generally, ONT reads are two to three times the length of PacBio HiFi reads. To ensure fairness in subsequent evaluations, we randomly subsampled reads with the requirement that ONT reads be longer than 10,000 bp and PacBio HiFi reads be longer than 5,000 bp. Due to the abundance of relatively short reads in the ONT data for these three data sets, the resulting 10X ONT subset had an average length close to that of PacBio HiFi.
Subsequently, to mimic a scenario akin to a metagenome, we merged the ONT, the BGISEQ and the HiFi reads of the three yeast strains into one data set for each sequencing technology. This established a mock community that consisted of two S. cerevisiae strains (S288C and CICC-1445) as well as an S. pombe strain (FLO-DUT). We ran all assemblers on this mock community data set. Note that only S288C comes with a (haploid) reference genome (provided via the SRA); in other words, only S288C is equipped with a ground truth. To evaluate results in a way that agrees with the conventions of evaluating metagenomes, we fed the contigs of each method into Merqury.
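The length-filtered random subsampling to 10X described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the tool or exact procedure used, and the function name, the (read ID, length) input format, the genome size and the random seed are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def subsample_reads(reads: Sequence[Tuple[str, int]],
                    min_length: int,
                    genome_size: int,
                    target_coverage: float,
                    seed: int = 0) -> Tuple[List[str], int]:
    """Randomly pick reads longer than `min_length` until the target coverage is met.

    `reads` holds (read_id, read_length) pairs. Returns the chosen read IDs and
    the total number of bases they contain.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    eligible = [read for read in reads if read[1] > min_length]
    rng.shuffle(eligible)

    target_bases = target_coverage * genome_size
    chosen: List[str] = []
    total_bases = 0
    for read_id, length in eligible:
        if total_bases >= target_bases:
            break
        chosen.append(read_id)
        total_bases += length
    return chosen, total_bases


# Toy example: 10X of a hypothetical 2,000 bp genome, ONT-style 10,000 bp cutoff.
toy_reads = [("a", 12000), ("b", 8000), ("c", 15000), ("d", 11000)]
chosen_ids, chosen_bases = subsample_reads(toy_reads, min_length=10000,
                                           genome_size=2000, target_coverage=10)
print(chosen_ids, chosen_bases)
```

For the HiFi reads the same function would be called with a 5,000 bp cutoff; if the eligible reads do not suffice to reach the target, all of them are returned.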
We recall that the lack of reference genomes for CICC-1445 and FLO-DUT prevented an evaluation with MetaQUAST.
Results are shown in Table 4. Evidently, HyLight outperforms both HiFi assemblers, Hifiasm-meta and MetaMDBG, quite substantially, while Hifiasm-meta and MetaMDBG are largely on a par. In summary, the hybrid assembler HyLight proves to be superior to the HiFi-only assemblers. Combining noisy ONT with NGS data appears to be preferable to assembling HiFi data alone: the hybrid approach excels in terms of strain awareness (Completeness), error content (Error Rate) and potential misassembled contigs (QV).
Table 4 Benchmark results are presented for the assembly of a real sequencing dataset comprising reads from 3 yeast strains
To make sure that these results agree with what one can achieve when evaluating against an available reference genome, we also ran the contigs pertaining to S288C against its available reference genome. See Supplementary Table 8 for the corresponding results. The numbers confirm the superiority of HyLight’s hybrid assemblies over the HiFi-only assemblies in terms of strain awareness, misassembly and error content. One can also see that the contiguity of the HiFi-only assemblies exceeds that of HyLight. This suggests that HiFi assemblies gain length at the expense of quality in terms of strain specificity, error and misassembly content. However, since corresponding results for the other two strains cannot be obtained because of the lack of reference genomes, one cannot be certain about the contiguity of the contigs relating to these two strains: Merqury can only assess strain awareness and quality in terms of error and misassembly content.
Runtime and memory usage evaluation
We evaluated the runtime and peak memory usage of all methods on the data set containing the 3 Salmonella strains, on an x86_64 GNU/Linux machine with 48 CPUs.
The data volume of the NGS (Illumina) reads amounted to 573 MB and the volume of the PacBio CLR reads amounted to 281 MB. Supplementary Table 7 reports CPU times and peak memory usages of the different hybrid assembly methods. OPERA-MS is clearly the fastest tool: it takes only 2.09 hours and 1.23 GB of memory. The runtimes of hybridSPAdes, HyLight, and MetaPlatanus are roughly on a par, at approximately 5.53, 7.01, and 6.93 hours, respectively. However, in terms of peak memory usage, both hybridSPAdes and HyLight demonstrate significantly lower usage than MetaPlatanus, with values of 3.85, 15.99, and 69.26 GB, respectively. Unicycler, on the other hand, requires the longest runtime (53.71 hours).
