Benchmarking bacterial taxonomic classification using nanopore metagenomics data of several mock communities

Performance evaluation of the different taxonomic classifiers

DNA-to-DNA methods

The evaluated DNA-to-DNA methods consisted of Kraken2, Bracken, Centrifuge, KMA and CCMetagen. Results described below are at species level (results at genus level are available in the Supplementary). As Bracken and CCMetagen are companion tools building upon the output of Kraken2 and KMA, respectively, their results are presented separately below. Kraken2, Centrifuge and KMA demonstrated low to very low precision for all datasets, i.e., a considerable number of species not present in the DMCs were predicted (Fig. 1A). Although the precision of KMA was low, its median precision (0.216) was considerably higher than that of Kraken2 (0.018) and Centrifuge (0.010). In contrast, all classifiers exhibited high recall, i.e., few false negative species were observed, and the majority of expected species were detected (Fig. 1B). A median recall of 1 was observed for all three classifiers. Moreover, the recall of all three classifiers was 1 for every individual dataset except the three StrainMad datasets and Zymo_D6331. Although the recall of KMA was slightly lower than that of Centrifuge and Kraken2, the higher precision of KMA resulted in the highest median F1 score (0.352), followed by Kraken2 (0.035) and lastly Centrifuge (0.019), which introduced many more FPs than Kraken2 (in some samples more than twofold) (Fig. 1C). The L1 distances of all three classifiers were very similar, with a median L1 distance for Centrifuge, KMA, and Kraken2 of 0.667, 0.662, and 0.674, respectively (Fig. 1D).

Fig. 1 Performance evaluation for the different classifiers aggregated over all DMCs (generated with the R9 technology) at species level. Each subplot represents a performance metric with panels A, B, C, D and E showing precision, recall, F1, L1, and AUPRC, respectively. For each subplot, the y-axis displays the metric value and the x-axis the different classifiers.
For every classifier, the metric values of all datasets are summarized in a boxplot with the median value as horizontal line. Individual dots represent specific values for the different DMCs (dots can be superimposed upon each other if the same value was observed). Outliers are denoted by dots enclosed in a black circle. The legend in the lower right panel corresponds to the DMC identifiers presented in Table 1.

CCMetagen, a companion tool to KMA that applies post-filtering, had a noteworthy high median precision (0.933). The post-filtering steps removed many FPs, substantially increasing precision compared to KMA (0.216), but also unintentionally removed TPs, resulting in a decreased median recall (0.600) compared to KMA (1). This was most notably observed in datasets with a staggered or logarithmic composition, for which the predicted relative abundances of some FPs were close to those of actual TPs, rendering it difficult to separate the two. Therefore, CCMetagen had the worst recall of all DNA-to-DNA methods, but still displayed the highest median F1 score of all DNA-to-DNA classifiers (0.706). CCMetagen had a slightly higher median L1 distance (0.741) compared to Kraken2, KMA, and Centrifuge, because the smaller number of FPs decreased the L1 distance but the higher number of FNs increased it.

Bracken, a companion tool to Kraken2, re-distributes reads classified at higher taxonomic levels to either the genus or species levels. As Bracken does not add or remove genera or species relative to those detected by Kraken2, scores such as precision, recall, and F1 are not altered by Bracken; instead, the relative abundances of the detected genera and species are recalculated based on reads assigned to a higher rank. However, the L1 distance differences of Bracken compared to Kraken2 were often very limited.
For some samples, such as BeiRes_276, Zymo_D6300, and Zymo_D6310, there was a decrease in L1 distance, and for some samples, such as the three StrainMad, Zymo_D6322 and Zymo_D6331, an increase was observed. This resulted overall in a marginal increase of the median L1 value of Bracken (0.673) compared to Kraken2 (0.667). Bracken, hence, did not substantially change the relative abundances for the analyzed samples.

DNA-to-protein methods

The evaluated DNA-to-protein methods consisted of Kaiju and MMseqs2. Results described below are at species level (results at genus level are available in the Supplementary). Similar to DNA-to-DNA methods, both classifiers displayed only very low precision (Fig. 1A). Kaiju introduced more FPs than MMseqs2, resulting in a lower median precision (0.010) compared to MMseqs2 (0.060). Again similar to DNA-to-DNA methods, both methods displayed very high recall. However, MMseqs2 exhibited more FNs than Kaiju for multiple samples, resulting in a lower median recall for MMseqs2 (0.900) compared to Kaiju (1) (Fig. 1B). The median F1 score of MMseqs2 (0.113) was higher than that of Kaiju (0.021) (Fig. 1C), due to the markedly higher precision of MMseqs2 compared to Kaiju. Nevertheless, the F1 score of MMseqs2 remained substantially lower than that of KMA (0.352). Both DNA-to-protein classifiers generally exhibited worse abundance estimations than DNA-to-DNA classifiers, as reflected in higher L1 distances, with MMseqs2 (1.124) exhibiting a worse median L1 distance than Kaiju (1.059) (Fig. 1D).

DNA-to-marker methods

The evaluated DNA-to-marker methods consisted of MetaPhlAn3 and mOTUs2. Results described below are at species level (results at genus level are available in the Supplementary). mOTUs2 displayed a substantially higher median precision (1) compared to MetaPhlAn3 (0.381) (Fig. 1A). MetaPhlAn3 displayed a large spread in precision over the different DMCs.
The samples that exhibited the lowest precision were those with few species and a staggered or logarithmic composition. Overall, both DNA-to-marker methods performed substantially better in precision compared to DNA-to-DNA and DNA-to-protein methods, excluding CCMetagen (0.933), which achieved a higher precision than MetaPhlAn3. However, recall values for both MetaPhlAn3 (0.645) and mOTUs2 (0.600) were also the lowest of all evaluated methods, excluding CCMetagen (Fig. 1B). Because MetaPhlAn3 and mOTUs2 employ different underlying databases that could not be harmonized, the introduction of FNs was not solely dependent on the classifier’s capability, but also on the presence of the ground truth in their underlying reference databases. Investigation of the underlying databases indicated that mOTUs2 contained fewer taxa from the ground truth in two DMCs and more taxa in one DMC (see Table S1). mOTUs2 had a higher F1 score (0.733) compared to MetaPhlAn3 (0.516) (Fig. 1C), since mOTUs2 had the highest precision and comparable recall to MetaPhlAn3. Consequently, the F1 scores of DNA-to-marker methods were the highest compared to both DNA-to-DNA and DNA-to-protein methods, again with the notable exception of CCMetagen. The L1 distances for MetaPhlAn3 (0.817) and mOTUs2 (0.575) differed substantially (Fig. 1D). Notably, mOTUs2 emerged as the classifier with the best L1 distance.

Relative abundance threshold filtering

Area under the precision-recall curve

Overall, DNA-to-DNA and DNA-to-protein methods displayed high to very high recall, but suffered from very low precision, drastically reducing their F1 scores, whereas DNA-to-marker methods displayed medium recall but very high precision, resulting in overall the best F1 scores (Fig. 1).
Since classifier performance can be increased by setting an abundance threshold to remove FP predictions, albeit at the cost of increased FNs, PR plots were calculated for all classifiers (see Reports) [64]. The resulting AUPRC values at species level are presented in Fig. 1E (results at genus level are available in the Supplementary). Median AUPRC values were the lowest for the DNA-to-marker methods MetaPhlAn3 (0.533) and mOTUs2 (0.600), and the DNA-to-DNA method CCMetagen (0.523). This is likely because recall values of these classifiers were the lowest of all considered categories whereas precision values were the highest, so that further filtering could only reduce recall values with little effect on precision. DNA-to-protein methods displayed an intermediate effect for both Kaiju (0.647) and MMseqs2 (0.672), indicating a mildly positive effect of abundance filtering. Lastly, the DNA-to-DNA methods Kraken2 (0.830), Bracken (0.829), Centrifuge (0.838), and KMA (0.789) displayed the highest AUPRC values, excluding CCMetagen. This indicated a marked positive effect of relative abundance threshold filtering for DNA-to-DNA methods with respect to other methods. CCMetagen was an exception because this method performs heavy filtering by default and therefore behaves more similarly to DNA-to-marker tools. For all methods, the considered samples had a marked effect on AUPRC values, as expected, since samples with fewer organisms and an even composition exhibited better AUPRC scores. For such samples, it was easier to find a threshold that removed many FPs, alleviating the low precision of these methods, without an associated cost of decreasing recall.

Effect of abundance filtering on precision, recall and F1

As thresholds changed during filtering, precision and recall values also changed. An example is the Zymo_D6300 dataset, for which Kraken2 (1) and KMA (0.744) had different AUPRC values at species level.
Kraken2 became the perfect classifier with a precision and recall of 1 when a filtering threshold of 2.5% was applied. Conversely, whereas KMA exhibited increased precision at the initial filtering thresholds, its precision declined rapidly as the filtering threshold continued increasing, due to an FP with a substantial relative abundance. Hence, although AUPRC values indicated that DNA-to-DNA methods benefited from increased filtering, finding balanced filtering still requires evaluating precision, recall, and F1 scores at different thresholds to select suitable thresholds for the different classifiers. Figure 2 displays the general trends of precision, recall and F1 at varying thresholds in steps of 0.05% for all classifiers at species level from 0% to 1.20% (results at genus level are available in the Supplementary). As expected, precision benefitted from increasing relative abundance filtering thresholds, whereas recall was penalized, although trends could differ between individual classifiers.

Fig. 2 Precision, recall and F1 for the different classifiers when filtering is applied at species level. The first, second and third row represent precision, recall and F1 score, respectively, and each column displays a different classifier. The x-axis of every subplot represents the applied filter threshold for which all species below this threshold were considered as absent, and the y-axis displays the metric value. Each subplot contains three shades of color with the darkest shade showing the median, the medium shade showing the IQR, and the brightest shade showing the minimum/maximum values over all nine R9 DMCs.

All DNA-to-DNA classifiers had their steepest increase in median precision before a threshold of 0.5%, but the slope of the increase could differ between classifiers, with KMA exhibiting a notably steeper slope compared to Kraken2, Bracken, and Centrifuge.
Additionally, both the final maximum median precision and the filtering threshold at which it was reached could differ between classifiers. The maximum precision of Kraken2 (1), Bracken (1), KMA (0.963) and Centrifuge (1) was reached at a threshold of 0.45%, 1.05%, 0.9% and 0.65%, respectively. However, recall values dropped very fast with increased filtering. At the threshold where DNA-to-DNA classifiers reached their maximum precision, their median recall had decreased drastically. With F1 scores as a balanced metric for both precision and recall, F1 values experienced the steepest increase up to a threshold of 0.05%, after which the increase slowed down or even reversed, suggesting this to be a well-balanced cutoff for DNA-to-DNA methods. CCMetagen was an outlier among DNA-to-DNA methods, as this classifier inherently already performs filtering, so that further filtering barely made a difference in precision but decreased recall fairly quickly. Although CCMetagen without filtering scored best in F1 compared to other DNA-to-DNA methods, the other DNA-to-DNA methods surpassed CCMetagen even at very low filtering values, suggesting that the default filters applied by CCMetagen are potentially too strict and should be relaxed. While precision similarly increased for DNA-to-protein methods, its increase was much less steep. Both DNA-to-protein methods had their steepest increase before 0.2% with a similar slope. A maximum median precision of 1 was reached at high filtering thresholds of 2.3% and 1.85% for Kaiju and MMseqs2, respectively, although, as for DNA-to-DNA methods, at substantial costs in recall that were more pronounced for Kaiju. With F1 scores as a balanced metric for both precision and recall, Kaiju and MMseqs2 reached their best F1 scores at different filtering thresholds of 0.1% and 0.05%, respectively.
Although the steepest increase for MMseqs2 was before 0.05%, its F1 score still increased at 0.1% without a decrease in recall, suggesting 0.1% to be a well-balanced filtering threshold. Nevertheless, it appeared that even with tailored filtering thresholds, DNA-to-DNA methods outperformed DNA-to-protein methods because their precision could generally be increased without as drastic a drop in their recall.

Lastly, DNA-to-marker methods similarly displayed increasing precision but with marked differences between mOTUs2 and MetaPhlAn3. mOTUs2 already exhibited a median precision of 1 without any additional filtering, whereas the precision of MetaPhlAn3 benefitted greatly from additional filtering, reaching a maximum of 0.917 at a filtering threshold of 2.7%. Recall values declined faster for MetaPhlAn3 than for mOTUs2 with additional filtering. This was reflected in their F1 scores, which suggested a filtering threshold of 0.1% for MetaPhlAn3 and no filtering threshold for mOTUs2.

Assessment of overall classifier performance

A summary of the performance of all classifiers at species level is presented in Fig. 3A (results at genus level are available in the Supplementary), displaying precision and recall along with their interquartile ranges (based on values obtained over all DMCs) represented as error bars, illustrating a clear distinction between three main groups. The first group contains DNA-to-DNA and DNA-to-protein classifiers, excluding CCMetagen, in the top left corner characterized by high recall but low precision. Within this group, KMA had the best precision. Although its precision exhibited more fluctuation based on its IQR, its lower boundary was still higher than the highest IQR boundary of other classifiers within this group. All classifiers scored a median recall value of 1, except for MMseqs2, although it did reach a recall of 1 for some datasets. The recall IQRs of classifiers, excluding MMseqs2, were hence similar within this group.
The second group consists solely of MetaPhlAn3, which resides in a central position characterized by medium recall and precision. MetaPhlAn3 displayed the widest IQR for its precision of all classifiers. Recall was lower compared to the first group, partly explained by missing taxa in the underlying reference database (see Table S1). However, had these missing taxa been present and correctly detected, MetaPhlAn3 would still have missed more species than the classifiers in the first group (see Table S2) because many species with very low relative abundances were missed. The third group consists of CCMetagen and mOTUs2, residing at the middle right position characterized by high precision but medium recall. Both classifiers exhibited the lowest median recall and largest IQR for recall values among all classifiers. Although mOTUs2 obtained the highest precision, close to 1 for all datasets, it experienced the same issue as MetaPhlAn3 with ground truth species being absent in its underlying reference database (see Table S1), having a profound negative impact on recall. However, even if those taxa had been present in the database and detected, the number of FNs would still have been higher than for other classifiers (see Table S2). CCMetagen, on the other hand, relies on heavy post-filtering of KMA results, increasing precision to very high values but removing too many TPs in the process, especially in datasets with a staggered composition, incurring a heavy penalty in recall.

Fig. 3 Overall median precision and recall values at species level for the different classifiers. The dots in panel A represent the median precision (x-axis) and recall (y-axis) values for every classifier aggregated over all nine DMCs, while the error bars indicate the extent of the IQR for both the precision and recall.
The dots in panels B and C similarly indicate median precision (x-axis) and recall (y-axis) values for every classifier aggregated over all nine R9 DMCs, but with error flags indicating the updated median precision and recall for an abundance filtering threshold of 0.05% and 0.1%, respectively. Classifiers are colored according to the legend on the lower right of plot C. Abbreviations: DMC (Defined mock community); IQR (Interquartile range); PR (Precision recall).

Figure 3B and C illustrate the effects on precision and recall at species level using filtering thresholds of 0.05% and 0.1% (results at genus level are available in the Supplementary), displaying the effects of filtering as error bars. For the first group, precision increased strongly at 0.05%. Raising the threshold to 0.1% led to a further increase in precision, albeit to a lesser degree compared to the initial 0.05% threshold. Recall decreased similarly for both thresholds, with the decline being less pronounced for DNA-to-protein methods compared to DNA-to-DNA methods, in agreement with the suggested filtering thresholds of 0.05% and 0.1% for DNA-to-DNA and DNA-to-protein methods, respectively (see section Effect of abundance filtering on precision, recall and F1). The second group showed a small increase in precision for a threshold of 0.05% and a bigger increase for a threshold of 0.1%. The associated drops in recall were much less pronounced than for the first group, in agreement with the suggested filtering threshold of 0.1% for MetaPhlAn3 (see section Effect of abundance filtering on precision, recall and F1).
Lastly, the third group did not demonstrate any further increases or decreases in either precision or recall when filtering thresholds were increased, in agreement with the suggestion that no filtering should be employed for mOTUs2 and CCMetagen (see section Effect of abundance filtering on precision, recall and F1).

Evaluation of classification performance using a single ONT R10 DMC

Figure 4 compares the classification performance of all classifiers on the R9 and R10 datasets of sample Zymo D6322 at species level (genus level results are available in the Supplementary). For most classifiers, there was no substantial difference in absolute precision between the two datasets. Only CCMetagen exhibited a notable decline in absolute precision for the R10 dataset, with an absolute decrease of 0.111. However, in relative precision, the R10 dataset showed a substantial decrease for CCMetagen (−11.11%), Centrifuge (−15.78%) and MMseqs2 (−23.79%), whereas a relative precision increase was observed for Kraken2/Bracken (+2.56%), KMA (+8.45%), Kaiju (+26.43%), and MetaPhlAn3 (+16.67%). The precision of mOTUs2 remained the same in both datasets. The notable difference in absolute precision for CCMetagen stems from the low count of FPs in the R9 dataset. Consequently, the introduction of additional FPs in the R10 dataset substantially affected precision for CCMetagen, unlike other classifiers, which already had a higher FP count in the R9 dataset. In contrast, there were no differences in FNs between the R9 and R10 datasets, so that the recall for all classifiers remained the same. Consequently, F1 score differences between the R9 and R10 datasets mirrored trends observed for precision, with the R10 dataset showing a relative F1 score decrease for CCMetagen (−5.88%), Centrifuge (−15.67%), and MMseqs2 (−22.98%); a relative F1 score increase for Kraken2/Bracken (+2.51%), KMA (+7.60%), Kaiju (+26.09%), and MetaPhlAn3 (+12.50%); and the same F1 score for mOTUs2.
Note, however, that the employed R9 dataset of sample Zymo D6322 had a relatively high quality compared to other R9 datasets (see Supplementary Figures S8, S10–S17). This higher quality of the R9 Zymo D6322 dataset was, however, not an isolated case, as samples Bei Resources HM-277D (Supplementary Figure S11) and Zymo D6331 (Supplementary Figure S17) had comparable read quality distributions to R9 Zymo D6322 (Supplementary Figure S8), demonstrating the variability of nanopore sequencing.

Fig. 4 Metric values at species level for the R9 and R10 dataset of Zymo D6322. The dots in panel A, B and C represent the precision, recall and F1 values (left axis), respectively, for every classifier (lower axis) of both the R9 dataset and R10 dataset of the DMC Zymo D6322. Dots can be superimposed upon each other if (nearly) identical values were observed. The bars in each panel present the relative percentage change (right axis) from the R9 to R10 metric value.
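
The metrics reported throughout this section can be illustrated with a minimal Python sketch. It is not the evaluation code used in this study: the function names, the toy profiles, and the taxon labels are hypothetical, and the L1 distance is assumed to be the sum of absolute differences between predicted and expected relative abundances, which this section does not restate. Relative abundances are expressed as fractions, so a threshold of 0.05% corresponds to 0.0005, and filtered profiles are not renormalized (also an assumption).

```python
from typing import Dict, Tuple

Profile = Dict[str, float]  # taxon name -> relative abundance (fraction of 1)


def filter_profile(profile: Profile, threshold: float) -> Profile:
    """Treat every taxon below the abundance threshold as absent."""
    return {taxon: ab for taxon, ab in profile.items() if ab >= threshold}


def detection_metrics(predicted: Profile, truth: Profile) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) based on presence/absence of taxa."""
    pred, true = set(predicted), set(truth)
    tp = len(pred & true)   # expected taxa that were detected
    fp = len(pred - true)   # detected taxa that are not in the mock community
    fn = len(true - pred)   # expected taxa that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def l1_distance(predicted: Profile, truth: Profile) -> float:
    """Sum of absolute abundance differences over all taxa in either profile."""
    taxa = set(predicted) | set(truth)
    return sum(abs(predicted.get(t, 0.0) - truth.get(t, 0.0)) for t in taxa)


def relative_change(old: float, new: float) -> float:
    """Relative percentage change from an old to a new metric value."""
    return 100.0 * (new - old) / old


# Hypothetical two-species even community and a prediction with two
# low-abundance false positives.
truth = {"species_A": 0.5, "species_B": 0.5}
predicted = {"species_A": 0.48, "species_B": 0.47,
             "species_C": 0.03, "species_D": 0.02}

unfiltered = detection_metrics(predicted, truth)
filtered = detection_metrics(filter_profile(predicted, 0.05), truth)
distance = l1_distance(predicted, truth)
```

In this toy example, both false positives fall below the 5% threshold, so filtering raises precision from 0.5 to 1 without reducing recall. This mirrors the behaviour described for samples with few organisms and an even composition, whereas in staggered communities the same threshold would also remove low-abundance true positives.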
