Taxometer: Improving taxonomic classification of metagenomics contigs

Improving contig annotations of short-read based metagenomics

To demonstrate that Taxometer could improve the annotations of different taxonomic classifiers, we trained it on MMseqs2 [3] and Metabuli [30] configured to use the GTDB database [31], and on Centrifuge [5] and Kraken2 [2] configured to use the NCBI database [32]. For the CAMI2 human microbiome datasets, we found that MMseqs2 correctly annotated on average 66.6% of contigs at species level (Fig. 2a). When applying Taxometer trained on the MMseqs2 annotations, the proportion of correctly annotated contigs increased to 86.2%. Additionally, when applied to two more challenging datasets, CAMI2 Marine and Rhizosphere, Taxometer increased the MMseqs2 annotation level from 78.6% to 90% and from 61.1% to 80.9%, respectively (Fig. 2c). For the human microbiome datasets, this was reflected in F1-score improvements of between 0.1 and 0.13. When we applied Metabuli, Centrifuge, and Kraken2 to the CAMI2 human microbiome datasets, they correctly annotated on average >94.8% of contigs at species level. Here, Taxometer did not improve on this close-to-perfect annotation level, but it also did not decrease performance (absolute change in F1-score < 0.002). However, when applied to the Marine and Rhizosphere datasets, the performance of the three methods was much lower. For instance, Metabuli provided wrong species annotations for 12.7% and 37.6% of contigs in the two datasets, respectively (Fig. 2c, Supplementary Fig. 2). Here, Taxometer reduced the proportion of wrong annotations to 7.6% and 15.4%, increasing the F1-score from 0.87 to 0.88 and from 0.61 to 0.69, respectively. Similarly, for Centrifuge and Kraken2 applied to the Rhizosphere dataset, Taxometer reduced the proportion of wrong annotations from 68.7% to 39.5% (F1-score from 0.22 to 0.27) and from 28.7% to 13.3% (F1-score from 0.64 to 0.68), respectively (Supplementary Fig. 2).

Fig. 2: CAMI2 results. Taxonomic classifier annotations and Taxometer results at species level, compared to the ground truth labels, score threshold 0.95. MMseqs2 and Metabuli returned GTDB annotations; Centrifuge and Kraken2 returned NCBI annotations. (a) CAMI2 Gastrointestinal, (b) CAMI2 Marine and (c) CAMI2 Rhizosphere datasets. Source data are provided as a Source Data file.

Even though Taxometer had high precision for taxonomic annotations, it made mistakes and incorrectly re-annotated a small fraction of the correct MMseqs2 species-level annotations (1.6%–3.3% on the CAMI2 human microbiome datasets). We also performed the same analysis on two mock communities: the ZymoBIOMICS Microbial Community Standard with 10 strains and the ZymoBIOMICS Gut Microbiome Standard with 21 strains (Supplementary Figs. 3 and 4). Taxometer improved the quality of taxonomic annotations for both mock communities, except for the extremely well-performing Kraken2 and MetaMaps, where performance stayed the same after applying Taxometer (F1-score around 0.91). For MMseqs2, the F1-score improved from 0.28 to 0.847 for the ZymoBIOMICS Gut Microbiome Standard sample and from 0.623 to 0.889 for the ZymoBIOMICS Microbial Community Standard sample. Finally, we investigated the importance of varying the threshold of the annotation score. Here, we found that a score threshold of 0.95 provided a good balance between recall and precision for multi-sample datasets (Fig. 3, Supplementary Figs. 5 and 6).
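
To make the role of the score threshold concrete, below is a minimal sketch of how annotations can be scored against a ground truth at a fixed threshold. The data layout and the function name are illustrative assumptions, not part of Taxometer's actual interface.

```python
from typing import Dict, List, Tuple

def evaluate_at_threshold(
    predictions: List[Tuple[str, str, float]],  # (contig_id, label, score)
    ground_truth: Dict[str, str],               # contig_id -> true label
    threshold: float = 0.95,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), keeping only predictions whose
    score reaches the threshold; the rest are treated as unannotated.
    Assumes each ground-truth contig appears at most once in predictions."""
    tp = fp = 0
    for contig_id, label, score in predictions:
        if score < threshold:
            continue  # discarded prediction: contig stays unannotated
        if ground_truth.get(contig_id) == label:
            tp += 1
        else:
            fp += 1
    fn = len(ground_truth) - tp  # contigs without a kept correct annotation
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

Raising the threshold trades recall for precision; sweeping it over a grid of values yields the precision-recall curves shown in Fig. 3.
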
Taken together, we found that Taxometer could fill annotation gaps and remove incorrect taxonomic labels for large numbers of contigs from diverse environments, while mislabeling only a small minority of correctly labeled contigs.

Fig. 3: Precision-recall curves. Precision-recall curves for predictions at species level, compared to the ground truth: (a) CAMI2 human microbiome, (b) Marine and Rhizosphere datasets. Values at the score threshold 0.95 are marked with a cross. Source data are provided as a Source Data file.

Using both abundance and TNF improved predictions

Given the ability of Taxometer to predict new annotations as well as to correct wrong ones, we investigated the contribution of the abundance and TNF features to the predictions. We therefore trained Taxometer to predict MMseqs2 annotations for the CAMI2 human microbiome datasets using either the abundances or the TNFs alone (Fig. 4a, Supplementary Fig. 7). Here we found that for higher taxonomic levels (phylum to genus), up to 98% of Taxometer annotations could be reproduced by training on TNFs only. This is in concordance with previous findings that TNFs can be used to classify metagenomic fragments at the genus level and that abundance shows better strain-level binning performance than TNFs [14,33]. For MMseqs2 annotations of the CAMI2 Airways dataset, the number of correct species labels predicted by the model combining both TNFs and abundances was 18–35% larger than for the models using only TNFs or only abundances.

Fig. 4: Analysis of feature importance and novel taxa. (a) Contribution of abundance and TNF features to Taxometer performance, demonstrated on the CAMI2 Airways short-read dataset: the number of correctly predicted contig labels at each taxonomic level using a score threshold of 0.5. (b) Simulation analysis of unknown taxa. X-axis: Pearson correlation coefficient between the mean feature vectors of the deleted and the assigned species. Y-axis: ratio between the number of contigs of the deleted species ("deleted") and the number of contigs of the species that was most prevalent among the incorrectly assigned ("assigned") in the training set. The color legend shows the share of correctly missing labels, equal to 1 − FP, where FP is the share of false positives. FP is high when the assigned species was more prevalent in the training set and the TNFs and abundances are highly correlated between the deleted and the assigned species. Source data are provided as a Source Data file.

Since the abundance vector was an important feature for predicting the labels, we investigated whether annotations were still improved when the abundance vector consisted of only one sample. We therefore used only the contigs from one sample from each of the five CAMI2 human microbiome datasets (Airways, Oral, Skin, Urogenital, Gastrointestinal). We observed that Taxometer still showed a major improvement for the MMseqs2 annotations (F1-score increased from 0.738 to 0.866 for the Airways dataset) and only slightly decreased the performance of the best-performing classifiers (with the largest drop in F1-score for the Skin dataset, from 0.926 to 0.895), supporting our previous findings based on the multi-sample abundance vector (Supplementary Fig. 8). Thus, using Taxometer was beneficial in both one-sample and multi-sample experiments.
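
The kind of feature ablation described above can be sketched as follows; a scikit-learn logistic regression stands in for Taxometer's actual model, and the array shapes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def feature_ablation(tnf: np.ndarray, abundance: np.ndarray,
                     labels: np.ndarray) -> dict:
    """Compare TNF-only, abundance-only, and combined feature sets.
    tnf: (n_contigs, n_kmers) tetranucleotide frequencies;
    abundance: (n_contigs, n_samples) per-sample contig depths;
    labels: (n_contigs,) encoded taxonomic labels to predict."""
    feature_sets = {
        "tnf_only": tnf,
        "abundance_only": abundance,
        "combined": np.hstack([tnf, abundance]),
    }
    # Mean 5-fold accuracy per feature set, as a simple proxy for the
    # count of correctly predicted labels compared in the text.
    return {
        name: cross_val_score(
            LogisticRegression(max_iter=1000), X, labels, cv=5
        ).mean()
        for name, X in feature_sets.items()
    }
```
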
Most metagenome binners use the abundance vector as well, so we benchmarked Taxometer against the VAMB binner as a taxonomic label refinement tool. We ran VAMB on the CAMI2 datasets, which assigns contigs to bins. For each bin, we determined its taxonomy by taking the majority taxonomic label of its contigs according to the Kraken2 taxonomic classifier, and then assigned the bin's label to each contig in that bin. Comparing the assigned labels to the ground truth, we found that taxonomic classification after this procedure was worse than both the Kraken2 classification and the Taxometer refinement of the Kraken2 classification (86% correctly annotated contigs with the binning approach vs 91% for the Kraken2 results) (Supplementary Fig. 9). Thus, binning alone cannot serve as a taxonomic refinement tool, despite its use of contig abundances.

Novel species were predicted at genus level

To explore the limitations of Taxometer, we investigated its performance when species in the dataset were missing from the database. To achieve this, we deleted the annotations of five species from the CAMI2 human microbiome MMseqs2 results before training Taxometer, removing between 649 and 5127 contig annotations per dataset. As the deleted species were not in the training set, a perfect classifier should assign missing labels to the contigs belonging to these species. In our experiments, Taxometer predicted the correct genus label for all these contigs. However, the share of incorrectly assigned annotations at species level varied between 6% and 82% across species and datasets. For these wrong annotations, we found that the numbers of contigs of the deleted and the assigned species, and the mean feature correlation between them, were the most important factors (Fig. 4b). False positives tended to occur when the assigned species was more prevalent in the training set than the deleted species; for example, in the Airways dataset, Staphylococcus aureus was assigned to 1312 of the 1879 contigs from the deleted species Staphylococcus epidermidis. This can be explained by Staphylococcus aureus being 17 times more prevalent in the training set than Staphylococcus epidermidis, with 78% of all contigs from the Staphylococcus genus coming from Staphylococcus aureus. Second, we found that Taxometer made mistakes when the TNFs and abundances were highly correlated between the deleted and the assigned species. For instance, in the Gastrointestinal dataset, 183 out of 229 contigs of the deleted species Butyrivibrio hungatei were assigned to Butyrivibrio sp900103635, which had only 8 contigs in total; however, the Pearson correlation of the mean feature vectors of these two species was 0.99. Thus, for the contigs of a novel taxon, Taxometer may assign a closely related taxon instead of returning a missing annotation.
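
As a sketch of the correlation analysis used here, the following computes the Pearson correlation between the mean feature vectors of a deleted and an assigned species. The array layouts are assumptions for illustration and do not reflect Taxometer's internal representation.

```python
import numpy as np

def mean_feature_correlation(
    features: np.ndarray,   # (n_contigs, n_features): TNFs and abundances
    species: np.ndarray,    # (n_contigs,) species label per contig
    deleted: str,
    assigned: str,
) -> float:
    """Pearson correlation between the mean feature vectors of the
    deleted species and the species its contigs were assigned to.
    Values near 1.0 indicate near-indistinguishable feature profiles."""
    mean_deleted = features[species == deleted].mean(axis=0)
    mean_assigned = features[species == assigned].mean(axis=0)
    return float(np.corrcoef(mean_deleted, mean_assigned)[0, 1])
```
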
Taxometer improved recall for annotations at all confidence levels

Some taxonomic classifiers provide an interface to threshold the confidence of the taxonomic labels they assign. As this affects the precision of the resulting annotations, we tested Taxometer performance for five different confidence values of the Kraken2 classifier on the CAMI2 datasets (Supplementary Fig. 10). For instance, on the CAMI2 Gastrointestinal dataset, increasing the Kraken2 confidence reduced the F1-score from 0.977 at confidence 0.0 to 0.859 at confidence 0.25, 0.825 at confidence 0.5, 0.726 at confidence 0.75, and 0.029 at confidence 1.0 (Supplementary Fig. 11). For the Rhizosphere dataset, where Kraken2 showed low precision at the default value 0.0, increasing the confidence value from 0.0 to 0.25 increased the F1-score from 0.715 to 0.717. Here, Taxometer improved the F1-score at confidence 0.25 to 0.731. When the confidence value was further increased to 0.5 and 0.75, the Kraken2 F1-score dropped to 0.49 and 0.327, but the Taxometer prediction stayed almost the same, with an F1-score of 0.72 at both confidence levels (Supplementary Fig. 12). In conclusion, filtering the output of a classifier by confidence level results in better precision but lower recall. Here, we showed that Taxometer could improve both recall and precision, leading to a better F1-score when applied to the filtered output.

Taxometer as a benchmarking tool

We were interested in evaluating the performance of Taxometer for the annotation of long-read metagenomics data. In the absence of a sufficiently large dataset with available ground truth, we analysed the ZymoBIOMICS Gut Microbiome Standard with 21 strains and one sample. Despite an abundance vector consisting of only a single number, Taxometer improved the F1-score of the MMseqs2 classifier from 0.28 to 0.854 and made a 0.0–0.05 improvement for the other classifiers. However, this dataset did not reflect real-world data size and complexity. We therefore investigated whether the consistency of Taxometer annotations could be used as a measure of classifier performance when ground truth labels are missing. Because metagenomic binning has been able to generate hundreds of thousands of MAGs, TNFs and abundances carry a strong signal for contigs of the same origin [17]. Thus, we hypothesized that the ability of Taxometer to predict classifier annotations could reflect the performance of the classifier. Specifically, the more incorrect labels a classifier assigns to contigs of the same origin, the harder it will be for Taxometer to reproduce the labels assigned by that classifier. Therefore, inconsistent annotations by a classifier will decrease the score that Taxometer assigns to any taxon in the dataset, and Taxometer scores can be used as a proxy for classifier consistency.

To investigate this, we acquired annotations from four classifiers for the seven CAMI2 datasets, the ZymoBIOMICS Microbial Community Standard and the ZymoBIOMICS Gut Microbiome Standard, as well as from the additional MetaMaps classifier for the ZymoBIOMICS Gut Microbiome Standard, resulting in 37 sets of taxonomic labels. We divided each dataset into five folds and trained Taxometer to predict the annotations of each classifier five times, each time using a new fold as the validation set. We then compared how well the Taxometer predictions corresponded to the annotations of the particular classifier and to the ground truth (Fig. 5a, c, Supplementary Fig. 13, Table 1). We found that the Taxometer and classifier precision metrics were correlated, with a Spearman correlation coefficient of 1.0 for 3 of the 9 datasets, reflecting a perfect ranking, and >0.78 for 8 of the 9 datasets, corresponding to at most one misranked classifier (Fig. 5b, Supplementary Fig. 14, Table 1). We also noticed that the dataset for which the Taxometer precision ranking was mostly incorrect (negative Spearman correlation) was the Microbial Community Standard dataset, with only 303 contigs and a single sample. Thus, in the absence of ground truth labels, the precision of Taxometer predictions of classifier labels can be used to benchmark different taxonomic classifiers within a dataset.
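
A simplified sketch of this k-fold consistency evaluation is shown below. It scores held-out predictions against the classifier's own labels rather than the ground truth; a logistic regression again stands in for Taxometer's network, and the fraction of reproduced labels is a simplification of the precision metric used here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def kfold_consistency(features: np.ndarray,
                      classifier_labels: np.ndarray) -> float:
    """Train on four folds, predict the held-out fold, and compare the
    held-out predictions to the classifier's own labels (not the ground
    truth). Higher values indicate more internally consistent labels."""
    held_out = cross_val_predict(
        LogisticRegression(max_iter=1000), features, classifier_labels, cv=5
    )
    return float((held_out == classifier_labels).mean())
```
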
Fig. 5: Taxonomic profiler benchmarks and analysis of long-read datasets. (a) Description of the k-fold evaluation: Taxometer predictions are compared to the classifier annotations, not the ground truth labels. (b) An example of the numbers of true positives, false positives, and false negatives used in the k-fold evaluation, at species level for the Rhizosphere dataset with the MMseqs2 and Metabuli classifiers. The total number of contigs for the Taxometer predictions equals the number of annotations initially returned by a classifier. (c, d) K-fold evaluation of the real long-read Human Gut and Sludge datasets. Source data are provided as a Source Data file.

Table 1: Spearman correlations between classifier precisions evaluated on ground truth and Taxometer precisions evaluated with cross-validation.

Benchmarking contig annotations of long-read datasets

We then identified two PacBio HiFi read datasets of complex metagenomes: 4 samples from the human gut microbiome and 3 samples from an anaerobic digestion reactor sludge [34]. These datasets had no ground truth labels, and when we applied the GTDB- and NCBI-based classifiers, they disagreed on species-level annotations for 28%–39% of the contigs (Supplementary Fig. 15). We therefore applied the k-fold evaluation scheme described above to use the Taxometer precisions for benchmarking the classifiers. MetaMaps had the highest precision, 0.95, but also the lowest share of contigs annotated at species level (16%, while MMseqs2 annotated 49%). MMseqs2 had the second highest precision, 0.84 for the human gut dataset (Metabuli 0.68, Centrifuge 0.62, Kraken2 0.68) (Fig. 5d, Supplementary Fig. 16a). The precision and recall values for the MMseqs2 classifier were within the range of values for the CAMI2 datasets ([0.8, 0.95] for precision and [0.6, 0.85] for recall) (Supplementary Fig. 16b). This was consistent with the performance of the classifiers on the most difficult CAMI2 dataset, Rhizosphere, and we therefore concluded that MMseqs2 returned the most precise annotations for both the human gut microbiome and sludge datasets compared to the other classifiers.
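
To illustrate how the two precision rankings are compared, the following uses SciPy's Spearman correlation on two precision vectors; the numbers are hypothetical placeholders, not results from the paper.

```python
from scipy.stats import spearmanr

classifiers = ["MMseqs2", "Metabuli", "Centrifuge", "Kraken2"]
precision_vs_ground_truth = [0.90, 0.70, 0.60, 0.72]  # hypothetical values
precision_kfold_taxometer = [0.85, 0.68, 0.62, 0.70]  # hypothetical values

# rho = 1.0 means the k-fold evaluation ranks the classifiers in exactly
# the same order as the ground-truth evaluation.
rho, _ = spearmanr(precision_vs_ground_truth, precision_kfold_taxometer)
print(f"Spearman correlation between rankings: {rho:.2f}")
```
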
