SpeciateIT and vSpeciateDB: novel, fast, and accurate per sequence 16S rRNA gene taxonomic classification of vaginal microbiota | BMC Bioinformatics

Estimates of classification accuracy for novel sequences were obtained using tenfold cross-validation. To ensure confidence in assignments, SpeciateIT imposes model-specific classification error thresholds: when the posterior probability of a query sequence does not exceed the threshold, the query sequence is classified at the next highest taxonomic level at which the threshold requirement is met. In the case of novel taxa, SpeciateIT is therefore expected to assign higher-level classifications. In tenfold cross-validation testing, 98.7, 97.6, and 97.2% of sequences from “known” species (species with at least 1 sequence present in the training dataset) were correctly assigned, with > 90% of assignments made at the genus or species level (Fig. 1A). For sequences from “novel” species (those with no sequences present in the training set), 60–70% were correctly assigned to their respective taxonomic categories, with accuracy varying by targeted region. This highlights the efficacy of SpeciateIT in accurately classifying bacterial taxa using higher order Markov chain models.

Fig. 1 A Ten-fold cross validation of the vSpeciateDB V1V3, V3V4, and V4 models demonstrated exceptional classification of sequences from “Known Species” with at least 1 sequence present in models. Most sequences from “Novel Species” were correctly classified at some taxonomic level. B The posterior probabilities of query sequences from “Novel Species” tended to be higher for correct classifications relative to incorrect classifications

An essential aspect of SpeciateIT is its provision of posterior probabilities for query sequences. In the context of classification using Markov chain models, posterior probabilities represent the likelihood or confidence that a given sequence belongs to a particular category or class. These probabilities are calculated based on the observed sequence data and the parameters of the Markov chain model.
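
As a rough illustration of the two ideas described here (posterior probabilities from higher order Markov chain models, and falling back to a higher taxonomic level when a threshold is not exceeded), the following minimal Python sketch may help. It is not SpeciateIT's implementation. The function names, the toy sequences, the uniform prior, the add-one smoothing, the default threshold of 0.9, and the pooling of species-level posteriors to obtain a genus-level score are all assumptions made for illustration only.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

# Hypothetical toy training data: species -> reference sequences.
TRAINING = {
    "A_one": ["ACGTACGTACGTACGT"],
    "A_two": ["ACGTACGAACGTACGA"],
    "B_one": ["GGGGCCCCGGGGCCCC"],
}
# Hypothetical lineages, ordered from species (level 0) to genus (level 1).
LINEAGE = {
    "A_one": ["A_one", "A"],
    "A_two": ["A_two", "A"],
    "B_one": ["B_one", "B"],
}


def train_markov(seqs, order=2, pseudo=1.0):
    """Return P(base | previous `order` bases), estimated with add-`pseudo` smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + pseudo * len(ALPHABET)
        return (seen.get(base, 0.0) + pseudo) / total

    return prob


def log_likelihood(prob, seq, order=2):
    """Sum of log transition probabilities (the first `order` bases are not scored)."""
    return sum(math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq)))


def posteriors(models, seq, order=2):
    """Posterior over taxa under an assumed uniform prior (softmax of log-likelihoods)."""
    ll = {taxon: log_likelihood(p, seq, order) for taxon, p in models.items()}
    top = max(ll.values())  # subtract the maximum for numerical stability
    weights = {t: math.exp(v - top) for t, v in ll.items()}
    norm = sum(weights.values())
    return {t: w / norm for t, w in weights.items()}


def classify_with_fallback(post, lineage, thresholds, default=0.9):
    """Walk up the lineage until the best taxon's posterior exceeds its threshold."""
    n_levels = len(next(iter(lineage.values())))
    for level in range(n_levels):
        mass = defaultdict(float)
        for species, p in post.items():
            mass[lineage[species][level]] += p  # pool posterior mass per taxon (assumption)
        best = max(mass, key=mass.get)
        if mass[best] > thresholds.get(best, default):
            return best, level, mass[best]
    return "Unclassified", n_levels, 0.0
```

With these toy definitions, `posteriors({sp: train_markov(seqs) for sp, seqs in TRAINING.items()}, "ACGTACGTACGT")` returns a distribution over the three toy species, and `classify_with_fallback` reports the lowest level at which the pooled posterior exceeds the threshold for that taxon. A query split between two species of the same genus is therefore reported at the genus level, which mirrors the behaviour expected for novel taxa.
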
When a query sequence has a lower posterior probability, it suggests that the observed sequence data are less consistent with the model’s parameters. This can indicate that the sequence deviates more from the typical patterns captured by the model, potentially suggesting a poorer match between the sequence and the model. However, a lower posterior probability does not necessarily mean that the classification result is incorrect or that the sequence is not related to the modeled categories. It simply suggests lower confidence in the classification result. In some cases, a sequence with a lower posterior probability may still be correctly classified, especially if the model captures only part of the variability present in the data. For sequences from novel species (those absent from the training set), cross-validation results illustrate that the posterior probabilities of correct genus-, family-, or order-level assignments tend to be greater than those of incorrect classifications (Fig. 1B).

To compare classification of vaginal microbiota using SpeciateIT with the vagina-specific vSpeciateDB to other popular classifiers and reference sets (RDP Naïve Bayesian Classifier stand-alone Bioconda version 2.13, default settings; DADA2 implementation of the RDP Naïve Bayesian Classifier trained with vSpeciateDB, SILVA v138.1, and GTDB r86 reference sets), we classified independent sequences from GTDB (not included in the production of vSpeciateDB) truncated to each variable region and included those from the 100 most abundant species detected in the vaginal microbiota [22]. SpeciateIT with vSpeciateDB provided more species-level assignments than other methods, including the DADA2 implementation of the RDP classifier, which provided species-level assignments when possible (function: addSpecies) (Fig. 2).

Fig. 2 SpeciateIT outperforms other classification methods by providing correct species-level assignments. Dashed lines indicate total number of sequences tested

The speed of SpeciateIT results from its novel model tree-based approach, which directs query sequence classification from the top of the tree (Root) to the branch or node of its final classification (Fig. S4). Classification speed was measured on a 2021 MacBook Pro with an Apple M1 Max processor and 64 GB RAM using each amplicon reference training set sampled to 10^1–10^7 sequences and processed on one core. We compared the speed of SpeciateIT classification to the RDP Naïve Bayesian Classifier (stand-alone Bioconda version 2.13, default settings). SpeciateIT classified 1 million sequences in 3, 2, and 1 min for the V1V3, V3V4, and V4 classifiers, respectively (Fig. 3). Speed is dependent on the number of models read for each classifier (the V4 classifier represents fewer species and therefore contains fewer models). Comparatively, the RDP Classifier classified 1 million V1-V3, V3-V4, and V4 sequences in 66, 57, and 32 min, respectively, which is roughly 22 to 32 times longer.

Fig. 3 SpeciateIT is faster than the RDP classifier when datasets are greater than 1000 sequences

The performance of any classifier is entirely dependent on the quality of the sequence training set used to build it. Currently, SpeciateIT models have been built from full length 16S rRNA gene sequences curated from the Genome Taxonomy Database (GTDB) for the taxonomy-adjusted V1-V3, V3-V4, and V4 amplicon sequence regions for vaginal microbiota, and are publicly available (https://github.com/ravel-lab/speciateIT). The full-length database comprises 2224 species, 497 genera, 77 families, 36 orders, 16 classes, and 14 phyla.

One recent change in the field of vaginal microbiota is the expansion of Gardnerella vaginalis to multiple species. Eleven species are represented in the genus Bifidobacterium in the GTDB SSU rRNA reference sequence set from which vSpeciateDB sequences originated.
We chose to maintain the Gardnerella annotation for these species because of the vast clinical context surrounding Gardnerella. G. vaginalis C was not included in the final training sets because no reference sequences contained the V3 or V4 regions. Gardnerella vaginalis A and Gardnerella vaginalis F were distinct from other Gardnerella species in both the V2 and V4 regions (Fig. S5a). It was not possible to confidently distinguish other Gardnerella species at any targeted region. To maintain simplicity, one Gardnerella model (“G. vaginalis”) represents GTDB species: G. leopoldii, G. piotii, G. swidsinskii, G. vaginalis and G. vaginalis A, B, C, D, E, F, and H combined. Of other prevalent species in the vaginal microbiota, Lactobacillus iners, L. jensenii, L. mulieris, and “Ca. Lachnocurva vaginae” were distinct in vSpeciateDB while L. gasseri and L. paragasseri were not distinguishable at any region and are referred to as only L. gasseri. Notably, L. crispatus and L. acidophilus were indistinguishable at the V4 region (Fig. S5b). Because L. crispatus is arguably more prevalent in the vaginal microbiota, these models are referred to as L. crispatus.

Lastly, the VAginaL community state typE Nearest CentroId classifier (VALENCIA) uses reference centroids representing microbiota compositions for each CST. The taxonomic annotations used in building the reference centroids are integral to correct CST classification. Because vSpeciateDB-based taxonomic assignments differ from those used in the current version of VALENCIA reference centroids, we have produced reference centroids based on vSpeciateDB taxonomy and compatible with the VALENCIA algorithm for CST assignment (Fig. S6).
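
To illustrate why the taxonomic labels behind the reference centroids matter, here is a minimal Python sketch of nearest-centroid assignment on relative abundances. It is not the VALENCIA code. The similarity measure (a Yue-Clayton-style theta), the centroid names, and all abundance values are assumptions chosen for illustration, and the text above does not specify which measure VALENCIA uses.

```python
def yue_clayton_theta(p, q):
    """Similarity between two relative-abundance profiles keyed by taxon label."""
    taxa = set(p) | set(q)
    dot = sum(p.get(t, 0.0) * q.get(t, 0.0) for t in taxa)
    pp = sum(v * v for v in p.values())
    qq = sum(v * v for v in q.values())
    denom = pp + qq - dot
    return dot / denom if denom > 0 else 0.0


def assign_cst(sample, centroids):
    """Return the name and score of the centroid most similar to the sample."""
    scores = {name: yue_clayton_theta(sample, c) for name, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical centroids; the names and values are toy data, not VALENCIA's.
TOY_CENTROIDS = {
    "crispatus_dominated": {
        "Lactobacillus_crispatus": 0.85,
        "Lactobacillus_iners": 0.10,
        "Gardnerella_vaginalis": 0.05,
    },
    "iners_dominated": {
        "Lactobacillus_crispatus": 0.05,
        "Lactobacillus_iners": 0.85,
        "Gardnerella_vaginalis": 0.10,
    },
    "diverse": {
        "Gardnerella_vaginalis": 0.50,
        "Lactobacillus_iners": 0.20,
        "Ca_Lachnocurva_vaginae": 0.30,
    },
}

# A toy sample whose taxon labels match the centroid labels.
TOY_SAMPLE = {
    "Lactobacillus_crispatus": 0.90,
    "Lactobacillus_iners": 0.05,
    "Gardnerella_vaginalis": 0.05,
}
```

In this sketch, `assign_cst(TOY_SAMPLE, TOY_CENTROIDS)` selects the centroid with the highest similarity. A sample whose labels follow a different naming scheme (for example `{"L_crispatus": 1.0}`) shares no keys with any centroid and scores zero against all of them, which is the kind of mismatch that centroids rebuilt on vSpeciateDB taxonomy are meant to avoid.
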
