refMLST: reference-based multilocus sequence typing enables universal bacterial typing | BMC Bioinformatics

Below, we detail the various steps required for the implementation of refMLST from reference loci identification, typing against a query genome, and combining typing data across genomes to identify outbreak clusters. An overview of this process is provided in (Fig. 1). The query genome for refMLST is a genome assembly in FASTA format which may be generated from whole genome or metagenomic sequencing; however, refMLST requires that the genome is high quality (e.g. if a mixed isolate was sequenced, imperfect taxonomic binning, or strain duplication). refMLST automatically runs BUSCO (v5.3.2) on query genomes to assess completeness and contamination, and will only process those with greater than 90% unfragmented single copy orthologs in accordance with MIMAG standards [13, 14]. All other genomes are flagged for review.Fig. 1Overview of refMLST workflow. The input/outputs are represented by parallelograms while rectangles reflect processes and diamonds represent decision pointsUnlike multilocus sequence typing, cgMLST and wgMLST, refMLST does not require a curated scheme of alleles—it only requires a single reference genome for the pathogen under study, facilitating analysis of any bacterial species. The reference genome in Genbank [15] format is used to perform locus (CDS) identification and extraction from the query genome; this file may be sourced from NCBI for any available genome. For novel pathogens or for use of a reference genome with greater sequence homology to outbreak strains, this file can be annotated from sequenced reference (any sample from outbreaks) using standard genome annotation pipelines. The reference file is parsed to extract coding sequences (CDS), where each CDS represents a reference locus. While reference genomes of closely related species could be used with refMLST if no reference is available for the same species, we limit our analysis below to the use of reference genomes within the same species, preserving the number of CDS shared between query and reference genomes.Loci on mobile genetic elements may have a different evolutionary history from the bacterial chromosome, comprise a significant component of the accessory genome, contribute to reference bias, and ultimately confound analysis; they are therefore excluded from analysis based on annotation in the Genbank file. As outbreaks of antimicrobial resistance or other important phenotypes may be caused by plasmids, users are encouraged to combine the results of refMLST for host bacterial typing with dedicated plasmid typing tools for a holistic investigation of outbreaks [16]. Additionally, refMLST omits loci less than 200 base pairs in length in accordance with other gene-by-gene analysis tools [17].To call alleles in the assembled genome of interest, locus sequences from above are rapidly aligned against the query genome assembly using minimap2 (v2.24) with the following command [18]:
minimap2 -a –MD -N 1 –end-bonus 5 query_assembly.fna cds_sequences.fna
Minimap2 can align nucleic acid sequences with up to 20% sequence divergence, enabling identification of genes beyond the expected species boundary of 95% absolute nucleotide identity. We opted not to use protein aligners such as DIAMOND [19] or MMSeqs2 [20] as the former requires the query assembly to be translated into amino acid sequence, while the latter cannot account for frameshifts. Alignments of loci against the genome assembly are filtered for multiple hits to exclude paralogous loci. To ensure quality locus alignments, reference loci must align end to end, and newly discovered alleles must be within 20% of the size of the reference allele. Next, a CRC32 hash [21] is taken of the allele sequence in the query genome to convert it into a compact, easy to understand number.Evolutionary allele distances between isolates are calculated with a Hamming distance, ignoring missing alleles. The aforementioned genome quality control ensures that small inter-genomic distances are not a result of missing loci in either genome when performing pairwise comparisons.Cluster addresses are calculated by iterating through query genomes in the order at which they were sequenced/submitted, performing single-linkage hierarchical clustering. Reproducible cluster addresses reflecting inter-sample allele distances are generated; description of similar cluster addresses has previously been published [11, 22]. In brief, cluster addresses are composed of seven digits (“allele addresses”), where each digit reflects a distance threshold (1000, 200, 100, 50, 20, 10, 5 alleles). Addresses are read from left to right, and genomes with greater genomic similarity will share more cluster address digits in common. An example of cluster addresses and their utility is provided in (Table 1). Unlike other address generation mechanisms, and because typing of each genome is deterministic with refMLST, cluster addresses are deterministic and stable, and therefore do not need matching across real-time analyses as an outbreak grows [11, 23]. All samples are output in a distance matrix and line list with cluster codes.Table 1 Example of refMLST cluster address outputTo assess refMLST performance against other gene-by-gene approaches, namely the cgMLST and the wgMLST, we applied refMLST on data collected from different bacterial outbreaks caused by S. enterica [11], Y. enterocolitica [24], and C. jejuni [25]. For Y. enterocolitica, and C. jejuni, genomic data for 331 and 6526 isolates, respectively, were recovered from public databases with their corresponding whole and core allelic schemes [24, 25]. For S. enterica, 1263 isolates with publicly available sequencing data were collected with their corresponding core allelic schema for the cgMLST analysis and results were compared to those generated by chewieSnake (v3.0.0), a recently published, hash-based cgMLST tool [11]. Meanwhile, for the wgMLST based comparison, genome assemblies recovered from [11] were processed through the chewBBACA (v3.3.0) standard workflow to generate the required whole genome schema and the allelic profile. Briefly, all the 1263 genomes were used to create the schema and the allele call using the chewBBACA CreateSchema.py script with default settings. The quality of the loci has been assessed using the schema evaluation script from the chewBBACA suit and paralogous genes were removed alongside loci with single alleles. Cgmlst-dists (v0.4.0) has been used to generate the final wgMLST distance matrix [26].

Hot Topics

Related Articles