PathoTracker: an online analytical metagenomic platform for Klebsiella pneumoniae feature identification and outbreak alerting

Composition and construction of the databaseWe constructed PathoTracker, a K. pneumoniae database, for CMg samples to facilitate stain-level feature analysis. A total of 341 K. pneumoniae from the China CRE-Network research23,24 were sequenced using 150 bp paired-end Illumina NovaSeq and PacBio RS II or Nanopore and included in this database. The reads were trimmed using Trimmomatic v0.3940 with “ILLUMINACLIP: TruSeq3-SE:2:30:10, SLIDINGWINDOW:4:20, and MINLEN:50” and inspected using FastQC v0.11.5 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Hybrid assembly was conducted for all isolates using Unicycler v0.4.741 with the ‘normal’ mode. All genomes were annotated using Prokka v1.13.742, the K serotype, MLST, ARGs and virulence genes were analysed using Kleborate v0.3.035,43. For all newly-sequenced isolates, we measured the MIC of 23 antimicrobials via broth microdilution, namely MEM, IMP, ertapenem, cefoxitin, cefepime, ceftazidime, ceftazidime/avibactam, ceftazidime/clavulanic acid, cefotaxime, ceftriaxone, cefoperazone/sulbactam, piperacillin/tazobactam, amikacin, ciprofloxacin, levofloxacin, meromycin, colistin, TGC, chloramphenicol, fosfomycin, aztreonam, aztreonam/avibactam, and cefotaxime/clavulanic acid (Fig. 1a).We downloaded all K. pneumoniae genomes from the RefSeq public database and selected 847 isolates with long-read sequencing data included in PathoTracker (as of June 2022, largest_contig > 5 M bp, whole genome, including plasmids). Among them, data from the public database that are traceable back to the location and time of submission were marked and sourced from the NCBI summary file (ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS). If only the year of the strain collection was available, 1 July was used. The database is listed in Supplementary Data 1. Among them, 246 ST11-type isolates were selected and included in the Chinese ST11 PathoTracker database targeted ST11 isolates (Supplementary Data 2).Public antibiotic resistance (ResFinder44) and virulence gene (VFDB45) databases were downloaded for comparison and output together in the results.Pre-phylogenetic analysis and multilevel database architectureTo divide the database based on evolutionary relationships and facilitate rapid subsequent comparisons, the inferred maximum likelihood phylogenetic tree of the entire database was constructed using IQ-TREE v2.2.022 with “-b 100 -m GTR + G + I” on a core genome sequence alignment generated by Roary v1.7.821. Treecluster46 was employed based on the phylogenetic tree to divide isolates in the database into different “clusters”, each containing a certain number of similar isolates. The two largest evolutionary branches (capsule genotypes), KL47 and KL64, of the important sequence type ST11 in China were used to verify the cluster classification. The sum_branch method was used to divide the phylogenetic tree into an initial cluster boundary with a length threshold of 0.015. KL47 and KL64 in ST11 could not be fully distinguished at this threshold (Fig. 2a); therefore, to avoid excessive isolates in clusters influencing the accuracy, a threshold of 0.005 was used to split 0.015 clusters with more than 150 isolates (Supplementary Data 3c). After splitting, the ST-KL type was largely replaced (Fig. 2a) by 93 clusters (Supplementary Data 22, Fig. 4 and S1). The same procedure was applied to the Chinese ST11-type database, with the threshold set at 0.0001 (Fig. 2b, Supplementary Data 3b). The selected cluster was listed in the cluster ring in the phylogenetic tree (Fig. 2).At the same time, the prevalence of the most common Carbapenemase genes carried by the isolates in China, namely blaNDM and blaKPC23, is shown in Fig. 2. The specific region in China where the Chinese ST11 strain originated was added to the annotation of the evolutionary tree, specifically, the middle ring in Fig. 2b.For each cluster, Roary was used for pan-genome extraction, targeting the low data volumes of CMg samples ( ≤ 20 reads). Here, we created a specific “gene-isolate-cluster” index, which records the names of the genes contained in each isolate and the isolates in each cluster, storing the same gene only once and the relationships between these three levels to reduce the data while preserving the diverse information of all isolates (Fig. 1a). In the database, the pan-genome included 49,194 genes, while the core genome included 3157 genes. For the Chinese ST11-type database, the pan-genome included 22,004 genes, and the core genome included 4045 genes (Fig. S8).PathoTracker comparisonTo detect strain features in the sequence, the newly sequenced data were sequentially compared with genes in the pan-genomes of all clusters using BLASTn v2.9.047, with the parameter -outfmt ‘7 qseqid qlen sseqid slen pident length mismatch gaps qstart qend sstart send evalue bitscore qcovs qcovus’. The gene presence, coverage, and identity were examined using the BLAST results (Fig. 1b). As reads sequenced by Nanopore were sufficiently long, we limited the coverage to > 80%.First, we selected the cluster closest to the sequence to be tested. For each cluster, cluster similarity, \({S}_{{cluster}}\), was calculated and sorted in descending order (Fig. 1b):\({Co}{v}_{{read}}\) is the coverage of a read/contig to the gene; \({identit}{y}_{{read}}\) is the identity of a read/contig to the gene; \(n\) is the total number of reads/contigs aligned to the same gene; \({i}\) is the i-read/contig out of \(n\) reads/contigs that are aligned to the gene; \({{Cov}}_{{read},i}\) is the \({Co}{v}_{{read}}\) of i-read/contig; \({I}_{{gene}}\) represents the highest identity when the coverage was maximized for all the read/contig alignments to the gene, which is \({I}_{{gene}}={\max }_{0\le i\le n}({{Identity}}_{{read},i})\) when \({\max }_{0\le i\le n}({{Cov}}_{{read},i})\); \({m}\) is the number of matched genes in the cluster; \(j\) is the index over \(m\) matched genes within the cluster; and \({S}_{{cluster}}=\frac{{\sum }_{j=1}^{m}{I}_{{gene},j}}{m}\).Within the selected cluster, specific isolates closest to the sequence to be tested were selected using\(\,{I}_{{isolate}}\):First we calculated read/contig similarity (\({S}_{{read}}\)): \({Co}{v}_{{read}}\) is the coverage of a read/contig to the gene; \({identit}{y}_{{read}}\) is the identity of a read/contig to the gene; and \({S}_{{read}}={Co}{v}_{{gene}}* {identit}{y}_{{read}}\) is the similarity of a read/contig to the gene.Then, we calculated sample similarity to gene (\({S}_{{gene}}\)): \(n\) is the total number of reads/contigs aligned to the same gene; \({i}\) is the i-read/contig out of \(n\) reads/contigs that aligned to the gene; \({S}_{{read},i}\) is the \({S}_{{read}}\) of i-read/contig; and \({S}_{{gene}}={\max }_{0\le i\le n}({S}_{{read},i})\) is the sample similarity to the gene, that is, the maximum similarity to the gene among all alignments.We obtained \({I}_{{isolate}}\) using \({S}_{{gene}}\): \(q\) is the total number of aligned genes per isolate; \({p}\) is \(p\)-gene over \(q\) aligned genes within the isolate; \({S}_{{gene},p}\) is the \({S}_{{gene}}\) of p-gene; and \({I}_{{isolate}}={\sum }_{p=1}^{q}{S}_{{gene},p}\).Subsequently, we performed isolate sorting and final selection:For CMg samples, all filtered isolates were sorted using \({S}_{{isolate}}=\frac{{I}_{{isolate}}}{k}\): \(k\) is the number of genes in the pan-genome of each isolate. We divided \({I}_{{isolate}}\) by \({{{\rm{k}}}}\) to mitigate the impact of differences in the uncertain number of CMg reads/contigs and different numbers of pan-genomic genes in each cluster, which affected the results.For the NGS isolates, all filtered isolates were sorted using \({S}_{{isolate}}=\frac{{I}_{{isolate}}}{q}\): \({{{\rm{q}}}}\) is the number of aligned genes per isolate, as the effect of the low data volume or low specificity of the CMg data can be ignored. The top 10 \({S}_{{isolate}}\) from the selected clusters were extracted and sorted as the final result (Fig. 1b).As K. pneumoniae can easily acquire new ARGs and virulence genes, BLASTn was also used to compare the samples to be tested with the ResFinder and VFDB databases and output them together as part of the final results. The above steps were implemented using Python v3.6 and Shell.Phylogenetic analysis and species detection in the websiteWe provided user-friendly phylogenetic analysis services for the assembled data in FASTA format. A phylogenetic tree was constructed via core genome analysis using Roary and IQ-TREE. We also provided a species detection service for Nanopore sequencing data. Taxonomy was assigned using Centrifuge48. Upon completion of the run, the above results and intermediate files are emailed to the users.RAM monitoringRAM was monitored once per second using the shell command “ps – O sz, rsz, vsz” in CentOS 7.9. In this study, BAL270, BAL259, and BAL023 with 8, 1095, and 38,550 reads, respectively, and their corresponding isolates were selected to examine RAM.Website composition and architecturePathoTracker was implemented on the UNIX operating system using Django v3.2.18 as the framework and Apache Tomcat as the server. A web user interface was developed using Atlantis Lite (https://github.com/themekita/Atlantis-Lite), JavaScript, HTML5, and CSS3.AST and carbapenem resistance identificationThe MIC of commonly used antimicrobials was determined using the agar dilution method as described in the CLSI instructions49. Escherichia coli ATCC® 25922 and Pseudomonas aeruginosa ATCC® 27853 were used as quality control isolates.Validation for PathoTracker and CMg sample sourceThe CMg samples and their data after 1 h of Nanopore sequencing were obtained from a large Nanopore CMg study in China8. Human reads were removed via mapping to hg38 using Minimap2 v2.2150 to increase the pathogen detection accuracy. Centrifuge v 1.0.3-beta48 with “–met 5 –time –ignore-quals -S result.tsv –report-file filtlong-report.tsv -x database -U rmh.fq.gz” parameters was used to detect K. pneumoniae. A total of 29 CMg samples were obtained and analysed using PathoTracker (Fig. 1c, Supplementary Data 5). To validate the PathoTracker results, 29 K. pneumoniae strains isolated from CMg samples were sequenced using the 150 bp paired-end Illumina NovaSeq platform (Supplementary Data 5 and 8).Using the total number of clusters (93) as the limit, a random number was selected within each cluster as the test ID in cluster, following which seven random numbers were selected from the pool of 1187 as additional test isolate IDs. This yielded 100 isolates for the bootstrap-sampling test (Fig. 1c, Supplementary Data 3a and 4).Criteria for detection of ST-KL type and AMR and virulence genesCarbapenemase gene detection was based on ARG (blaKPC-2)23. The presence or absence of virulence genes (iucABCD, iroBCD, and rmpA2) on the classical virulence plasmid pLVPK was used to determine whether the strain contained the virulence plasmid26. The top 10 isolates from PathoTracker obtained from the comparison step closest to the sample were included in the adjudication. If < 5 isolates ( < 50%) presented AMR genes (carbapenemase genes), but at least one did, the strain might be carbapenem-resistant; if >5 isolates ( > 50%) had AMR genes, the sequence was confirmed to be carbapenem-resistant or resistant to MEM and IMP. The detection of virulence genes and ST-KL type followed the same criteria.Statistics and reproducibilityStatistical analyses were performed using stats R v4.1.0 and ggplot2 v3.3.651. Statistical analyses of two groups were performed using welch two sample t-test. The specific information about the isolates included in this study’s database can be found in Supplementary Data 1 and 2 as previously described.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles