An open-access dashboard to interrogate the genetic diversity of Mycobacterium tuberculosis clinical isolates

Construction of a genetic diversity dataset from M. tuberculosis clinical isolates

To compile a robust dataset for comprehensive exploration of genetic diversity, we aggregated previously deposited whole-genome sequences from clinical isolates of M. tuberculosis and consolidated them into an accessible dashboard. This dataset encompasses 51,183 genomic sequences obtained from TB clinical isolates derived from infected patients. Our subsequent analysis prioritised non-synonymous mutations, indels, and genomic deletions, facilitating an in-depth meta-analysis of genetic variation across every protein encoded by M. tuberculosis. We focused only on protein-coding genes because the intended application of this dataset was to allow investigation of target-based non-synonymous changes in circulating clinical isolates of M. tuberculosis. Consequently, genetic differences in ribosomal RNA genes, including rrs and rrl, which are linked to resistance to streptomycin and linezolid, respectively, are not included in this analysis. A total of 1,063,811 non-synonymous changes across 694,579 sites were observed in 4029 protein-coding genes. We identified an average of 173 changes per gene and, when normalised to protein length, an average of 49% of sites contained a polymorphism. Conservation at individual sites was generally high, with a mean gene conservation value of 99.93%. We also isolated a smaller representative dataset of 5844 samples that better reflected the underlying genetic diversity. Compared to this representative dataset, our full dataset had a similar distribution of drug susceptibility (Fig. 1A). As expected, for the full dataset, samples were mainly deposited by countries with lower TB incidence but higher whole-genome sequencing capacity (Fig. 1B). Indeed, most of the underrepresentation was in high-burden countries in the global south, especially in Asia (China, the Philippines, India, Indonesia).
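The per-gene summary statistics reported above (fraction of polymorphic sites and mean conservation) can be illustrated with a minimal sketch. The exact definitions used for the dashboard are not stated here, so the code assumes that a site is polymorphic if at least one isolate carries a non-synonymous change, and that site conservation is the share of isolates matching the reference residue. The `GeneVariation` record, its field names and the toy values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GeneVariation:
    """Hypothetical per-gene record; field names are illustrative only."""
    protein_length: int             # number of amino acid positions
    variant_counts: Dict[int, int]  # position -> isolates with a non-synonymous change


def polymorphic_fraction(gene: GeneVariation) -> float:
    """Fraction of positions with at least one non-synonymous change (assumed definition)."""
    polymorphic = sum(1 for n in gene.variant_counts.values() if n > 0)
    return polymorphic / gene.protein_length


def mean_conservation(gene: GeneVariation, n_isolates: int) -> float:
    """Mean share of isolates matching the reference, averaged over all positions.

    Positions absent from variant_counts are treated as fully conserved.
    """
    changed = sum(gene.variant_counts.values())
    return 1.0 - changed / (gene.protein_length * n_isolates)


# Toy gene: 4 residues, 10 isolates, changes at two positions.
toy = GeneVariation(protein_length=4, variant_counts={2: 1, 3: 5})
print(polymorphic_fraction(toy))
print(mean_conservation(toy, n_isolates=10))
```

Under these assumed definitions, a gene can have a large fraction of polymorphic sites and still show very high mean conservation, provided each change is carried by only a few isolates.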
Our full dataset was slightly over-represented for lineage 4 samples and underrepresented for lineage 2 as well as lineages 5–9 (Fig. 1C). Whilst accurate metadata on the date of collection were not available for most of the clinical samples, all the sample data collated in our dataset were deposited between 2010 and 2023 (Fig. 1D). This final dataset represents a comprehensive catalogue of genetic variance from clinical isolates for every protein in M. tuberculosis.

Fig. 1 Overview of the data collection and analysis. Comparison of actual or predicted drug responsiveness for (A) the total dataset (outer ring) compared to the representative sample set (inner ring) and (B) based on the originating country for the total dataset. (C) Lineage of the total dataset (outer ring) compared with the representative dataset (inner ring). (D) Histogram of the date when each clinical isolate was deposited.

To make this extensive resource more accessible to the TB drug discovery research community, we established a user-friendly interface for data interrogation (https://www.lshtm.ac.uk/research/centres-projects-groups/satellite-centre-for-global-health-discovery#genetic-diversity). This dashboard can be used to investigate the genetic variance of any protein of interest in M. tuberculosis whilst also providing metrics such as conservation between closely related Mycobacterium species.

Genetic diversity and species conservation of genes correlate with gene vulnerability

To compare essentiality and conservation on a global scale, we used our extensive database to compare genetic variance among clinical isolates with gene vulnerability scores, a measure of gene essentiality, identified through a genome-wide CRISPRi-mediated essentiality screen [8] (Fig. 2A,B). These relationships were investigated separately for genes classed as essential and non-essential (based on the CRISPRi-mediated essentiality screen [8]) as these groups formed distinct clusters within the feature space.
Firstly, this showed a statistically significant difference in the level of genetic variance of coding regions between genes classed as essential and those classed as non-essential (Fig. 2A). Secondly, essential genes were more likely to have a higher number of conserved positions than non-essential genes. Indeed, there was a clear correlation between the vulnerability score, which uses Bayesian modelling to quantify the vulnerability of each gene, and the genetic variance between clinical isolates (Fig. 2B).

Fig. 2 Genome-wide comparison of gene vulnerability, genetic diversity and species conservation. (A) Histogram of genes per percentage of polymorphic positions. Inset shows a direct comparison between genes predicted to be essential versus non-essential. (B) Comparison of genetic diversity (% of positions that are completely conserved amongst all isolates) and gene vulnerability score [8]. (C) Comparison of genetic diversity and species conservation (between Mycobacterium species). (D) Comparison of gene vulnerability score and species conservation. Genes are colour-coded based on essentiality.

Aside from genetic variance within a species, there is also evidence that genetic conservation between bacterial species is directly linked to gene essentiality [4]. The genus Mycobacterium contains over 190 species, with the best-known members including M. tuberculosis and M. leprae, the causative agent of leprosy. Members of the genus are identified by their waxy, lipid-rich cell walls containing mycolic acids. Within our dashboard, we have included the amino acid sequences of six species of this genus including M. abscessus, M. marinum and Mycolicibacterium smegmatis, together containing more than 24,000 protein sequences. Species conservation was then scored based on the average sequence identity between M. tuberculosis and the other species.
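The species conservation score described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used for the dashboard: it assumes each comparison protein has already been aligned to its M. tuberculosis counterpart (gaps written as "-"), that identity is calculated over columns where the M. tuberculosis sequence has a residue, and that every species contributes one alignment. The function names and toy sequences are hypothetical.

```python
from typing import Iterable, Tuple


def percent_identity(aligned_ref: str, aligned_other: str) -> float:
    """Percent identity over aligned columns where the reference is not a gap."""
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_ref, aligned_other) if a != "-"]
    if not columns:
        raise ValueError("reference sequence contains no residues")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def species_conservation(aligned_pairs: Iterable[Tuple[str, str]]) -> float:
    """Average percent identity across (reference, other species) alignments."""
    scores = [percent_identity(ref, other) for ref, other in aligned_pairs]
    if not scores:
        raise ValueError("no alignments supplied")
    return sum(scores) / len(scores)


# Toy alignments of one reference protein against three other species.
pairs = [
    ("MKTAY", "MKTAY"),    # identical
    ("MKTAY", "MRTAF"),    # two substitutions
    ("MK-TAY", "MKQTAY"),  # insertion in the other species, ignored here
]
print(round(species_conservation(pairs), 2))
```

How missing orthologues and gapped columns are handled changes the score, so both choices here should be read as assumptions.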
As with essentiality and vulnerability, there was a statistically significant association between the genetic variance among clinical isolates and species conservation when analysed by ordinary least squares linear regression (Fig. 2C). This supports previous work suggesting that essential genes are more conserved between bacterial species than non-essential genes. Finally, comparison of species conservation and the vulnerability scores also revealed a statistically relevant correlation (Fig. 2D; Supplemental Table S1). Taken together, these data suggest a relationship between genetic conservation, both among clinical isolates and between species, and the vulnerability and essentiality of M. tuberculosis proteins (Supplemental Table S1).

Identifying inherent drug resistance in new antitubercular drug targets

The main application of this genetic diversity dataset is to provide a baseline of genetic variance for any drug target of new or future clinical compounds, against which future population dynamics can be measured. This would both identify inherent resistance within the population and indicate whether target-based mutations observed in lab-generated resistant strains could be viable in clinical isolates. As a demonstration of the utility of our dashboard, we focused on the respective drug targets of four compounds undergoing phase II clinical trials for the treatment of TB: (i) SQ109, a 1,2-ethylenediamine that targets the mycolic acid transporter MmpL3 [9]; (ii) GSK070, an oxaborole derivative that inhibits leucyl-tRNA synthetase (LeuS) [10]; (iii) BTZ-043, a benzothiazinone shown to inhibit DprE1 [11]; and (iv) Q203 (Telacebec), an imidazopyridine amide known to target the cytochrome bc1 complex, specifically QcrB [12].

MmpL3 (Rv0206c) belongs to the Resistance, Nodulation and Division (RND) superfamily and transports trehalose monomycolate for cell wall biogenesis.
While SQ109 is the most advanced compound to target MmpL3, multiple classes of compounds have been shown to inhibit this promiscuous drug target. During the development of these compounds, 136 unique amino acid changes have been identified at 83 different positions within the MmpL3 protein from a range of Mycobacterium species, predominantly M. tuberculosis [13]. While many of these mutations have not been associated directly with SQ109 resistance, it is conceivable that several will lead to cross-resistance. We analysed our genetic diversity dataset to identify mutations in the MmpL3 coding sequence (Fig. 3A). SQ109 is predicted to interact with transmembrane domains (TMs) 4–5 and 10–11 (236–300 and 625–688 aa) of MmpL3; however, mutations have been identified across the whole protein [13]. Two mutations unconnected to drug resistance, F384I and D466E, were prominent in our dataset, and further investigation revealed that they occurred almost exclusively in samples from lineage 6 and animal-associated lineages. This suggests that the mutations originated from single acquisition events and likely evolved neutrally. We next compared the genetic diversity of MmpL3 with in vitro lab-generated mutations known to provide resistance to inhibitors of this target [13]. Our analysis identified genetic variance at 10 amino acid positions that can also harbour resistance-conferring mutations (Table 1; Fig. 3A). Two amino acid positions were particularly enriched: T284A occurred in 17 isolates from L4.5, originating mostly in China and Vietnam, and T286M occurred in 8 isolates from L3, mostly of unknown origin, as well as two isolates from the United Kingdom. Their presence in the absence of selective pressure would suggest that these mutations have minimal impact on bacterial growth and could be selected under drug pressure.

Fig. 3 Genetic diversity of next-generation drug targets in current development. Gene-wide genetic variation in (A) MmpL3, (B) LeuS, (C) DprE1 and (D) QcrB. See Supplemental Figs. S1 and S2 for sequence alignment and predicted drug binding regions for LeuS and DprE1. Insets include regions predicted to interact with inhibitors and where mutations have been identified from lab-adapted resistant strains.

Table 1 Genetic diversity of clinical isolates compared with known resistance-conferring mutations.

The oxaborole derivative GSK070 is the most advanced of several chemical series targeting LeuS (Rv0041). The compounds are predicted to bind within the editing domain, and multiple target-based mutations have been identified in in vitro experiments, predominantly in the related pathogen M. abscessus [14,15,16,17,18,19,20] (Fig. 3B; Supplemental Fig. S1; Supplemental Table S2). The most prominent genetic variations, not connected to drug resistance, were seen at positions P54 (L4.2 associated) and R403 (L2.2 associated). These mutations occurred mostly in a single clade and thus probably originated in the absence of selective pressure. In terms of drug resistance, two mutations identified in lab-adapted M. abscessus resistant strains, V468L and K502E (equivalent to V482 and K516 in M. tuberculosis), were observed in our genetic diversity dataset (Table 1). DprE1 (Rv3790) is also a promiscuous target in TB drug discovery, with multiple compounds inhibiting it, including BTZ-043. Indeed, a myriad of mutations providing resistance to DprE1 inhibitors have been identified from lab-adapted resistant strains, largely associated with the compound binding region [11,21,22,23,24,25,26] (Supplemental Fig. S2; Supplemental Table S3). A significant enrichment of A356T was observed, associated with isolates from L1.2.1.2 (Fig. 3C); however, no genetic variance was observed that correlated with known resistance-conferring mutations.
Finally, the cytochrome bc1 complex, specifically the QcrB subunit (Rv2196), is the target of multiple compounds including Q203, and several resistance-conferring mutations have been identified in the Qp (or Qo) site where the compounds are predicted to bind [12,27,28,29,30,31,32,33,34,35,36] (Supplemental Table S4). Whilst there was no major enrichment of genetic variance in this target, two known resistance-conferring mutations, T313A and M342V, were observed (Table 1; Fig. 3D; Supplemental Table S4). It is worth noting that, for all of the drug targets discussed here, the genetic variance was noticeably lower in the predicted drug binding sites than in the surrounding protein sequence.
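The comparison underlying Table 1, between positions varied in clinical isolates and positions where lab-generated resistance mutations are known, can be sketched as a set intersection. The helper names below are hypothetical and this is not the dashboard's implementation. The MmpL3 example uses only the four substitutions and two resistance-associated positions named in the text, not the full ten positions of Table 1. The LeuS example converts M. abscessus numbering to M. tuberculosis numbering using the two equivalences given in the text; in general such a map would have to come from a sequence alignment.

```python
import re
from typing import Dict, Iterable, Set

# Matches labels such as "T284A": reference residue, position, new residue.
_LABEL = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def position_of(label: str) -> int:
    """Extract the amino acid position from a substitution label."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label!r}")
    return int(match.group(2))


def to_reference_numbering(positions: Iterable[int], mapping: Dict[int, int]) -> Set[int]:
    """Translate positions into M. tuberculosis numbering; unmapped positions raise KeyError."""
    return {mapping[p] for p in positions}


def shared_positions(clinical_labels: Iterable[str], resistance_positions: Iterable[int]) -> Set[int]:
    """Positions varied in clinical isolates that are also resistance-associated."""
    return {position_of(label) for label in clinical_labels} & set(resistance_positions)


# MmpL3: substitutions named in the text; 284 and 286 are the two enriched
# resistance-associated positions, F384I and D466E are unconnected to resistance.
mmpl3_clinical = ["T284A", "T286M", "F384I", "D466E"]
mmpl3_resistance_positions = {284, 286}
mmpl3_shared = shared_positions(mmpl3_clinical, mmpl3_resistance_positions)
print(sorted(mmpl3_shared))  # [284, 286]

# LeuS: lab-generated M. abscessus mutations mapped to M. tuberculosis numbering.
leus_abscessus = ["V468L", "K502E"]
abscessus_to_mtb = {468: 482, 502: 516}  # equivalences stated in the text
leus_mtb_positions = to_reference_numbering(
    (position_of(label) for label in leus_abscessus), abscessus_to_mtb
)
print(sorted(leus_mtb_positions))  # [482, 516]
```

Matching on position rather than on the exact substitution mirrors the phrasing above, where variance is reported at positions that can also carry resistance-conferring mutations; matching on the full label would be a stricter alternative.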
