Core and accessory genomic traits of Vibrio cholerae O1 drive lineage transmission and disease severity

From 2015 to 2021 in Bangladesh, a diverse array of genetic variations characterises the emergence of distinct circulating lineagesTo explore the evolutionary dynamics of V. cholerae linked to cholera cases in Bangladesh, a genomic analysis was done considering the years 2015 to 2021. We sequenced 129 V. cholerae O1 El Tor isolates taken from stool samples of patients between September 2015 and April 2021 admitted to hospitals in six districts (Barisal, Chittagong, Dhaka, Khulna, Rajshahi and Sylhet) of Bangladesh, Supplementary Data 1. Over the duration of this study, isolates belonging to serotypes Inaba and Ogawa were identified, Fig. 1. Consistent with previous studies10,15, a serotype switch was observed, with Inaba predominantly present in 2016 and 2017, followed by a predominance of Ogawa samples in 2018 and 2019 (Fig. S1). Both serotypes were detected in 2015 and continued to coexist from 2020 onwards. Serotypes were significantly associated with collection years (chi-square test with p-value Bonferroni < 0.005) but not significantly associated with collection location (chi-square test with p-value Bonferroni > 0.005).Fig. 1: Maximum likelihood phylogenetic tree of the whole cohort based on the core genome of 129 isolates cultured from in-patients admitted to hospitals in six districts (Barisal, Chittagong, Dhaka, Khulna, Rajshahi and Sylhet) of Bangladesh.The two distinct BD-1.2 and BD-2 lineages are shown in the inner ring. The outer rings display serotypes, year of collection, presence of variants within the Vibrio pathogenic island 2 (VPI-2), Vibrio seventh pandemic island II (VSP-II) and phage-inducible chromosomal island-like elements 1 and 2 (PLE) and region of collection. A map of Bangladesh132 showing the proportion of samples collected from each regional division is also shown.The maximum likelihood phylogeny of the 129 isolates was reconstructed based on the alignment of the core genome (3468 genes) and showed two distinctly evolved lineages, Fig. 1. Comparison with previous studies9,10, identified these lineages as BD-1.2 (n = 84) and BD-2 (n = 45), Fig. S2. Apart from the previously reported genetic variations9,10, we identified additional differences existing between the two lineages in VSP-II, Vibrio pathogenic island 2 (VPI-2) and PLE, see Fig. 1. More precisely, in VSP-II, BD-2 isolates had a tryptophan at position 249, while BD-1.2 had a leucine at this position. In addition, in VSP-II, gene VC-514 (aer) was present in all BD-2 isolates but absent in BD-1.2. In VPI-2 a SNP led to an amino-acid variation at position 150, with BD-1.2 having an aspartic acid, and BD-2 an asparagine. BD-2 samples exclusively exhibited PLE2, while BD-1.2 samples had both PLE1 and PLE2 along with PLE2. Moreover, further differences were found in nonsynonymous SNPs on core genes and presence/absence of accessory genes, as described in the following section.The distinct phylogeny patterns of BD-2 and BD-1.2, were also confirmed through a comparative study analysing 1134 isolates from V. cholerae El Tor O1 strains across 84 countries, including our isolates, (Supplementary Data 2, 3 and Fig. S3). BD-2 isolates clustered with Indian-1 (IND-1), while BD-1, BD-1.1, and BD-1.2 isolates from Bangladesh clustered with African (T9-T13)16, Latin America-3 (LAT-3)13, Asian-2 (AS-2), and Indian-2 (IND-2) lineages (Fig. S3), in agreement with previous results10.Genetic and temporal differentiation of V. cholerae BD-1.2 and BD-2 lineages correlate with SNPs in coding and non-coding regions, and accessory genesTo assess the relatedness of V. cholerae isolates in our cohort, we measured the number of different core genome SNPs in a pairwise manner across all isolates. We created a network based on clusters of related isolates with less than 15 SNPs, as done previously17,18. Across the cohort the median SNP difference was 117 SNPs (ranging from 0 to 1710 SNPs with IQR of 1211). The resulting undirected graph (Fig. 2) revealed that BD-2 and BD-1.2 formed two disconnected graphs each composed of samples from a specific lineage, but with no distinct separations between the Ogawa and Inaba serotypes.Fig. 2: SNP network analysis of highly connected isolates.Network diagram showing pairwise connections between isolates in our cohort with less than 15 pairwise SNP differences. The panels show the same network with the nodes colour-coded according to (A) lineages, (B) year of collection, (C) serotypes and (D) location of collection. The lines between pairs of isolates are colour-coded by the number of SNPs.To identify additional potential involvement of genetic elements in shaping the differences between the BD-1.2 and BD-2 isolates in our cohort, beyond current annotations (ctxB allele, type of SXT/ICE, VSP-II, VPI-I, gyrA gene allele)10, we looked for patterns of similarities and differences, at a finer scale, searching for the number, type and position of accessory genes as well as mutations in the core genome and intergenic regions across all the isolates. A two-sided Fisher exact test, with Bonferroni correction, was performed to assess the relationship between the BD-2 and BD-1.2 lineages and each of the various genomic features (core and intergenic SNPs and accessory genes). Overall, we found a significantly larger proportion of core genome mutations (51.4%, 1224 core genome SNPs and 73.1%, 160 intergenic SNPs) and a small proportion of accessory genes (11.3%, 115 genes) that exhibited statistically significant differentiation between the two lineages, Supplementary Data 4. Refer to Supplementary Note 1 and Fig. S4 for more details on the statistical analysis comparing the number of accessory genes, core genome SNPs and intergenic SNPs. The comparative analysis also indicated a temporal shift in the distribution of core genome and intergenic SNPs over the years, showing that BD-1.2 isolates accumulated different SNPs compared to BD-2 isolates as time progressed (Fig. S4E, F).Out of the 115 accessory genes that differed between the two lineages, 12 were annotated while the remaining 101 were hypothetical. Among these 12 annotated genes, five (lon_3, endA, adh, hdfR_4 and bcr_2) were predominant (over 96% presence) in BD-1.2 and absent in BD-2, and seven (aer_3, hlyA_2, mcrC, mepM_3, mrr, tetA and tetR) were present (over 97% presence) in BD-2 and absent in BD-1.2. Of the twelve annotated genes, three are known to be antimicrobial resistance genes (bcr, tetA and tetR)19. TetA and tetR were mainly detected in BD-2 isolates (97.7%), confirmed as primarily tetracycline-resistant through susceptibility testing in both doxycycline and tetracycline antibiotics (Supplementary Data 1). On the contrary, bcr, a multidrug efflux pump, was predominantly present in BD-1.2 isolates (96.4% of isolates) and completely absent in BD-2 isolates. Out of the 16 known antimicrobial resistant genes (ARGs) present in the pangenome of this cohort, only tetA, tetR and bcr were found to statistically separate both lineages. TetA and tetR were both located in a contig showing high similarity to the SXT-ICE element, SXT(HN1) in BD-2 isolates. Conversely, bcr was found in a mobile element in the BD-1.2 isolates with similarity to SXT ICE element, ICEVchBan5. The presence of these SXT elements in the BD-2 and BD-1.2 lineages was previously shown by Monir et al.10. Both contigs contained two identical insertion sequences, mobile genetic elements MGEs, (ISShfr9 and ISVsa3), see Fig. S5. Also, among the 12 annotated genes, four (endA, hlyA, lon and mcrC) were previously found to be related to virulence18,19,20,21,22,23. More information about the function of these genes is given in the Supplementary Note 2.To assess the extent of our results beyond our cohort, we investigated whether the 12 annotated accessory genes that we had found were also present in other Bangladeshi and Indian lineages. We performed a comparative genomic analysis of 219 Vibrio cholerae O1 reference isolates collected in Kolkata, India, and Dhaka, Bangladesh, between the years 2004 and 2022 (ENA public database http://www.ebi.ac.uk/ena, see Supplementary Data 5). The results confirmed the presence/absence patterns of the 12 genes in the BD-1.2 and BD-2 lineages in the reference isolates, aligning with our initial findings, see Supplementary Note 2.In addition to differences in accessory gene types and patterns, missense mutations associated to allelic variations were found in BD-1.2, when compared to BD-2 strains. We identified 1385 SNPs in the core genome, including 291 non-synonymous and 934 synonymous coding variants, both representing variants in their functional protein-coding form. In addition, 160 intergenic SNPs were found, representing variants in their regulatory form. Many SNPs showcased unique allelic distribution patterns between the two lineages. When mapped back, the non-synonymous SNPs identified 291 amino acid substitutions in 105 genes, including 50 known genes and 55 hypothetical ones (see Supplementary Data 4). Table S1 shows core genes with allelic distribution between BD-1.2 and BD-2 significantly different (i.e., containing polymorphic sites found exclusively in one lineage but absent in the other lineage).Among the genes exhibiting lineage-specific allelic variation, some contribute to functions including growth, cell wall organisation, colonisation, toxigenicity and resistance, similar to what found previously10. Additionally, we found genes with a unique non-synonymous variant in BD-1.2, with roles in toxin transport and acid tolerance, shedding light on functions that may clarify their contribution to the recent prevalence of BD-1.2 over BD-2. See Supplementary Note 3 for more information about these genes. Notably, OmpU is another gene with a statistically significant mutation (G325D) underlying lineages’ separation. Amino acid D is predominant in BD-1.2, while the amino-acid G is prevalent in BD-2. To assess for any additional genes separating the BD-1.2 and BD-2 lineages we also conducted an analysis on the pangenomes of the lineages separately but found the results broadly in line with that of the combined pangenome analysis presented above (Supplementary Note 4 and Supplementary Data 6-10).To understand the systemic relationships connecting the identified lineage-specific genetic signatures on a mechanistic level, we analysed the 30 core genes in Table S1 with allelic variants that were found exclusively in one lineage but absent in the other lineage using the V. cholerae GSM model iAM-Vc960 (Fig. 3). Thirteen of these genes (murI, ftsI, appC, suhB, glmM, dsbD, licH, cysG_1, cobB, clcA, argG, mak, phhA) are metabolic and have been identified as playing integral roles in amino acid metabolism, cell wall metabolism, carbon metabolism, amino sugar and nucleotide sugar metabolism and energy metabolism (see Supplementary Data 11). Moreover, for these genes we sought to better understand their role by examining their effects on V. cholerae growth rate, biochemical networks and production of metabolites in the networks. As the effect of mutations/gene knockouts cannot always be observed as change in growth rate (due to the redundancy of the reactions in metabolic networks of bacteria), it can be useful to also consider the changes in metabolite yield. Changes in metabolite yield have been found to correlate with changes in the virulence, persistence, and fitness of some organisms24. Furthermore, V. cholerae are capable of adapting to ecological niches by altering the metabolites they excrete to create a more favourable environment for V. cholerae and/or a less favourable environment for other species competing for the same resources25,26. Mutations disrupting larger numbers of metabolite yields may be suggestive of a larger systems-level impact on bacterial metabolic function. Therefore, gene essentiality, flux variability analysis (FVA) and flux balance analysis (FBA) were used to predict, through gene knockouts, the essentiality and the effects of the identified genetic determinants on the growth rates of V. cholerae, and also used to further explore their influence on metabolite yield. The latter was done by assessing the influence on metabolite flow within the complete metabolic network of V. cholerae, encompassing all known metabolites and metabolic reactions (see Methods). In this analysis it was important to consider all reactions and metabolites in the model rather than focussing on a subset, as doing so ensures no undue bias or assumptions underlie the results.Fig. 3: An overview of the metabolic pathways associated to the core genes underlying the BD-1.2 and BD-2 lineages separation.All genes annotated were found to have reduced flux span through the metabolic system when knocked out. Genes coloured in blue have a significant different allelic distribution between BD-1.2 and BD-2, associated metabolic pathways are labelled in purple. All 3D protein structures were generated in Alphafold123 under a Creative Commons Attribution 4.0 license (CC-BY 4.0), no changes were made.The genes cysG, clcA, adh and mcrC, were found to be essential for growth (i.e., knocking these genes out reduced the biomass growth to less than 0.0001 h−1) in both rich and minimal media. Furthermore, murI, glmM, and dapF displayed auxotrophic behaviour in minimal media, whereas cysG, clcA, adh, and mcrC were found to be essential in rich media with alternative carbon sources. Additionally, three genes, murI, glmM and dapF, were found to be essential for growth in minimal media only. Next, FVA was used to identify biochemical reactions whose flux span was significantly changed (greater than 10% change) by knocking out these genes. In total ten genes (murI, glmM, cysG, clcA, argG, mak, adh, dapF, add, and mcrC) when knocked out significantly changed the flux span in at least one reaction through the model by FVA analysis, Supplementary Data 11. Finally, FBA analysis was used to determine the effect of gene knockouts on metabolite yield. Five genes, murI, glmM, cycG, mak, and dapF were found to reduce at least one metabolite yield to zero in the model when knocked out (given the wildtype yield was greater than 0), Supplementary Data 1127,28. Interestingly, the average number of metabolite yields affected by knockouts of the genes discriminating lineages was significantly higher than a random selection of 100 metabolic genes (p-value 0.0429, Mann Whitney U test, two-sided), indicating a stronger influence on metabolite production for this subset of genes.To further elucidate the metabolic differences between the BD-1.2 and BD-2 lineages, we repeated our previous analyses done on the generalised model using strain-specific models automatically generated by CarveMe27. Gene essentiality analysis concurred with the general model (iAM-Vc960), with only a small number of differences (Supplementary Data 12 and Supplementary Note 5). The effect of murI gene knockouts differed between lineages, proving non-essential in 94% of BD-1.2 lineage models but only in 76% of BD-2 lineage models. Flux variability analysis of the individual models revealed that clcA knockouts led to significant changes in the flux span of the CLt3_2pp reaction, which controls chloride transport, in 96% BD-2 models compared to just 5% of BD-1.2 models. The clcA gene has been linked to bacterial acid resistance and it has been suggested that changes to the expression/repression of this gene may help facilitate survival during movement through the intestinal tract28. Similarly, flux balance analysis indicated that metabolite yield was changed differently across lineages in response to knocking out clcA, with the metabolite yield of chloride reduced to 0 in 95% of BD-1.2 isolates.In summary, a total of 15 genes found to underly the genetic and temporal differentiation of V. cholerae BD-1.2 and BD-2 lineages, were also found to significantly alter the growth, reaction flux, or metabolite yield of V. cholerae when knocked down, either in the generalised iAM-Vc960 GSM model or in the draft strain-specific models. Of interest was the gene clcA, which showed differences in both flux span and metabolite changes between lineages in the draft GSM models. The FVA and FBA results indicate that these genes play important metabolic roles. Disruption of these functions could potentially affect bacterial growth or metabolic output, which may contribute to the survival and dominance of one lineage over another. Although our analysis cannot pinpoint a single SNP as responsible for the loss of metabolic function, it suggests that an accumulation of SNPs or gene losses could collectively lead to metabolic changes. We observe the potential for metabolic alterations driven by multiple mutations (SNPs).Lastly, when mapping the 160 intergenic SNPs back to genomes, we found their location in the upstream/downstream regions of 35 known genes and 34 hypotheticals genes (see Supplementary Data 4). These intergenic SNPs exhibited allelic distribution, with the minor variant prevalent in the BD-2 isolates (68% to 100%), while the major variant dominated in the BD-1.2 isolates (over 98%), only one SNP in BD-1.2 had a major allelic variant at of 47% (Fisher exact test, Bonferroni correction p-value < 2.31e-08). Many of these SNPs were located within transcriptional factor binding sites (TFBs) (Supplementary Data 4). Intergenic SNPs, exhibiting significantly different allelic distributions between BD-1.2 and BD-2, mapped across the TFBs of 11 TFs (ToxT, Fur, AmpR, OmpR, LuxR, LexA, ArgR, PhoP, CRP, ArcA, IHF) (Fig. S6-S16). More information about the function of these transcriptional factor binding motifs is provided in Supplementary Note 6.Machine learning unravels correlations between genomic determinants and clinical symptoms in humansBeyond identifying the potential involvement of new genetic traits in differentiating the BD-1.2 and BD-2 lineages, we hypothesised that the same or additional genetic features might play a significant role in the manifestation and severity of clinical symptoms in patients when infected with V. cholerae. A summary of the distribution of each clinical symptom over the two lineages is given in Fig. S17. We focused on the lineage BD-1.2, which caused the most recent outbreak in Bangladesh. To identify if and which coding and non-coding mutations and/or presence/absence of accessory genes would correlate with the different clinical symptoms, we employed a bespoke, supervised machine learning pipeline.The pipeline is aimed at mining sequencing data to identify the genetic elements that more strongly correlate with observed clinical symptoms, which in this case are vomiting, dehydration, number of stools, duration of diarrhoea and abdominal pain (see Methods section). The pipeline is a bespoke adaptation of ML-based data-mining methods previously developed within our team to identify correlations between genomic features with phenotypes17,18,29,30. In the pipeline, information about different genetic features (SNPs -both from coding and non-coding regions- and presence/absence of accessory genes) can be encoded as input to ML-powered predictive models designed to estimate the likelihood of observing the selected phenotypes under each specific pattern of input values17. As long as trained with sufficient observational data, the ML-powered predictive models are able to replicate experimental evidence, in addition to providing information on what inputs correlated most strongly with each phenotypic manifestation. Through such introspective power, the pipeline is able to unravel co-occurrent, multiple mechanisms (mutations, horizontal gene transfer – HGT), variants in their functional protein-coding and regulatory forms, as well as their additive effect on the targeted phenotypes, which in this work, were clinical symptoms.The following clinical symptoms were selected, namely: vomiting, abdominal pain, diarrhoea duration, 24-hour stool count and dehydration. Each clinical symptom was handled by building a dedicated symptom prediction model, operating using genetic elements as inputs. Two symptoms (vomiting and abdominal pain) were encoded as binary (presence vs absence). The other three symptoms—diarrhoea duration, 24-h stool count, and dehydration—were encoded as multi-class: dehydration as None, Moderate and Severe; diarrhoea duration as <1 day, 1–3 days, 4–6 days, and 7–9 days; and stool count in 24 h as 3–5 times, 6–10 times, 11–15 times, 16–20 times, and 21+ times. We handled the prediction of multi-class symptoms via the implementation of binary predictors.The symptom prediction models were developed with built-in robustness to potential confounding factors. Specifically, the following list of variables was initially considered as potentially having confounding effects: year of collection, location of patient, sex of patient, age range of patient and serology of V. cholerae. Each potential confounder was tested for correlation to the symptom being targeted by the prediction model. If the potential confounder was found correlated to the symptoms (hence moving from potential to proven confounder), then any other input variable also found correlated with the same confounder would be eliminated from the prediction model. All the correlation tests between inputs and symptoms, as well as between inputs themselves, were run using two-sided chi-square tests. Further, possible confounding effects related to random initialisation parameters of SMOTE (see Methods) were contained by running SMOTE multiple times.The development and optimisation of each symptom prediction model powered by machine learning was based on running a comparative analysis of the predictive performances of different machine learning algorithms, namely: linear support vector machine (linear SVM), non-linear SVM with radial basis function (RBF SVM), random forest, extra-tree classifier and logistic regression) and two meta-methods (Adaboost and XGBoost). For each algorithm, multiple configurations of the hyperparameters of the learning algorithms were tested. A nested cross validation approach was used to select the best hyperparameters, based on randomly selecting different training and test sets, and using stratified k-fold cross validation metric. Finally, Friedman and Nemenyi tests were used to statistically compare and select the best performing algorithm for each prediction model (see Methods section).In the end, based on a two-sided chi-square test of independence (p-value < 0.01), the models for abdominal pain, vomiting, number of stools 11–15 times vs. 21+ times, number of stools 11–15 times vs. 16–20 times, dehydration moderate vs severe were found immune to confounding effects due to year of collection, location of patient, sex of patient, age range of patient and serology of V. cholerae. The prediction model: diarrhoea duration <1 day vs 1–3 days was found immune to confounding effects due to age range of patient, sex of patient, location of patient, and serology of V. cholerae. However, the prediction model was found to be influenced by year of collection; therefore, the inputs that were also correlated to year of collection were removed from the analysis (Supplementary Data 13). Moreover, we were able to successfully develop six binary symptom prediction models featuring adequate prediction performance levels. These were dedicated to predicting the following binary phenotypical outcomes: (i) stools 11–15 times vs. 16–20 times; (ii) stools 11–15 times vs. 21+ times; (iii) moderate vs. severe dehydration; (iv) diarrhoea duration <1 day vs. 1–3 days; (v) presence vs absence of vomit; and vi) presence vs absence of abdominal pain (Supplementary Data 14). The remaining binary predictors were discarded for not performing adequately, either because of unbalanced available sets of observations (needed for training the supervised ML models), or because of more challenging separability of the phenotypes given the selected inputs (no features were statistically significant based on the Fisher exact test). Among the tested pipeline technologies mentioned earlier, logistic regression was identified by the Friedman F-test and the Nemenyi post-hoc analysis as the best performing one (Fig. S18). Of the six binary prediction models, four had an AUC greater than 0.9, Fig. 4. Supplementary Data 15 indicates the performance metrics obtained by all binary predictors for each clinical symptom. Figs. 4 and S19 show the performance results for the logistic regression classifier.Fig. 4: Supervised machine learning pipeline accurately predicts the clinical manifestations of hospitalised patients from the genomic determinants extracted from BD-1.2 isolates, collected among the same hospitalised patients.A Flow diagram showing machine learning pipeline including data (green), pre-processing steps (yellow) and classification (blue). B Machine learning performance results measured by the area under the curve (AUC) from 30 training runs for clinical symptom combination. The results shown are for the best classifier Logistic Regression, as defined by the Nemenyi test (Fig. S18). The violin plots show the distribution of the data, with each data point representing one classification model. Inside each violin plot is a box plot, with the box showing the interquartile range (IQR), the whiskers showing the rest of the distribution as a proportion of 1.5 x IQR and the white circle representing the median value. C Number of features (accessory genes, core genome and intergenic SNPs) selected for each symptom. Predictive models were generated for six different clinical symptoms (X-axis): abdominal pain; dehydration Moderate vs. Severe; duration of diarrhoea <1 day vs. 1–3 days; number of stools 11–15 times vs. 16–20 times; number of stools 11–15 times vs. 21+ times; and vomit.Analysis of the best-performing symptom prediction models allowed us to identify the input features (core genome coding and intergenic SNPs and accessory genes) most strongly correlated to each phenotype (Supplementary Data 16). Seventy-nine different features in total were selected as significantly correlated to at least one of the six symptom prediction models, with 68% being selected in two or more models (Fig. 5). No features were selected for all symptoms. All features associated with number of stools 11–15 times vs. 21+ times were found associated to at least one of the other five symptom prediction models. Forty-five accessory genes (nine known genes, tufB_2, blc, pckA, luxR_2, hcpA_1, rpoS, dcuA, hpt, luxR, and 36 hypothetical genes) and 28 core SNPs over 23 genes (14 known, clpS, gshB, dapF, fabV_1, add, tufB, lpoA, phrB, yjcS, fabH1, cysG_2, padC, pepN, tadA_2, and nine hypothetical genes) were identified as strongly associated to at least one of the symptoms. From the nine known accessory genes: four (rpoS, hpt, luxR and pckA) were found in the vomit model; dcuA was found in the abdominal pain model; hcpA_1 was found only in the number of stools 11–15 times vs. 16–20 times; luxR_2 was found in two models (vomit and dehydration moderate vs severe); blc and tufB_2 were found in three models (vomit, number of stools 11–15 times vs. 16–20 times and number of stools 11-15 times vs. 21+ times) with tufB_2 also found in abdominal pain and diarrhoea duration <1 day vs. 1–3 days models. Six SNPs from the genes tufB, dapF, clpS, gshB and fabV were associated to three symptom prediction models (vomit, number of stools 11–15 times vs. 16–20 times and number of stools 11–15 times vs. 21+ times) with the SNPs from the genes dapF and fabV also associated with abdominal pain and diarrhoea duration <1 day vs. 1–3 days and the SNP from the gene tufB associated with dehydration moderate vs severe.Fig. 5: Undirected graph network illustrating the genomic features associated with clinical symptom models for V.cholerae. Node colour denotes the genomic determinant category, (i.e. accessory genes and/or core genome coding, and intergenic SNPs) identified by machine learning. Nodes are labelled with numbers corresponding to specific genes associated with each genomic determinant, as detailed in the Genes Legend, while unnumbered nodes are related to unannotated (hypothetical) genes. The clinical symptom models are highlighted in different colours and explained in the legend Symptoms Legend featuring abdominal pain; dehydration Moderate vs. Severe; duration of diarrhoea <1 day vs. 1–3 days; number of stools 11–15 times vs. 16–20 times; number of stools 11–15 times vs. 21+ times; and vomiting.Among the 45 accessory genes linked to clinical symptoms, six hypothetical genes were also statistically significant in distinguishing the two lineages. Among the other accessory genes selected, four (blc, pckA, luxR and rpoS) have important biological functions. In particular, Blc, also known as VlpA, is a lipocalin, that is correlated to acquisition of drug resistance in V. cholerae31. PckA (phosphoenolpyruvate carboxykinase) is important for gluconeogenesis, a highly conserved pathway in bacteria and humans. Interfering with  the gluconeogenesis pathway impacts V. cholerae colonisation in mouse models, highlighting its crucial role in sustaining V. cholerae growth and viability within the intestines32. LuxR plays a key role in regulating biofilm production and secretion in V. cholerae33. RpoS is a sigma factor that facilitates physiological adaptation to general starvation and stationary phase growth in different species. V. cholerae strains lacking the gene rpoS are impaired in their ability to survive in different environmental stresses. RpoS was also shown to be important in V. cholerae for efficient intestinal colonisation34.Out of the 28 core SNPs associated to the clinical symptoms, 11 were also found previously as statistically significant in differentiating the BD-2 and BD-1.2 lineages (see above), Supplementary Data 16. These 11 SNPs mapped to 11 genes (clpS, gshB, dapF, fabV_1, add, and six hypothetical genes). Among the SNPs mapping to known genes (clpS, gshB, dapF, fabV_1, add), three are non-synonymous SNPs mapping to clpS, gshB and fabV. In V. cholerae clpS regulation involves cAMP receptor protein (CRP)31. CRP is important in intestinal colonisation35. GshB, encodes a glutathione synthetase (GSH), which is associated to resistance to oxidative stress. V. cholerae fabV is one of the several triclosan-resistant ENR encoding genes36.As in our previous lineage analysis, we sought to better understand the importance of the genes which had been found to better correlate with the severity of the symptoms. We examined for those genes that were metabolic, through FVA and FBA, the effects of such genes on growth rate (gene essentiality), and beyond that, their influence on metabolite yield and reaction flux. Nine symptoms-related genes were identified as metabolic genes in the iAM-Vc960 GSM model (Fig. 6). Eight of these genes were associated to five metabolic systems (Supplementary Data 17). FabH1 and gshB associated with cofactor and prosthetic group metabolism; pckA is associated with carbohydrate metabolism; dcuA plays a crucial role in C4-dicarboxylate transport; dapF, pepN and gshB are significant in amino acid metabolism; add and pckA are relevant to nucleotide metabolism; oppA and fabH1 are involved in cell wall metabolism, with fabH1 relevant for fatty acid biosynthesis (Supplementary Data 17).Fig. 6: An overview of the metabolic pathways impacted by statistically significant genes underlying clinical symptoms.All genes annotated were found to have reduced the flux span through the metabolic system when knocked out. Genes coloured in pink and purple carried mutations or are accessory genes associated to the clinical symptom, respectively, and connected metabolic pathways (labelled in blue). The genes coloured in purple were also found as statistically significant in differentiating the BD-2 and BD-1.2 lineages (see previous sections). All 3D protein structures were generated in Alphafold123 under a Creative Commons Attribution 4.0 license (CC-BY 4.0), no changes were made.Using FBA and FVA analysis, the knockouts of the genes dapF and gshB were found to halt production of several metabolites. The genes pckA, add, dapF, oppA, gshB were found to significantly change the reaction flux span, Supplementary Data 17. Both FBA and FVA analysis can infer if potential metabolic adaptation mechanisms for V. cholerae can lead to alterations in bacterial virulence, potentially leading to worse symptoms, if genes significantly affect pathways which are associated with important functions such as colonisation, biofilm production and cell wall synthesis. For example, the gshB gene, a glutathione reductase, contributes to V. cholerae intestinal colonisation37 and has a role in acid tolerance response38. Similarly, dapF was found as an essential gene in minimal media and leading to auxotrophic behaviour to the amino-acid lysine. As Pearcy et al.39 indicated, an auxotrophic behaviour of a gene connected to amino-acid biosynthesis is important because it can provide competitive fitness advantage against commensal bacteria. During the infection stage V. cholerae engage and compete with commensal bacteria for nutrient acquisition to support rapid growth and multiplication40. Moreover, the lysine pathway plays a central role in eubacteria cell wall biosynthesis, since meso-diaminopimelate is the immediate precursor for the biosynthesis of its main component, peptidoglycan, with dapF responsible for the synthesis of meso-diaminopimelate in the lysine pathway41,42. The proper synthesis and maintenance of peptidoglycan is essential for bacterial virulence and its viability43.To further investigate the link between metabolic gene variations and the clinical symptoms observed in different strains, we utilised draft strain-specific models generated with CarveMe27. The gene essentiality analysis results were largely consistent with those of the general model (iAM-Vc960), with only a few differences noted (Supplementary Data 18). The effect of dapF gene knockouts varied between models with the gene being essential in 93% (n = 20) and non-essential in 7% (n = 9) of the models. Comparing symptoms between the ‘essential’ and ‘non-essential’ groups, dehydration was significantly more severe in the ‘non-essential’ group (Fisher exact test p-value = 0.05). All strains in this group exhibited severe dehydration, suggesting a link between non-essentiality of the dapF gene and the severity of V. cholerae symptoms. In relation to this, the flux balance analysis revealed changes in metabolite yields associated with the genes dapF and cysG_2 across all strain-specific models. For dapF, altered metabolite yields were predominantly observed in strains where dapF was essential, while knocking out dapF in non-essential models had minimal impact on the metabolite yields of murein-related metabolites. This indicates metabolic adaptations linked to bacterial survival in these strains, potentially contributing to more severe disease outcomes. Additionally, knocking out the padC gene resulted in significant changes in metabolite yields only in the NGICDV-066 strain. Although conclusions drawn from a single strain are limited, it is notable that this isolate exhibited the most severe clinical symptoms across all measured symptoms, except for the duration of diarrhoea (presence of vomiting, presence of abdominal pain, number of stools (21+ times), presence of severe dehydration, duration of diarrhoea 1-3 days). Flux variability analysis in individual models indicated consistent behaviour across all strain-specific models regarding gene knockouts associated with clinical symptoms. Specifically, five gene knockouts (add, dapF, gshB, padC, pckA) showed significant flux span changes in all models.In summary, in relation to gene essentiality, reaction flux and metabolite yield, our results show that gshB and dapF make interesting candidates for further analysis, as knockout models of these genes predict significant changes to the bacterial metabolic function.To delve deeper into understanding the functional mechanisms underlying clinical symptoms, we explored the interactome of the proteins associated to the clinical symptoms. The protein-protein interaction network (PPI) analysis revealed the interactome of 36 proteins, selected by the machine learning pipeline, with 109 other proteins, Fig. S20. The KEGG analysis indicated enrichment in ribosome proteins (e.g., RpoS) and fatty acid biosynthesis (e.g., FabH1, FabV) (Fig. S21). The colonisation in the human intestine and virulence of V. cholerae is intricately connected to both fatty acid metabolism44 and the ribosome pathway45. The gene onthology (GO) analysis highlighted enrichment in translation, peptide biosynthetic processes, and gene expression, featuring TufA, TufB, RpoS, GshB (Supplementary Data 19 and 20). The peptide biosynthetic pathway plays a vital role in V. cholerae biofilm formation and colonisation23.None of the six intergenic SNPs selected by the machine learning pipeline were in TFBs or promoters. These SNPs were located in a region without any functional annotations within 2 kbps upstream or 0.5 kbps downstream of a gene, adhering to the standard database dbSNP cutoffs for SNP-to-gene mapping46,47. See Supplementary Data 16 for additional information about the location of these SNPs.Structural analysis suggests evolutionary drivers of selection, mechanistic bases for BD-2 and BD-1.2 lineages evolution, and associations to clinical symptomsTo further understand whether the identified alleles play a causal role in the evolution of lineages and clinical symptoms, we selected two of the top-ranked non-synonymous SNP candidates, prioritising the following aspects in relation to the associated genes: (i) have significant difference of allelic distribution between BD1-1.2 and BD-2; (ii) have a significant correlation, as detected by the ML pipeline, with the selected clinical symptoms; (iii) are characterised as functionally important for V. cholerae metabolisms (i.e. significantly impacting reaction flux when knocked out, as highlighted by the GSM model) and/or interactome (i.e. enrichment of the functions and mechanisms related to pathogenesis); (iv) 3D structural mutation analysis could be benchmarked with experimental evidence. This resulted in three genes, all top-ranked by both the Fisher Exact test for BD-1.2 and BD-2 lineage evolution and the ML analysis for the underlying clinical symptoms, namely: fabV, gshB and clpS. We mapped the alleles of fabV, gshB and clpS to their protein structures using both experimental crystal structures and predicted homology models. However, the 3D-structure could be utilised to infer the mechanistic basis only for fabV and gshB.In all BD-2 isolates FabV had a proline at position 149 (Pro149) whereas, in BD-1.2 isolates, the Pro149 was found in only 40.5% of cases, with the remaining 59.5% isolates exhibiting histidine at position 149 (His149). The BD-1.2 isolates with His149 showed a higher duration of diarrhoea (1–3 days) and a higher number of stool score (16-20 times and 21+ in 24 h) compared to the BD-1.2 isolates with Pro149, featuring a lower diarrhoea duration (<1 day) and lower number of stools score (11–15 times). The amino acid 149 was located in the trans-2-enoyl-CoA reductase catalytic domain (Fig. 7A–E), when Pro149 is present, it interacts with Lys148, Ser151, Trp159 through Van der Waals (VDW) interactions, whereas His149 not only forms the aforementioned interactions but also creates an extra VDW interaction with Lys148. Furthermore, His149 interacts with an additional amino acid, Arg150, through a VDW interaction. These additional interactions in the presence of the His149 cause an increase in the stability of the structure (ΔΔG = 0.101 kcal/mol > 0) and a decrease of the molecule flexibility (ΔΔSVib ENCoM: −0.053 kcal.mol−1 K−1), which is usually linked to a stronger binding affinity48,49. Moreover, the presence of His149 increased the positive charge of the surrounding area (Lys148, His149, Arg150) (Fig. S22), with an overall electrostatic energy increasing from 7.3E + 03 kJ/mol (Pro149) to 7.48E + 03 kJ/mol (His149) within the 5 Å region and with an overall protein total electrostatic energy rising from 2.1E + 05 kJ/mol (Pro149) to 2.52E + 05 kJ/mol (His149). Exposed, positively charged amino acids are suggested to promote interactions with negatively charged cellular systems50. The enhanced positive charge of FabV in the presence of His149 might support its role in participating in the breakdown of the negatively charged fatty acids.Fig. 7: 3D protein structure analysis of FabV allelic variants underlying BD-1.2 and BD-2 lineage evolution and clinical symptoms.A Violin plot indicating the distribution of the diarrhoea duration score (0: no diarrhoea, 1: <1day, 2: 1–3 days, 3: 4–6 days and 4: 7–9 days) for the isolates containing either Pro149 (P) or His149 (H). Statistical significance was tested with a two-sided Mann Whitney U test, p-value is shown. B Violin plot indicating the distribution of the number of stools score (0: <3 times, 1: 3–5 times; 2: 6–10 times; 3: 11–15 times; 4: 16–20 times; 5: 21+ times) for the isolates containing either Pro149 (P) or His149 (H). Statistical significance was tested with a two-sided Mann Whitney U test, p-value is shown. C The bar graph displays the number of isolates in the two BD lineages associated with Pro149 (P) and His149 (H). D 3D structures of FabV (AlphaFold) with Pro149 and coloured by functional domains. Amino acid residues (Lys148, Ser151, and Trp159) interacting with Pro149 (green) are shown in sticks models. E 3D structures of FabV (AlphaFold) with His149 and coloured by functional domains. Amino acid residues (Lys148, Arg 150, Ser151, and Trp159) interacting with His149 (purple) are shown in sticks models.GshB, a glutathione reductase, has been shown to contribute to V. cholerae intestinal colonisation37 and to have a role in the ability of V. cholerae to mount an acid tolerance response38. In all BD-2 isolates GshB had a threonine at position 93 (Thr93), whereas in the BD-1.2, the Thr93 was only found in 21.5% of the cases, with most (78.5%) of the BD-1.2 isolates exhibiting an isoleucine (Ile93) at this position. The BD-1.2 isolates with Ile93 are associated to a higher duration of diarrhoea (1-3 days) and a higher number of stool score (16-20 times and 21+ in 24 h) compared to the BD-1.2 isolates with Thr93. Thr93 interacts with Asp92, Ile96, Tyr97 through 13 VDW interactions and 1 H-bond; whereas Ile93 not only forms the aforementioned interactions but also creates extra VDW interactions with Tyr97 (Fig. 8A–E). These additional bonds in the presence of Ile93 cause an increase in the stability of the structure (ΔΔG = 0.384 kcal/mol >0) and a decrease of the molecule flexibility (ΔΔSVib ENCoM: −0.055 kcal.mol-1.K-1), which is usually linked to a stronger binding affinity48,49. Moreover, the presence of Ile93 increased the negative charge of the surrounding area (<5 Å) (Fig. S23A, B), with an overall electrostatic energy decreasing from 7.93E + 03 kJ/mol (Thr93) to 7.4E + 03 kJ/mol (Ile93) within the 5 Å region and with an overall protein total electrostatic energy varying from 2.1E + 05 kJ/mol (Thr93) to 1.8E + 05 kJ/mol (Ile93). A decrease in total electrostatic energy is often associated to folding51, protein folding stability is largely dependent on the hydrophobic interactions of nonpolar residues52. The surface, on average, has become more hydrophobic, indicating a possible reorientation of residues or a change in the surface’s exposure to the solvent (Fig. S23C, D).Fig. 8: 3D protein structure analysis of GshB allelic variants underlying BD-1.2 and BD-2 lineage evolution and clinical symptoms.A Violin plot indicating the distribution of the diarrhoea duration score (0: no diarrhoea, 1: <1day, 2: 1–3 days, 3: 4–6 days and 4: 7–9 days) for the isolates containing either Thr93 (T) or Ile93 (I). Statistical significance was tested with a two-sided Mann Whitney U test, p-value is shown. B Violin plot indicating the distribution of the number of stools score (0: <3 times, 1: 3–5 times; 2: 6–10 times; 3: 11–15 times; 4: 16–20 times; 5: 21+ times) for the isolates containing either Thr93 (T) or Ile93 (I). Statistical significance was tested with a two-sided Mann Whitney U test, p-value is shown. C The bar graph displays the number of isolates in the two BD lineages associated Thr93 (T) or Ile93 (I). D 3D structures of GshB (AlphaFold) with Thr93 and coloured by functional domains. Amino acid residues (Asp92, Ile96, and Tyr97) interacting with Thr93 (green) are shown in sticks models. E 3D structures of GshB (AlphaFold) with Ile93 and coloured by functional domains. Amino acid residues interacting with Ile93 (orange) are shown in sticks models.

Hot Topics

Related Articles