A strategy to detect metabolic changes induced by exposure to chemicals from large sets of condition-specific metabolic models computed with enumeration techniques | BMC Bioinformatics

Transcriptomic data processingTranscriptomic data used in this study were obtained from the Open TG-GATEs [21] database. For this study, we selected in vitro data generated on PHH. Raw transcriptomic data were downloaded as CEL files from https://dbarchive.biosciencedbc.jp/en/open-tggates/download.html. CEL files were read with the affy [63] package and the resulting dataset was normalized with the robust multiarray average method [64] from the limma R package. Information about the different PHH cell batches used in the different experiments was obtained from the authors (S6 Table) and was used to correct the batch effect with Combat [65], available in the SVA package. We annotated the probes with the annotation database corresponding to the Affymetrix HG133U Plus 2 chips available in the hgu133plus2.db package [66] and the AnnotationDbi package [67]. Where several different probes mapped to the same gene, we selected the probe with the highest standard deviation of expression values in the dataset. Finally, to match DEXOM [34] requirements, we categorized transcriptomic data with Barcode [68,69,70]. We used the z-scores output available from the barcode package and applied a 25/75 percentiles cutoff to obtain a list of highly and lowly expressed genes. Genes below the 25th percentile were considered as lowly expressed genes and assigned a − 1 value, while genes above the 75th percentile were considered as highly expressed genes and assigned a + 1 value. Genes between the 25th and 75th percentiles were not considered (i.e., 0 value) and therefore did not have any impact on the modeling process.DEG identificationLists of DEGs were obtained from the ToxicoDB [20] database, which provides pre-processed differential gene expression data from several databases, including Open TG-GATEs. A jupyter notebook querying ToxicoDB API was developed to automate the process. First, it downloads the ToxicoDB compounds json file and iterates over the list of compounds obtained from this file to query the ToxicoDB API to get analyzed data associated with each compound. The retrieved data were then filtered by selecting genes from the Open TG-GATEs human study with an absolute value of log2(FC) larger than 0.26 and a q-value less than 0.05, from samples subjected to a “high” dose (three doses were investigated and available for this dataset: low, middle, and high) over 24 h. The filtered DEGs list for each compound was then stored in a.tsv file.Metabolic model preparationWe used the Recon2.2 [26] (downloaded from https://www.ebi.ac.uk/biomodels/MODEL1603150001) GSMN as the initial framework for the modeling and network analysis steps. This GSMN is composed of 5324 metabolites, 7785 reactions, and 1675 genes. We modified the model biomass reaction, to account only for the precursors required for cell maintenance but not replication, in order to better represent the fact that PHH are differentiated cells with a short lifespan and do not proliferate under classical culture conditions [71]. To do so, we set the stoichiometric coefficient of the DNA precursor in the biomass reaction to zero and applied a correction coefficient (S1 Appendix) to other metabolites to keep the reaction balanced for the production of one unit of biomass. Next, we forced the lower bound of the modified biomass reaction, which is now representing the cost of maintaining the cell viability, to a value of 1 instead of 0 to ensure that the generated models will be able to maintain cell viability (i.e., have a non-zero flux through the maintenance reaction). Exchange reactions were left unconstrained as we did not have information on cell medium composition.Condition-specific modeling with partial enumerationWe integrated the categorized transcriptomic data into the GSMN through GPRs to define an a priori set of active and inactive reactions. A highly expressed gene will be given the value of 1 whereas a lowly expressed gene will be given the value of − 1. Genes are associated with reactions by the GPRs. GPRs are Boolean rules that indicate which genes are required to produce a specific enzyme that catalyzes one or more reactions. When a reaction has an AND GPR, the algorithm will annotate the reaction as active if the minimal value of the categorized GPR’s gene expression values equal 1 (i.e., all genes of the GPR are highly expressed) and inactive otherwise, such as:$$\begin{gathered} For\;Gene1 = 1,\;Gene2 = – 1,\;Gene3 = 1,\;Gene4 = – 1: \hfill \\ Gene1\;AND\;Gene2 = \min \left( {Gene1,Gene2} \right) = \min \left( {1, – 1} \right) = – 1 \hfill \\ \end{gathered}$$In this case, if Gene1 is highly expressed AND Gene2 is lowly expressed, the reaction is considered inactive.When a reaction has an OR GPR, the algorithm will annotate the reaction as active if the maximal value of the categorized GPR’s gene expression values equal one (i.e., at least one gene of the GPR is highly expressed) and inactive otherwise, such as:$$Gene1\;OR\;Gene2 = \max \left( {Gene1,Gene2} \right) = \max \left( {Gene1,Gene2} \right) = \max \left( {1, – 1} \right) = 1$$In this case, if Gene1 or Gene2 is highly expressed, the reaction is considered active.These rules can be applied to more complex GPRs such as:$$\begin{gathered} \left( {\left( {Gene1\;OR\;Gene2} \right)AND \left( {Gene3\;OR\;Gene4} \right)} \right) = \min \left( {\max \left( {Gene1,Gene2} \right),\max \left( {Gene3,Gene4} \right)} \right) \hfill \\ \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad = \min \left( {\max \left( {1, – 1} \right),\max \left( {1, – 1} \right)} \right) = \min \left( {1,1} \right) = 1 \hfill \\ \end{gathered}$$In this case, if Gene1 or Gene2 is highly expressed and Gene3 or Gene4 is highly expressed, the reaction is considered active.A reaction will be identified as a priori active if the resulting value equals 1 and as a priori inactive if the resulting value equals − 1. At the end of this process, we identified a list of a priori active/inactive reactions for each sample that was then provided to DEXOM. DEXOM is a constraint-based modeling algorithm based on iMAT [29, 30] whose objective is to find a steady state distribution of flux that maximizes the number of reactions whose flux is consistent with transcriptomic data levels [30].To extract a representative set of solutions from the whole solution space, we adapted the partial enumeration from DEXOM and used the python implementation available in dexom-python (https://github.com/MetExplore/dexom-python). First, we applied a full Reaction-enum strategy (see [34] for the complete description). This method iterates over all the reactions of the network, blocking each of them successively and solving the resulting mixed integer linear problem. Next, we applied the DEXOM Diversity-enum strategy, starting with a random solution picked per range of Reaction-enum solutions as a starting point. Diversity-enum aims to find new solutions that are gradually more different from the starting solution, allowing one to explore the limits of the solutions’ space. To reduce the computational costs of the original DEXOM partial enumeration approach, we reduced the number of starting solutions to a set of 1% of the solutions obtained with the Reaction-enum approach. We used an adapted version of systematic sampling (i.e., one random solution picked per batch of solutions, S1 appendix for details) to generate a starting set representative of the complete set of Reaction-enum solutions. Finally, all the optimal solutions for a sample were grouped in a single tabulation-separated file. The list of parameters used for running the DEXOM algorithm can be found in S1 Appendix.Identification of perturbed reactionsFor each reaction, the predicted activation status across all optimal condition-specific metabolic networks enumerated by the DEXOM strategy (Table 1) is stored in a numeric vector. Therefore, for each condition, the output of the DEXOM enumeration is a matrix of binary vectors where columns correspond to reactions and rows correspond to enumerated optimal solutions (Table 3).Table 3 DEXOM output example for one condition. Alternative solutions obtained with partial enumeration are stored as binary vectorsFor each reaction and each condition, we can compute an activation frequency value f, as the number of solutions in which the reaction is predicted to be active, over the total number of solutions. To compare the activation frequency values of two conditions, we introduce a metric called R2.$$R2 = \left( {\frac{{nActive_{ctrl} }}{{nTotal_{ctrl} }} – \frac{{nActive_{trt} }}{{nTotal_{trt} }}} \right)^{2} = \left( {f_{ctrl} – f_{trt} } \right)^{2}$$where \(nActive\) is the number of times the reaction is predicted active and \(nTotal\) is the total number of solutions in the condition (i.e., \(nActive + nInactive\)). \(f_{ctrl}\) is the activation frequency of a reaction under the control condition and \(f_{trt}\) is the activation frequency of the same reaction under the treated condition.We considered reactions as perturbed if the R2 value was higher than 0.2 when comparing treated versus non-treated conditions. This is a rather conservative threshold since it means that to be considered as differentially activated, a reaction needs to be at least almost twice (80%) more/less active in the treated condition compared to the non-treated condition.Baseline noise calculationThe absence of gene expression information for some reactions, and therefore the absence of constraints on those reactions when using the DEXOM algorithm, leads to uncertainty (or “noise”) in the reaction activation frequency prediction. Indeed, in the absence of transcriptomic data on reactions, DEXOM can predict these reactions as active or inactive without any impact on the optimality score. This uncertainty could lead to biased conclusions when comparing two conditions. To avoid considering reactions that are loosely constrained as DAR, we implemented a method that aims at estimating the baseline noise associated with each reaction in the network. The noise refers to the variation of the activation frequency of each reaction between pairs of control conditions, which is due to the fact that the predicted activity of the reaction does not impact the consistency with the data. Because several control conditions and chemical exposure times have been used in this dataset, we computed the baseline noise for each reaction between control conditions at a specified time and a specified vehicle. Replicates were pooled for each molecule. Practically, for each reaction, we computed the R2 between all pairs of control conditions for a given vehicle at a given exposure time and calculated the median of all R2s for each reaction. Therefore, we obtained for each reaction a baseline noise estimation that is the median of all pairwise comparisons between selected control conditions.Then we filtered out perturbed reactions with an R2 between two conditions of interest (e.g., control vs. treatment) that was less than two times the noise estimate for this reaction. The remaining perturbed reactions are considered as DARs.Metabolic reaction graph constructionA metabolic reaction graph can be formally defined as a directed graph G = (V,E) where:V is a set of vertices, where each vertex (\({v}_{i}\)) corresponds to a distinct metabolic reactionE is a set of directed edges, in which reaction (\({v}_{i}\)) and reaction (\({v}_{j}\)) are connected by an edge (\({v}_{i},{v}_{j}\)) if a metabolite produced by reaction (\({v}_{i}\)) is a substrate of reaction (\({v}_{j}\)).Edges can be annotated with the name of the metabolite consumed/produced by the two reaction nodes it connects. The metabolic reaction graph formalism allows to focus the attention on metabolic reactions and the interactions between these reactions therefore it is well adapted in our approach to study the identified DARs.GSMNs, which are large and complex metabolic networks, contain hub nodes (i.e., nodes connected to many other nodes). In fact, as shown by Arita et al. [72], these compounds will create shortcuts in the network since they are connected to a large number of reactions. Hence, these side compounds will strongly impact any topology based algorithms like path search or subgraph extraction by resulting in non-relevant biochemical results [73]. Therefore, removing these side compounds is necessary to obtain a metabolic reaction graph topology more biochemically representative of the direct interactions between metabolic reactions and allow a precise understanding of the links and interdependencies between DARs. To solve this issue, we identified a list of “side compounds” corresponding to the main hub nodes of the network (S4 Table) that will be removed during the metabolic graph construction. To improve the performance of the graph theory methods, we also identified a list of reactions (S6 Table) to be excluded during the reaction metabolic graph construction. These reactions are blocked reactions (i.e., reactions that cannot carry a flux in the GSMN), reactions always inactive in our cellular model, pool, sink, and exchange reactions. Metabolic reaction graphs were constructed according to the parameters defined previously before applying any graph-based methods such as the distance matrix calculation and subnetwork extractions.Reaction distance matrix calculationWe computed the pairwise distance between reactions of interest by computing all the shortest paths between the nodes of the network corresponding to these reactions, resulting in a reaction pairwise distance matrix. To that purpose, we implemented a java app using the Met4j metabolic network toolbox (https://forgemia.inra.fr/metexplore/met4j). This app takes as input the GSMN (i.e., Recon2.2) SBML file, a list of side compounds, a list of reactions to exclude, and a list of reactions between which the distances will be computed (e.g., DARs in our case).After loading the SBML file, the GSMN is pruned by removing side compounds and reactions to exclude. Then the JGraphT [74] ManytoMany shortest path implementation [75] is used. This implementation has been preferred over more common implementations such as Dijkstra [76] or Floyd–Warshall [77] due to its performance and its ability to take a specific list of reactions as input. The distance matrix is then saved in a comma separated file.DARs clusteringFor each DARs distance matrix computed with Met4J, we performed a hierarchical clustering with the Ward algorithm implemented in the SciPy python module. To control the size of clusters, we visualized each clustering result with a dendrogram representation and manually determined the number of clusters. Clusters were then obtained with the cutree function.Subnetwork extractionFor each cluster of DARs, we extracted a minimal subnetwork with the minimal Steiner tree algorithm implemented in Met4J. This algorithm searches for a minimum spanning tree that contains the set of DARs and a minimum set of nodes from the GSMN to connect them according to the initial network topology.Subnetwork visualizationWe visualized the extracted subnetworks with MetExploreViz [40]. MetExploreViz is a JavaScript library dedicated to visualization for GSMNs that is integrated within the MetExplore [78] webserver. The list of metabolic reactions contained in the subnetwork is first mapped on the selected GSMN with MetExplore (https://metexplore.toulouse.inrae.fr/index.html/). From this mapping, we can then visualize and prune (i.e., remove side compounds) the bipartite metabolic graph interactively with MetExploreViz. Two visualizations were done for each of the subnetworks. One with mapped up-activated DARs colored in red and down-activated DARs colored in green (Figs. 5A and 6A), and a second one with reaction links colored according to the pathway associated with the reaction (Figs. 5B and 6B). Interactive visualizations were saved online and are accessible via links provided in supplementary materials.Workflow implementationWe partitioned the workflow into three jupyter notebooks with a common properties file. The first notebook, named “partial_enumeration.ipynb,” takes as input barcode z-scores and generates batch files to execute the partial enumeration protocol on a SLURM computing cluster. Of note, these batch files can also be executed directly on a standard Linux (i.e., Ubuntu, Debian, etc.) operating system. The second notebook, named “dars_calculation.ipynb,” takes as input the partial-enumeration results from the previous notebook and computes baseline noise and DARs. Finally, the third notebook, named “analysis.ipynb,” handles the network analysis step of our workflow, which includes computing the distance matrix for each list of DARs, hierarchical clustering, and extracting subnetworks with Met4J.These notebooks provide an example as to how to run the pipeline from the datasets used in this paper. The code has been packaged and published on pypi (https://pypi.org/project/manamodeller/) with its associated API documentation (https://manamodeller.readthedocs.io/en/latest/?badge=latest). It provides the required functions to run the pipeline and apply it to other datasets. Python library code and jupyter notebooks are available on GitHub: https://github.com/LouisonF/MANA.Computing environmentCondition-specific modeling and partial enumeration requires CPLEX v20.10 to be installed on your system. Batch files generated by the “partial_enumeration.ipynb” notebook are designed to be launched on a SLURM computing grid but can also be launched on a standard Linux operating system. A local version of dexom-python is also required and can be cloned from the dexom-python repository (https://github.com/MetExplore/dexom-python). Finally, the network analysis workflow calling Met4J apps requires at least Java 11 to be installed on your machine.

Hot Topics

Related Articles