BayesianSSA: a Bayesian statistical model based on structural sensitivity analysis for predicting responses to enzyme perturbations in metabolic networks | BMC Bioinformatics

Our model

In SSA, the predictive response is obtained by considering all cases of \(\textbf{r}\). \(\textbf{r}\) depends on \({\left\{ {k_j}\right\} }_j\) and \({\left\{ {x_m}\right\} }_m\), which, in turn, depend on environmental conditions, such as aerobic/anaerobic states, pH levels, and nutrient availability, as well as on individual differences among microbes. The SSA qualitative response prediction described in the previous section thus focuses on the general consequences of responses drawn from structural information alone. We propose a Bayesian statistical model based on SSA, named BayesianSSA, to integrate environmental information extracted from perturbation data into SSA predictions. Regarding \(\textbf{r}\) as a stochastic variable, we consider a probability for each value of \(\textbf{r}\). The BayesianSSA posterior probability of \(\textbf{r}\) reflects the perturbation data and thus extracts environmental information. Predicting responses on the basis of the posterior distribution of \(\textbf{r}\) and positivity confidence values enables us to integrate the environmental information into SSA predictions. A positivity confidence value, which we propose in this study, indicates the probability that the predictive response is positive. Positivity confidence values enable us to interpret indefinite predictions stochastically. We describe the details in the “Positivity confidence value” section.

Prior distribution and likelihood

In this section, we describe the likelihood function and the prior distribution of BayesianSSA. Suppose that we have a perturbation dataset \(\textbf{y}\), which shows the signs of experimentally observed responses. We define the perturbation record, an element of \(\textbf{y}\), as follows:$$\begin{aligned} y_i :={\left\{ \begin{array}{ll} 1 & \text{if} \; \text{increase} \\ -1 & \text{if} \; \text{decrease} \end{array}\right. } , \end{aligned}$$
(2)
where \(y_i\) is the i-th perturbation record, obtained from the \((m_i, j_i)\) experiment, and “\(\text{increase}\)” and “\(\text{decrease}\)” indicate the cases where the observation target increases and decreases, respectively, when the perturbation target is perturbed (see the “Qualitative response prediction” section for the definition of observation and perturbation targets). We consider the probability that \(y_i\ne q_{m_i,j_i}(\textbf{r})\) because experimental errors may occur even if \(q_{m,j}(\textbf{r})\) makes the correct prediction, and we assume the following likelihood function:$$\begin{aligned} p(y_i | \textbf{r}, \varvec{\rho }) = {\left\{ \begin{array}{ll} \rho _{m_i,j_i}& \text{if} \; y_i = q_{m_i,j_i}(\textbf{r})\\ 1 - \rho _{m_i,j_i}& \text{if} \; y_i \ne q_{m_i,j_i}(\textbf{r})\end{array}\right. } , \end{aligned}$$
(3)
where \(\rho _{m,j}\in (0, 1)\) is a parameter indicating the reliability of the (m, j) experiment, and \(\varvec{\rho }\) is a matrix whose (m, j) element is \(\rho _{m,j}\). The likelihood function means that the probability of \(y_i\) is \(\rho _{m_i,j_i}\) if \(\textbf{r}\) can accurately predict \(y_i\), and is \(1-\rho _{m_i,j_i}\) otherwise. In other words, BayesianSSA assumes that the result of the i-th experiment can stochastically vary in accordance with the probability \(\rho _{m_i,j_i}\). Since \(\rho _{m,j}\) is different for each (m, j), each (m, j) experiment is assumed to have its own reliability in BayesianSSA. This assumption is reasonable because the distribution of measured values is supposed to differ for each (m, j) experiment. We consider the prior distribution of \(\rho _{m,j}\) as$$\begin{aligned} p(\rho _{m,j}) = \mathcal {B}\left( \rho _{m,j} | a, b \right) , \end{aligned}$$where \(\mathcal {B}\left( \cdot | a, b \right)\) denotes the beta distribution probability density function with parameters \(a \in \mathbb {R}_{>0}\) and \(b \in \mathbb {R}_{>0}\). The prior distribution of \(\textbf{r}\) should be chosen in accordance with the constraint on \(\textbf{r}\). A typical constraint on \(r_{j,m}\) is \(r_{j,m}>0\) with the m-th metabolite being the substrate of the j-th reaction, and we consider only such types of constraints in this study. We used a weighted empirical distribution with samples drawn from a log-normal distribution as the prior distribution. Specifically, the prior distribution of \(\textbf{r}\) is the following:$$\begin{aligned} p(\textbf{r}= \textbf{r}^{(v)})&= \mathcal{W}\mathcal{E}\left( \textbf{r}=\textbf{r}^{(v)} | \textbf{w} \right) \nonumber \\&= w_v, \end{aligned}$$
(4)
where \(\mathcal{W}\mathcal{E}\left( \textbf{r}=\textbf{r}^{(v)} | \textbf{w} \right)\) denotes the weighted empirical distribution probability mass function with a stochastic variable \(\textbf{r}\), a weight parameter \(\textbf{w}={\left( {w_1, \ldots , w_V}\right) }^{\text{T}}\), and a parameter sample set \({\left\{ {\textbf{r}^{(v)}}\right\} }_{v=1}^{V}\), where \(w_v>0\), \(\sum _{v=1}^{V} w_v = 1\), and V is the size of the sample set. Here, we generated the parameter sample set \({\left\{ {\textbf{r}^{(v)}}\right\} }_{v=1}^{V}\) as follows:$$\begin{aligned} \textbf{r}^{(v)}&\sim \mathcal{L}\mathcal{N}\left( \varvec{\mu }, \varvec{\Sigma } \right) , \end{aligned}$$where \(\mathcal{L}\mathcal{N}\left( \varvec{\mu }, \varvec{\Sigma } \right)\) denotes the log-normal distribution with parameters \(\varvec{\mu }\in \mathbb {R}^{P}\) and \(\varvec{\Sigma }\in \mathbb {R}^{P \times P}\) (\(\varvec{\mu }\) and \(\varvec{\Sigma }\) correspond to the mean and covariance matrix parameters of the normal distribution, respectively), and \(a \sim \mathcal {D}\) indicates that a stochastic variable a is drawn from a distribution \(\mathcal {D}\). The purpose of BayesianSSA is to obtain a better distribution \(p(\textbf{r})\) for calculating positivity confidence values (Eq. (7)). Therefore, we need the marginal posterior distribution \(p(\textbf{r}|\textbf{y})\).
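A minimal sketch of this prior construction (the dimensions, the uniform weights, and the stand-in sign function are all hypothetical; the actual SSA predictions \(q_{m,j}(\textbf{r})\) are not computed here):

```python
import numpy as np

rng = np.random.default_rng(0)

P, V = 4, 1000  # hypothetical: P free elasticities, V prior samples

# Parameter sample set {r^(v)}: exponentiated multivariate normal draws,
# so every element is positive (consistent with constraints r_{j,m} > 0).
mu, Sigma = np.zeros(P), np.eye(P)
r_samples = np.exp(rng.multivariate_normal(mu, Sigma, size=V))  # shape (V, P)

# Uniform initial weights w_v of the weighted empirical distribution.
w = np.full(V, 1.0 / V)

# A positivity confidence value under this prior: the weighted fraction of
# samples whose qualitative prediction equals 1. q_signs is a placeholder
# for the SSA sign predictions q_{m,j}(r^(v)).
q_signs = np.where(r_samples[:, 0] > r_samples[:, 1], 1, -1)  # placeholder
c = float(np.sum(w * (q_signs == 1)))
assert 0.0 <= c <= 1.0
```

Replacing the uniform weights with posterior weights (Eq. (5)) turns the same expectation into the posterior positivity confidence value.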
The marginal likelihood is obtained as$$\begin{aligned} p(\textbf{y}|\textbf{r}=\textbf{r}^{(v)})&=\int p(\textbf{y}|\textbf{r}=\textbf{r}^{(v)}, \varvec{\rho })p(\varvec{\rho }) d\varvec{\rho }\nonumber \\&= \prod _{m=1}^{M+J}\prod _{j=1}^{J+L}\frac{\text{Beta}{\left( {\hat{a}_{m,j,v}, \hat{b}_{m,j,v}}\right) } }{\text{Beta}{\left( {a, b}\right) }}, \end{aligned}$$where \(\text{Beta}{\left( {\cdot , \cdot }\right) }\) is the beta function, \(\hat{a}_{m,j,v}=t_{m,j,v}+ a\), \(\hat{b}_{m,j,v}=f_{m,j,v}+ b\), and \(t_{m,j,v}\) and \(f_{m,j,v}\) are the numbers of true and false (m, j) predictions based on \(\textbf{r}^{(v)}\) in \(\textbf{y}\), respectively. Here, \(t_{m,j,v}=f_{m,j,v}=0\) for an (m, j) experiment that has not been conducted. If continuous prior and posterior distributions are desired, one can use them and obtain samples from the posterior distribution using the Markov chain Monte Carlo (MCMC) method. Since we discretized the log-normal distribution (Eq. (4)), we can calculate the posterior distribution exactly, without resorting to MCMC approximation (details are described in the “Calculating posterior distribution” section).

Calculating posterior distribution

The marginalized posterior distribution \(p(\textbf{r}| \textbf{y})\) in BayesianSSA can be calculated by normalizing the following formula:$$\begin{aligned} p(\textbf{r}= \textbf{r}^{(v)}| \textbf{y})&\propto p(\textbf{y}|\textbf{r}=\textbf{r}^{(v)}) p(\textbf{r}=\textbf{r}^{(v)}) \\&= {\left( {\prod _{m=1}^{M+J}\prod _{j=1}^{J+L}\frac{\text{Beta}{\left( {\hat{a}_{m,j,v}, \hat{b}_{m,j,v}}\right) } }{\text{Beta}{\left( {a, b}\right) }}}\right) } w_v, \end{aligned}$$We can omit calculating the normalizing constant of this formula explicitly, and we derive that$$\begin{aligned} p(\textbf{r}= \textbf{r}^{(v)}| \textbf{y}) = \frac{g(\textbf{y}, \textbf{r}^{(v)})}{\sum _{v^{\prime }=1}^{V} g(\textbf{y}, \textbf{r}^{(v^{\prime })})}, \end{aligned}$$
(5)
where$$\begin{aligned} g(\textbf{y}, \textbf{r}^{(v)})&:=w_v \prod _{(m, j) \in \Lambda }\text{Beta}{\left( {\hat{a}_{m,j,v}, \hat{b}_{m,j,v}}\right) } , \\ \Lambda&:={\left\{ {(m, j) \mid t_{m,j,v}+ f_{m,j,v}\ne 0}\right\} }. \end{aligned}$$We implemented these calculations using the log-sum-exp trick.

Calculating predictive distribution

To evaluate predictive performance, we calculate the predictive probability \(p(\textbf{y}^{\text{new}}|\textbf{y})\), where \(\textbf{y}^{\text{new}}\) is a new sample that is not used in BayesianSSA fitting. Using the posterior distribution \(p(\textbf{r}, \varvec{\rho }| \textbf{y})\) (Supplementary Section S1), the predictive probability can be obtained as$$\begin{aligned} p(\textbf{y}^{\text{new}}|\textbf{y})&= \sum _{v=1}^{V} \int p(\textbf{y}^{\text{new}}|\textbf{r}=\textbf{r}^{(v)}, \varvec{\rho }) \nonumber \\&\hspace{20mm}p(\textbf{r}=\textbf{r}^{(v)}, \varvec{\rho }|\textbf{y}) d\varvec{\rho }\nonumber \\&= \frac{\sum _{v=1}^{V} w_v\prod _{(m, j) \in \Lambda ^{\text{new}}}\text{Beta}{\left( {\hat{a}_{m,j,v}^{\text{new}}, \hat{b}_{m,j,v}^{\text{new}}}\right) }}{ \sum _{v = 1}^{V} w_{v} \prod _{(m, j) \in \Lambda ^{\text{new}}}\text{Beta}{\left( {\hat{a}_{m,j,v}, \hat{b}_{m,j,v}}\right) }} \end{aligned}$$
(6)
where$$\begin{aligned} \hat{a}_{m,j,v}^{\text{new}}&:=\hat{a}_{m,j,v}+ t_{m,j,v}^{\text{new}}, \\ \hat{b}_{m,j,v}^{\text{new}}&:=\hat{b}_{m,j,v}+ f_{m,j,v}^{\text{new}}, \\ \Lambda ^{\text{new}}&:=\Lambda \cup {\left\{ {(m, j) \mid t_{m,j,v}^{\text{new}}+ f_{m,j,v}^{\text{new}}\ne 0}\right\} }, \end{aligned}$$and \(t_{m,j,v}^{\text{new}}\) and \(f_{m,j,v}^{\text{new}}\) are the numbers of true and false (m, j) predictions based on \(\textbf{r}^{(v)}\) in \(\textbf{y}^{\text{new}}\), respectively. The detailed derivation is described in Supplementary Section S2.

Bayesian updating

Bayesian updating, a procedure in which the posterior distribution is used as the prior distribution for the next estimation, can be easily applied to BayesianSSA. In BayesianSSA, updating \(\hat{a}_{m,j,v}\) and \(\hat{b}_{m,j,v}\) to \(\hat{a}_{m,j,v}^{\text{new}}\) and \(\hat{b}_{m,j,v}^{\text{new}}\) is equivalent to Bayesian updating (Supplementary Section S3). This update also follows from the fact that the final posterior distribution in Bayesian updating does not depend on the order of the given perturbation data, due to Bayes’ theorem [37].

Positivity confidence value

The introduction of the stochastic variable \(\textbf{r}\) enables us to interpret indefinite predictions in SSA. We define positivity confidence values as the probabilities that the responses are positive for each indefinite prediction. The positivity confidence value of the (m, j) prediction for a distribution \(p^{\star }(\textbf{r})\) is written as$$\begin{aligned} c_{p}(m,j)&:=\mathbb {E}_{p^{\star }(\textbf{r})}[\mathbb {I}_{q_{m,j}(\textbf{r})=1}], \end{aligned}$$
(7)
where \(\mathbb {E}_{p^{\star }(\textbf{r})}[\cdot ]\) denotes the expectation with respect to \(p^{\star }(\textbf{r})\), and \(\mathbb {I}_{a=b}\) is an indicator that is equal to 1 if \(a=b\) and 0 otherwise. A higher positivity confidence value indicates that the qualitative response is more likely to be positive; when the positivity confidence value is below 0.5, the probability of being negative is higher than that of being positive. Note that we use the predictive distributions derived in the “Calculating predictive distribution” section, rather than positivity confidence values, to evaluate predictive performance on perturbation datasets. This is because positivity confidence values do not consider experimental error.

Used data

Metabolic network information

We utilized the central metabolic pathway of E. coli MG1655 used in a previous study [38]. This metabolic network was originally obtained from the EcoCyc database [39] and modified by Trinh et al. [40] and Toya et al. [38]. We preprocessed this dataset as follows:

1. Remove the biomass objective function.
2. Remove metabolites that have no reactions that produce or use them.
3. Remove reactions that no longer have substrates or products due to the previous processes.
4. Integrate cytoplasm and extracellular metabolites.
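On a toy stoichiometric matrix (entries hypothetical), the second step amounts to dropping metabolite rows with no nonzero coefficient, i.e., metabolites that no reaction produces or consumes:

```python
import numpy as np

# Toy stoichiometric matrix nu (hypothetical): rows = metabolites,
# columns = reactions; negative = consumed, positive = produced.
nu = np.array([
    [-1,  1,  0],   # consumed and produced: kept
    [ 0,  0,  0],   # no reaction produces or uses it: removed (step 2)
    [ 1, -1,  1],   # kept
])

# Keep only metabolites that appear in at least one reaction.
involved = (nu != 0).any(axis=1)
nu_reduced = nu[involved]
assert nu_reduced.shape == (2, 3)
```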

The second step is necessary to apply SSA to the network because it is typically impossible to calculate the inverse of the matrix \(\textbf{A}(\textbf{r})\) (Eq. (1)) when metabolites not involved in the flow are included in the network. Here, metabolites in flow refer to metabolites that serve as both inputs and outputs of reactions included in the network. The fourth step aims to reduce computation time, and this procedure does not alter the results from SSA. After these preprocessing steps, we converted the resulting network into the stoichiometric matrix \(\varvec{\nu }\). Constraints on \(\textbf{R}(\textbf{r})\) were determined solely by the metabolic network, where we set \(r_{j,m}> 0\) if the m-th metabolite is a substrate of the j-th reaction and \(r_{j,m}= 0\) otherwise. The resulting metabolite and reaction lists are shown in Supplementary Table S2 and Supplementary Table S3. We use the abbreviations in Supplementary Table S2 and Supplementary Table S3 in the following.

Synthetic data generation

We generated synthetic data to compare BayesianSSA with the random and base methods. The synthetic perturbation dataset is generated from a Bernoulli distribution whose parameters are themselves generated from beta distributions. Then, we replaced all 0 values with \(-1\) to match the support of the Bernoulli distribution to the range of the perturbation record \(y_i\) (Eq. (2)). We used GND, PTS, and PPC as the perturbation targets and succinate export (SUCCt) as the observation target. We used a beta distribution with an expectation close to 0 for GND and PTS and a beta distribution with an expectation close to 1 for PPC. These distribution settings were in accordance with previous reports [41,42,43,44]. We used \((a, b)=(0.1, 0.3)\) and \((a, b)=(0.3, 0.1)\) as the parameters of the beta distributions with expectations close to 0 and 1, respectively.
Note that these parameters are used only for data generation and differ from the parameters of the prior distributions, which are weakly informative. We used 30 as the number of records for each perturbation target. The obtained synthetic dataset is shown in Table 1.
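The generation procedure can be sketched as follows (the function name and random seed are illustrative; the beta parameters and record counts are those stated above):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_records(a, b, n, rng):
    """Draw n perturbation records: a Bernoulli parameter is drawn from
    Beta(a, b), n Bernoulli trials are drawn, and 0 is mapped to -1 to
    match the range of the perturbation record y_i (Eq. (2))."""
    p = rng.beta(a, b)                 # per-target success probability
    y = rng.binomial(1, p, size=n)     # Bernoulli draws in {0, 1}
    return np.where(y == 0, -1, 1)     # map 0 -> -1

# (a, b) = (0.1, 0.3) has expectation 0.25 (close to 0, for GND and PTS);
# (a, b) = (0.3, 0.1) has expectation 0.75 (close to 1, for PPC).
records = {target: synthetic_records(*params, 30, rng)
           for target, params in
           {"GND": (0.1, 0.3), "PTS": (0.1, 0.3), "PPC": (0.3, 0.1)}.items()}
```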
Table 1 Obtained synthetic perturbation dataset

Wet lab experiments

We conducted perturbation experiments for the reactions CS, FBP, ICL, LDH, ME1, ME2, PCK, PPC, PPS, and PTA. The genes corresponding to the reactions, listed in Table 2, were introduced into the plasmid vector pLEAD5 (NIPPON GENE CO., LTD.). We amplified the gene sequences using PrimeSTAR GXL DNA Polymerase (Takara Bio Inc.) with E. coli DH5\(\alpha\) (NIPPON GENE CO., LTD.) as a template sequence. The primer sequences were derived from NCBI Genes [45]. The nucleotide sequences of DNAs were analyzed using the BigDye\(\circledR\) Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems) with the SeqStudio\(^{\text{TM}}\) Genetic Analyzer (Applied Biosystems). The nucleotide sequence data were processed using GENETYX-Mac NETWORK software, version 15 (GENETYX CORPORATION). We introduced the constructed plasmid vectors into E. coli JM109 (NIPPON GENE CO., LTD.). The modified strains were aerobically cultured and anaerobically fermented at 37\(^{\circ }\)C in M9 minimal medium (Thermo Fisher Scientific Inc.). To measure succinate concentrations, we used the EnzyChrom Succinate Assay Kit (#ESNT-100, BioAssay Systems) and the absorbance meter SH-8000 (CORONA ELECTRIC Co., Ltd.) at 570 nm. The resulting absorbance, which is a constant multiple of the succinate concentration, was normalized by the optical density of the bacterial liquid. We calculated the differences between the resulting values and the absorbances of controls, which were obtained from a strain carrying only pLEAD5, and the signs of the values were used as the real perturbation data. We constructed and measured three replicates for each perturbed reaction. The obtained dataset is shown in Table 2. To validate the overexpression, we performed real-time PCR using the QuantStudio5 Real-time PCR system (Thermo Fisher Scientific) and confirmed that the genes were successfully overexpressed (Supplementary Table S4).
Table 2 Obtained real perturbation dataset

Evaluation

Naive Bayes model

To evaluate the effectiveness of incorporating SSA, we constructed a naive Bayes model as follows:$$\begin{aligned} p(y_i | \varvec{\eta }) = {\left\{ \begin{array}{ll} \eta _{m_i,j_i}& \text{if} \; y_i = 1\\ 1 - \eta _{m_i,j_i}& \text{if} \; y_i \ne 1 \end{array}\right. } , \end{aligned}$$where \(\eta _{m,j}\in (0, 1)\) is a parameter indicating the probability that \(y_i=1\) when the i-th experiment is of (m, j), and \(\varvec{\eta }\) is a matrix whose (m, j) element is \(\eta _{m,j}\). The predictive distribution of one new perturbation record \(y^{\text{new}}\) is as follows:$$\begin{aligned} p(y^{\text{new}}|\textbf{y}) = {\left\{ \begin{array}{ll} \frac{n_{m,j}^{(\text{p})}+1}{n_{m,j}^{(\text{p})}+n_{m,j}^{(\text{n})}+2} & \text{if} \; y^{\text{new}}=1 \\ \frac{n_{m,j}^{(\text{n})}+1}{n_{m,j}^{(\text{p})}+n_{m,j}^{(\text{n})}+2} & \text{if} \; y^{\text{new}}\ne 1 \end{array}\right. } , \end{aligned}$$
(8)
where \(n_{m,j}^{(\text{p})}\) and \(n_{m,j}^{(\text{n})}\) are the numbers of the (m, j) experiments in \(\textbf{y}\) where \(y_i=1\) and \(y_i=-1\), respectively. We used \(\mathcal {B}\left( \eta _{m,j} | 1,1 \right)\) as the prior distribution.

Cross-entropy loss

To evaluate the performance of BayesianSSA, we used the cross-entropy loss as follows:$$\begin{aligned} \text{CE}(N) = - \sum _{i=1}^{N} \log p_i(y^{\text{new}}= y_i), \end{aligned}$$where N is the number of trials, i.e., how many times data are added, and \(p_i(\cdot )\) is the i-th predictive distribution of BayesianSSA or the naive Bayes model. The i-th predictive distribution is calculated using \({\left\{ {y_{i^\prime }}\right\} }_{i^\prime = 1}^{i-1}\). To be compared with BayesianSSA, we used \(p_i(y^{\text{new}}=1) = 0.5\) as the “random method” and \(p_i(y^{\text{new}}= y_i) = p_1(y^{\text{new}}= y_i)\) as the “base method” when calculating the cross-entropy loss. Here, \(p_1(y^{\text{new}}= y_i)\) indicates the initial distribution of BayesianSSA.
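A sketch of the naive Bayes predictive probability (Eq. (8)) under the uniform \(\mathcal {B}(1,1)\) prior, together with the sequential cross-entropy loss in which the i-th prediction uses only the first \(i-1\) records (function names are illustrative):

```python
import numpy as np

def predictive(n_pos, n_neg, y_new):
    """Naive Bayes predictive probability (Eq. (8)): Laplace-smoothed
    counts of positive (n_pos) and negative (n_neg) records."""
    if y_new == 1:
        return (n_pos + 1) / (n_pos + n_neg + 2)
    return (n_neg + 1) / (n_pos + n_neg + 2)

def cross_entropy(y):
    """Sequential cross-entropy loss: the i-th prediction is made from
    the counts accumulated over records 1..i-1 only."""
    loss, n_pos, n_neg = 0.0, 0, 0
    for y_i in y:
        loss -= np.log(predictive(n_pos, n_neg, y_i))
        n_pos += y_i == 1
        n_neg += y_i == -1
    return loss

# The first record is always predicted with probability 1/2 under the
# Beta(1, 1) prior, contributing log 2 to the loss.
assert abs(cross_entropy([1]) - np.log(2)) < 1e-12
```

The “random method” corresponds to keeping the predictive probability fixed at 0.5 throughout, so its loss is always \(N \log 2\).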
