SPLANG—a synthetic poisson-lognormal-based abundance and network generative model for microbial interaction inference algorithms

Our synthetic Poisson-lognormal-based abundance and network generative model (SPLANG) draws on and extends the Poissson-lognormal generative model proposed by of Chiquet et al.27. In this section, we start with a summary of Chiquet et al. ’s methodology provided for the reader’s convenience. Our extensions to the model have been noted explicitly when introduced.Abundance-based microbe interaction inference algorithms take readings on the observable quantity of underlying abundance (typically sequencing readings) and infer interactions between every pair of microbes observed in the sample. While inferring the strength of the interaction is possible if it is proportional to the correlation measures, it is more likely to be less reliable, because taxa interactions exhibit correlated measures while correlated measures do not guarantee the existence of taxa interaction. Inference algorithms usually classify interactions between taxa into three categories denoted by \(\{-1, 0, +1\}\), where negative (-1) denotes a competitive relation, positive (+1) denotes a mutalistic relation, and (0) denotes either a commensalistic relation or having insufficient evidence to determine a competitive or mutalistic relation19,26,32. The interaction between microbes forms a directed weighted graph, where each microbe is a vertex, and the impact of one microbe on the other is an edge with weight in \(\{-1, 0, +1\}\). A generative model for microbial interaction inference algorithms is expected to generate synthetic readings of abundance based on the input of a directed-weighted graph of microbe interactions, forming the ground truth ecology that the inferencing algorithm would try to uncover. Depending on the time range of generated synthetic readings of abundance, a generative model may create readings of the abundance at multiple instants over time, or generate instantaneous abundance readings. Our SPLANG, derived from Chiquet27, generates instantaneous abundances, rather than a time series such as output from simulations based on Lotka-Volterra models21.Poisson-lognormal modelThis section summarizes the Chiquet model for reader convenience11,26,27. Our extensions to the model are presented in the following section. The Poisson-lognormal model (PLN) improved Gaussian graphical models (GGM) for count data such as microbial abundance readings11,26,27. Both PLN and GCM assume that the interaction between a pair of microbes around an instant can be reflected by partial correlation of abundance readings. Partial correlation is used instead of bivariate correlations, such as Pearson correlation, because one microbe can have interactions with multiple other microbes, and using partial correlation to measure the interaction between a pair of microbes can diminish the impact from other microbes within the group. A partial correlation \(\rho _{ij} \in [-1, 1]\) can be interpreted as

when \(\rho _{X,Y} = -1\), excluding the impact from other microbes in the group, measurements of microbes X and Y are perfectly negatively linearly correlated.

when \(\rho _{X,Y} = 0\), excluding the impact of other microbes in the group, measurements of microbes X and Y not linearly correlated, that is, measurements of microbes X and Y are not linearly correlated.

when \(\rho _{X,Y} = 1\), excluding the impact of other microbes in the group, the microbe measurements X and Y are perfectly positively linearly correlated.

For a group of p microbes, we have n independent observations of their abundance \(Y_i, i=1, \cdots , n\). Each observation \(Y_i\) is a p dimensional count vector. A PLN connects the abundance readings \(Y_i\) with the partial correlation \(\rho _{ij}\) for each pair of microbes in the following way:$$\begin{aligned} Z_i&\sim \mathcal {N}(\varvec{0}_p, \varvec{\Sigma }) \end{aligned}$$
(1)
$$\begin{aligned} Y_{ij} \,|\,Z_{ij}&\sim \mathcal {P}(e^{\mu _j + Z_{ij}}) \end{aligned}$$
(2)
$$\begin{aligned} \varvec{\Omega } = (\omega _{ij})&= \varvec{\Sigma }^{-1} \end{aligned}$$
(3)
$$\begin{aligned} \rho _{ij}&= – \frac{ \omega _{ij} }{ \sqrt{\omega _{ii} \omega _{jj}} } \end{aligned}$$
(4)
where \(Z_i\), a \(p \times 1\) vector, is the hidden state and is assumed to follow a zero-mean multivariate normal distribution with the \(p \times p\) covariance matrix \(\varvec{\Sigma }\). The inverse of the covariance matrix \(\varvec{\Omega }\) is referred to as the precision matrix with its elements denoted \(\omega _{ij}\; i, j \in \{1, \cdots , p\}\). The partial correlation \(\rho _{ij}\) between a pair of microbes can be calculated from the precision matrix, from the definition of partial correlation in Equation 4, where the precision matrix has \((\omega _{ij}) \ge 0\), \(\forall i, j.\; \rho _{ij} \le 0\).The PLN model is generic and promises to generate synthetic observations on microbe abundance and ground truth interactions to evaluate classification algorithms. Chiquet et al.27 proposed an approach to generate synthetic abundance readings in two steps: network generation and compositional data generation. The output of network generation is a precision matrix \(\varvec{\Omega }\) that embeds the latent network between species. In the compositional data generation step, Chiquet et al. introduced a model that is commonly used in community ecology and genomics to simulate common errors introduced by metagenomic sampling techniques.Network Generation The network generation step employs a network generative model, such as the Erdös-Rényi (ER) model43 or the preferential attachment model44, to generate a graph manifested as a binary adjacency matrix. The precision matrix, the inverse of the covariance matrix, is generated and drawn from a log-normal model in the compositional data generation step. The network is generated from Chiquet et al. by:

1.

Generate a binary adjacency matrix \(\varvec{G}\) using a network generative model.

2.

Compute \(\tilde{\varvec{\Omega }} = \varvec{G} \times v\).

3.

Compute \(\varvec{\Omega } = \tilde{\varvec{\Omega }} + {{\,\textrm{diag}\,}}( |\min ({{\,\mathrm{\varvec{eig}}\,}}(\tilde{\varvec{\Omega }})) |+ u)\)

where \(u, v > 0\), v is the scale factor and u is the offset factor. Chiquet et al. set \(v = 0.3\) and \(u = 0.1\) in their simulations and we used the same values for consistency with prior work. Comparing with Chiquet et al., we included preferential attachment model to generate directed graph.Both \(u, v > 0\) are scalars, where v determine the magnitude of off-diagonal items and u the magnitude of diagonal items in the precision matrix \(\varvec{\Omega }\). By definition, all elements in \(\varvec{\Omega }\) are nonnegative \(\forall i, j.\; \omega _{ij} \le 0\). Moreover, as Chiquet et al. claimed, \(\varvec{\Omega }\) is positive definite, that is, \({{\,\mathrm{\varvec{eig}}\,}}(\varvec{\Omega }) > 0\). The interaction between two microbes i, j is reflected by the partial correlation \(\rho _{ij}\), which we have to calculate from \(\varvec{\Omega }\) using Equation 4.The Erdös-Rényi model generates random undirected graphs and the preferential attachment model generates a network with the scale-free property. The generative model proposed by Chiquet et al.27, requires that the adjacency matrix, which encodes the underlying graph of the microbial interaction, be symmetric. A symmetric matrix limits the types of ecological interactions which can be simulated, as, for example, commensalistic microbes, such as scavengers would benefit from the presence of many species but would be marginally beneficial in return. Due to the symmetric positive-defined characteristic of the covariance matrix \(\varvec{\Sigma }\), the precision matrix \(\varvec{\Omega } = \varvec{\Sigma }^{-1}\) is also symmetric and positive-defined. Regardless of the composition data generation step, Chiquet’s original model can only have$$\begin{aligned} \forall i, j.\; \rho _{ij} = – \frac{ \omega _{ij} }{ \sqrt{\omega _{ii} \omega _{jj}} } < 0 \end{aligned}$$
(5)
because \(\forall i, j.\; \omega _{ij} > 0\).This indicates that, excluding the impact of other microbes, the interactions of a pair of microbes are linearly negative correlated or not linearly correlated, because the diagonal of the precision matrix \({{\,\textrm{diag}\,}}(\varvec{\Omega }) > 0\), \(\forall i.\; \rho _{ii} < 0\), which implies that, excluding the impacts of other microbes, every microbe is negatively linearly correlated with itself.Compositional data generation In the compositional data generation stage, Chiquet et al. improved Gaussian graphical models (GCM) by accounting for the sampling inherent in metagenomic sequencing. This improvement reflects the errors that can be expected to be introduced during the sequencing/sampling stage. This compositional data generation model’s expected abundance may not reflect the reality because it treats the abundance dynamics as noise around a (perhaps non-existing) mean value of abundance. However, this assumption likely holds for ecologies in a quasi-stable state. This compositional data generation step can also express the expected abundance in a linear regression model, which allows environmental factors to be incorporated. However, this regression approach, as the author noted in the paper, is limited by the accuracy of the regression.The compositional data generation stage take a precision matrix from the network generation stage as the input, and generates synthetic sequencing data with the following steps:

1.

Draw the underlying abundance \(\varvec{a}_{i} \sim \mathcal{L}\mathcal{N}(\varvec{XB}, \varvec{\Omega }^{-1})\), where \(\varvec{XB}\) is of shape \(n \times p\), represents n observations, with one row per observation, and each row contains the expected abundance for each of the of p species. The matrix \(\varvec{X}\) is the matrix of \(n \times d\) covariates, with each row indicating the value of the d covariates in the i-th observations (\(i = 1, \cdots , n\)). Covariates could be factors that may have an impact on the expected abundance, such as temperature and humidity. The matrix \(d \times p\) \(\varvec{B}\) is the corresponding matrix of regression coefficients, each column indicating the coefficients (weights) of the corresponding factor.

2.

Transform abundance \(\varvec{a}_i\) to proprotions \(\varvec{\pi }_i\) with a logistic transformation, that is, \(\pi _{ij} = \frac{ e^{a_{ij}} }{ \sum _j e^{a_{ij}} }\).

3.

For a random value of \(N_i\) that indicates the sampling effort in sample i (corresponding to the sequencing depth in reality), draw the observed counts \(Y_i\) through a multinomial distribution \(\mathcal {M}({N_i, \varvec{\pi }_i})\).

Extended poisson-lognormal generative modelThis section presents our extensions to the generative model proposed by Chiquet et al.27, which has the underlying ground truth that only accounts for the existence of microbial interactions (\(\{0, 1\}\)). Our generative pipeline extends the method proposed by Chiquet et al. by considering all three types of interaction (\(\{-1, 0, +1\}\)). Generating all three types of interaction in the underlying network and synthetic abundance reading facilitates the evaluation of algorithms that can infer all three types of interaction. We have introduced steps such as amplified correlations and adopted the pseudo-inverse technique to account for the asymmetric taxa interactions in mathematical processes. Our tractable mathematical process, which can generate asymmetric taxa interaction networks and associated abundance reading, is demonstrated as a generative pipeline with annotations on where assumptions or approximations were adopted (marked by read-dashed boxes in Fig. 1).Fig. 1Data-flow of the Generative Model Boxes denote intermediate results at each step and arrows denote transformations. Green text describes the relationships between intermediate steps and the reality the simulation is attempting to represent. Dashed red boxes annotate places where assumptions or approximations were undertaken.Our extended generative model enriches the topology of the generated taxa interaction network. We provided two network types for comparative studies: the Erdös-Rényi model or scale-free networks via preferential attachment using the Albert-Barabasi algorithm. The ER model generates a random network such that each pair of nodes is connected with a probability p, resulting in a binomial degree distribution with no obvious hub nodes in the generated network. The preferential attaching model has incoming new nodes preferentially attaching to existing nodes with a higher degree, resulting in the degree distribution following a power-law distribution, creating hub nodes which have significantly higher degree than most other nodes. Different underlying network types allow probing of how graph structure impacts inference algorithms.We also extended the Poisson-lognormal model by sampling weights as the underlying strength of interactions between pairs of taxa that have connections by essentially rewriting 1s in the adjacency matrix into \({-v, v}\), where v is the scale factor and u is the offset, with both u, v corresponding to their original meanings in the Poisson-lognormal generative model. We then applied steps similar to the Poisson-lognormal generative mode: we generate the partial correlation matrix from the correlation, and draw the abundance reading from a lognormal distribution simulating the stable status of the underlying abundance, then sampling sequencing data with a multi-nominal distribution to simulate the impact of counting effect. We applied the pseudo-inverse to ensure that precision matrices can be inverted into covariance matrices. The pseudo-inverse, or formally Moore-Penrose inverse, generalizes the inverse of an invertible matrix A with \((A^* A)^{-1}A^*\), where \(A^*\) is the Hermitian transpose (also known as conjugate transpose, generalizes the transpose for non-square matrices).SPLANG addresses the shortcomings of Poisson-lognormal generative model by considering the type and strength of the generated interactions between pairs of taxa. We also map errors at each step of the pipeline, easing the study of inferencing algorithms by allowing backtracking along the pipeline.AssumptionsInvariant interacts Our generative model assumes that the interactions between any two taxa are time invariant and can be classified into one of the three proposed by Faust: mutualistic, commensalistic, or competitive21.Deterministic interactions Our generative model assumes deterministic community dynamics rather than stochastic ones. Moreover, we assumed that equilibrium exists for the deterministic community dynamics.Using SPLANG to evaluate inference algorithmsTaxa interaction inference algorithms are hard to evaluate because the underlying interaction network for most available microbial datesets is unknown. SPLANG generates an underlying ground truth of taxa interaction network and observable abundance data. The pipeline of SPLANG, as shown in Fig. 1, allows us to access output after each transform, enabling investigation of the performance of inference algorithms under different circumstances. For example, as shown in Fig. 2, we can configure the generated taxa interaction network by specifying the network topology as preferential attaching (PA) network, or Erdös-Rényi (ER) with the same amount of vertices but varying by having similar number of edges (similar), fewer number of edges such that on average each vertex has 0.8 edges (sparse), and more number of edges such that 80% of unique pairs of vertices are connected (dense). We can further analyze the accuracy of inference algorithms on taxa with different characteristics, such as those taxa with low abundance. We can also classify taxa by their ground truth interaction with other taxa, such as taxa with no interaction with other taxa in the network, which we refers to as isolated taxa in contrast to non-isolated taxa.Because SPLANG can generate an underlying ground truth for the taxa interaction network (as expected output) and abundance reading data (as input), it can be used to evaluated the accuracy of network inference. Such paired input and expected output can also be used to help developing algorithms to fuse existing inference algorithms. Two typical fusing algorithms are logical or-combination (OR) and majority voting (MAJ). The OR combination reflects the upper-bound of accuracy for any algorithms to fuse the predictions of a given set of inference algorithms \(\{f_\xi \},\, \xi = 1, 2, \cdots , \Xi\), where an inference algorithm \(f_\xi : \varvec{a} \rightarrow \{-1, 0, +1\}\) maps abundance data into signs \(\{-1, 0, 1\}\) indicating the inferred interaction between a pair of taxa as competitive (-1), commensalistic or insignificantly mutualistic/competitive (0), and mutalistic (+1). Given abundance data \(\varvec{a}\), the OR combination considers true positive (TP) of fused inference algorithms as$$\begin{aligned} \textrm{TP} := \vee _{\xi }\; I(g_{ij} = f_\xi (\varvec{a})) \end{aligned}$$
(6)
where \(g_{ij} \in \{-1, 0, 1\}\) is the underlying interaction, \(I(\cdot )\) is the indicator function, and \(\vee _\xi\) is the logical OR over \(\xi\) items. Essentially, the OR-combination assumes the oracle algorithm picks the output of inference algorithms which correctly infer the taxa interaction if any of the algorithms has selected the correct answer, meaning OR serves as an upper bound estimate for the accuracy of a decision fusion algorithm operating across the chosen inferencing approaches. This is impossible in practice, as it requires knowing a priori which inferencing algorithm picked correctly (hence oracle algorithm), but provides an empirical upper bound for data fusion of all included inferencing algorithms. The OR algorithm provides an indication of the maximum accuracy which could be achieved fusing the outputs of the chosen algorithms. In contrast, the majority (MAJ) combination is the simplest algorithm to combine inferencing algorithms, which picks the most common answer across all algorithms, and randomly picks one if there was a tie, and can be realized in practice. The MAJ algorithm considers the true positive of fused inference algorithm as$$\begin{aligned} \textrm{TP} := I\bigg (g_{ij} = \textrm{mode}\big (\{f_\xi (\varvec{a})\}_{\xi = 1, \cdots , \Xi }\big )\bigg ) \end{aligned}$$
(7)
where \(\textrm{mode}(\cdot )\) is the mode function that returns a dataset’s most frequent value.Predicting the underlying interaction \(g_{ij} \in \{-1, 0, 1\}\) directed from taxa i to taxa j is essentially a multi-class classification problem. To demonstrate the utility of our proposed generative model in the following experiments, we employed a simple definition of accuracy as$$\begin{aligned} \textrm{Accuracy} := \frac{\sum _{c \in \{-1, 0, 1\}} \textrm{TP}_c}{\sum _{c \in \{-1, 0, 1\}} \textrm{TP}_c + \textrm{FP}_c} \end{aligned}$$
(8)
that is, the ration of total correct predictions over the total number of instances. This is essentially the micro-averaged precision, and in the multi-class classification problem, it is equal to micro-averaged recall and micro-averaged F1 score. For both the micro-averaged precision and the micro-averaged recall, the nominator is the sum of the diagonal of the confusion matrix, and the denominator ultimately represents the sum of the entire confusion matrix. In a multi-class confusion matrix organized with the predicted classes as rows and the actual classes as columns, for a class c, \(\textrm{TP}_c + \textrm{FP}_c\) represents the sum of the row corresponding to the predicted class c, while \(\textrm{TP}_c + \textrm{FN}_c\) represents the sum of the column of the actual class c. Our focus is on demonstrating the utility of the generative model; for more rigorous comparisons of inference algorithms, complex metrics such as cross entropy can be employed.Fig. 2Benchmark scenarios we evaluated four different microbes interaction inference algorithms under four scenarios: unfiltered underlying taxa interaction network (orange), non-isolated taxa interaction network (blue), non-rate taxa interaction network (green), and non-rare non-isolated taxa interaction network (purple). The data-flow of taxa filtering are shown in (a), and Venn diagram of taxa is shown in (b). Mapping between typical algorithms generated labels and types of taxa interactions in practice is illustrated in (c), where taxa interactions that are weak competitive and weak mutalistic are usually labeled the same as commensalistic. Paired underlying taxa interaction and abundance data generated by our SPLANG can help reveal and examine this type of mapping. The arrows in (a) denote the data flow and how we get non-isolated taxa (blue), non-rare taxa (green), and non-isolated non-rare taxa network (purple) from the underlying taxa interaction network (orange). The round-rectangle boxes indicate data (operand), and parallelogram boxes indicate filtering process (operator).

Hot Topics

Related Articles