C-ZIPTF: stable tensor factorization for zero-inflated multi-dimensional genomics data

Bayesian Poisson tensor factorization

Traditional tensor factorization methods based on maximum likelihood estimation (MLE) are unstable when applied to zero-inflated count data [11]. Bayesian Poisson Tensor Factorization (BPTF) extends Poisson Matrix Factorization to higher dimensions and utilizes Bayesian inference to obtain point estimates of the latent factors, offering benefits such as uncertainty quantification, realistic noise assumptions, and principled inclusion of prior information [12,13,14, 22]. This section presents a general framework for BPTF with Variational Inference (VI) for high-dimensional count data. Let \({\mathcal {X}} \in {\mathbb {R}}^{I_1 \times I_2 \times \ldots \times I_N}\) be the observed count tensor, assumed to be drawn from a Poisson distribution and to admit the CP decomposition given in Eq. (2). Let \(I=i_1i_2\ldots i_N \in {\overline{I}}=\{ i_1i_2 \ldots i_N~:~1 \le i_j \le I_j,~1\le j\le N\}\); then
$$\begin{aligned} {\mathcal {X}}_I \sim Poisson (\lambda _{I})~~~\text {where}~~\lambda _I = \mathcal {\tilde{X}}_{I}= \sum _{r=1}^{R} a^{(1)}_{i_{1}r} a^{(2)}_{i_{2}r} \ldots a^{(N)}_{i_{N}r} \approx {\mathcal {X}}_{I}. \end{aligned}$$
(4)
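To make the model in Eq. (4) concrete, the following sketch builds the rate tensor \(\lambda\) from three hypothetical factor matrices and evaluates the Poisson log-likelihood of a count tensor under it; all array names and sizes are illustrative assumptions, not part of the original work.

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical rank-4 factor matrices for a 3-way (10 x 15 x 20) count tensor.
rng = np.random.default_rng(0)
A1, A2, A3 = (rng.gamma(1.0, 1.0, size=(d, 4)) for d in (10, 15, 20))

# Eq. (4): lambda_I = sum_r a1[i1, r] * a2[i2, r] * a3[i3, r].
lam = np.einsum('ir,jr,kr->ijk', A1, A2, A3)

# Poisson log-likelihood of an observed count tensor X under these rates.
X = rng.poisson(lam)
log_lik = poisson.logpmf(X, lam).sum()
print(log_lik)
```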
BPTF uses Gamma priors to regularize the estimation of the latent factors [23,24,25]. The Gamma distribution, characterized here by a shape parameter \(\alpha >0\) and a rate parameter \(\alpha \beta > 0\), is employed as a sparsity-inducing prior [12, 13, 23]. Then for each \(a^{(k)}_{jr}\) in Eq. (4), we have:
$$\begin{aligned} a^{(k)}_{jr} \sim Gamma(\alpha , \alpha \beta ^{(k)}),~1\le k \le N, \end{aligned}$$
(5)
with expectation \(E[a^{(k)}_{jr}]=\frac{1}{\beta ^{(k)}}\) and variance \(Var[a^{(k)}_{jr}]=\frac{1}{\alpha {\beta ^{(k)}}^2}\). Given the model hyperparameter set \({\mathcal {H}} = \{\alpha,\beta^{(1)},\beta^{(2)},\ldots,\beta^{(N)}\}\), the posterior distribution \(P(A^{(1)}, A^{(2)}, \ldots , A^{(N)}~|~{\mathcal {X}}, {\mathcal {H}})\) is intractable because the evidence cannot be computed [25]. BPTF therefore uses VI and assumes a variational family of distributions \({\mathcal {Q}}_V={\mathcal {Q}}(A^{(1)},A^{(2)},\ldots,A^{(N)}; V^{(1)},\ldots,V^{(N)})\) indexed by a set of variational parameters \(V^{(k)}, 1\le k \le N\) [25, 26]. We employ a fully factorized mean-field approximation, assuming that \(\mathcal {Q_V}(A^{(1)},A^{(2)},\ldots , A^{(N)})= \prod _{k=1}^{N} \mathcal {Q_V}(A^{(k)}; V^{(k)}),\) where \({\mathcal {Q}}(a^{(k)}_{jr}; V^{(k)}_{jr}) = Gamma(a^{(k)}_{jr}; \gamma ^{(k)}_{jr}, \delta ^{(k)}_{jr}), 1\le k \le N.\) The variational family \({\mathcal {Q}}\) used here is similar to the one employed in Bayesian Poisson Matrix Factorization [23, 27, 28]. BPTF fits the variational parameters by minimizing the Kullback–Leibler (KL) divergence between the true posterior distribution and \(\mathcal {Q_V}\), which is equivalent to maximizing the evidence lower bound (ELBO) [12, 25, 26]:
$$\begin{aligned} ELBO(V)= E_{Q_V}\big [\log \big (P({\mathcal {X}}, A^{(1)},A^{(2)},\ldots , A^{(N)}~|~{\mathcal {H}})\big )\big]+ H(Q_V), \end{aligned}$$
(6)
where \(H(Q_V)\) is the entropy of \(Q_V.\) Coordinate ascent algorithms are commonly used to maximize the ELBO by iteratively optimizing each variational parameter while fixing the others until convergence, monitored by the relative change in the ELBO [25, 26]. From Eq. (4), the total count \(n= \sum _{I \in {\overline{I}}} {\mathcal {X}}_{I} \sim Poisson(\Lambda )\) where \(\Lambda =\sum _{I \in {\overline{I}}} \lambda _I.\) We can use the Poisson-Multinomial connection to express \({\mathcal {X}}\) given n as \(Multinomial(n, \pi )\) where \((\pi )_{I}= \frac{\lambda _I}{\Lambda }\), and update the variational parameters using this auxiliary distribution [13, 23, 25]:
$$\begin{aligned} \gamma^{(k)}_{jr}= & {} \alpha + \sum _{\begin{array}{c} i_1 i_2 \ldots i_N~\in~{\overline{I}} \\ i_k=j \end{array}} {\mathcal {X}}_{i_{1}i_{2}\ldots i_{N}} \frac{{\mathbb {G}}_{Q_V} \big [\prod _{s=1}^{N} a^{(s)}_{i_sr}\big ]}{\sum _{t=1}^{R} {\mathbb {G}}_{Q_V}\big [\prod _{s=1}^{N} a^{(s)}_{i_st}\big ]}, \end{aligned}$$
(7)
$$\begin{aligned} \delta ^{(k)}_{jr}= & {} \alpha \beta ^{(k)} + \sum _{i_1 i_2 \ldots i_N~\in~{\overline{I}}} E_{Q_V} \big [ \prod _{ 1\le s \ne k \le N}a^{(s)}_{i_sr}~\big ], \end{aligned}$$
(8)
where \(E_{Q_V}[.]\) and \({\mathbb {G}}_{Q_V}=exp\big(E_{Q_V}\big[log(.)\big]\big)\) denote arithmetic and geometric expectations. Since \(Q_V\) is fully factorized, the expectations in Equations (7) and (8) can be expressed as a product of individual expectations [25]. Specifically, for \(a^{(s)}_{i_sr},\)$$\begin{aligned} E_{Q_V}[a^{(s)}_{i_{s}r}]=\frac{\gamma ^{(s)}_{{i_s}r} }{\delta ^{(s)}_{{i_s}r}}~~\text {and}~~{\mathbb {G}}_{Q_V}[a^{(s)}_{{i_s}r}]=\frac{exp\big(\Psi (\gamma ^{(s)}_{{i_s}r} )\big)}{\delta ^{(s)}_{{i_s}r}}, \end{aligned}$$
(9)
where \(\Psi\) is the digamma function (logarithmic derivative of the gamma function). An empirical Bayes approach can be used to update the hyperparameters \(\beta ^{(k)}, 1\le k \le N\), in conjunction with the variational parameters [23, 25]:$$\begin{aligned} \beta ^{(k)}= \big ( \sum _{j=1}^{I_k} \sum _{r=1}^{R} {\mathbb {E}}_{Q_V}[a^{(k)}_{jr}]\big )^{-1}. \end{aligned}$$
(10)
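Equations (7), (8), and (10), together with the expectations in Eq. (9), suggest a simple coordinate-ascent loop. Below is a minimal sketch for the three-way case; the variable names, the dense allocation tensor, and the fixed iteration count (in place of monitoring the ELBO) are simplifications for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.special import digamma

def bptf_cavi(X, R, alpha=0.1, n_iter=200, seed=0):
    """Coordinate-ascent VI for a 3-way Bayesian Poisson CP model (illustrative sketch).

    X : non-negative count tensor of shape (I1, I2, I3); R : number of components.
    Returns the variational shape/rate parameters (gamma, delta) for each mode.
    """
    rng = np.random.default_rng(seed)
    dims = X.shape
    gam = [alpha + rng.gamma(alpha, 1.0, size=(d, R)) for d in dims]   # gamma^(k)
    dlt = [np.full((d, R), alpha) for d in dims]                       # delta^(k)
    beta = [1.0, 1.0, 1.0]                                             # beta^(k), Eq. (10)

    for _ in range(n_iter):
        E = [g / d for g, d in zip(gam, dlt)]                          # arithmetic expectations, Eq. (9)
        G = [np.exp(digamma(g)) / d for g, d in zip(gam, dlt)]         # geometric expectations, Eq. (9)
        # Poisson-Multinomial allocation: responsibilities over components for every entry.
        resp = np.einsum('ir,jr,kr->ijkr', G[0], G[1], G[2])
        resp /= resp.sum(axis=-1, keepdims=True).clip(1e-12)
        alloc = X[..., None] * resp                                    # expected counts per component
        for k in range(3):
            other_axes = tuple(a for a in range(3) if a != k)
            gam[k] = alpha + alloc.sum(axis=other_axes)                # Eq. (7)
            rate = np.ones(R)
            for s in range(3):
                if s != k:
                    rate *= E[s].sum(axis=0)
            dlt[k] = alpha * beta[k] + np.tile(rate, (dims[k], 1))     # Eq. (8)
            E[k] = gam[k] / dlt[k]
            beta[k] = 1.0 / E[k].sum()                                 # empirical Bayes, Eq. (10)
    return gam, dlt
```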
The variational inference algorithm for BPTF is thus fully specified by the update equations (7), (8), and (10).

Zero-inflated Poisson tensor factorization (ZIPTF)

Poisson models may not always be sufficient to model count data with excess zeros, and zero-inflated models can often provide a better fit [16, 17]. The Zero-Inflated Poisson (ZIP) model assumes that the counts in the tensor \({\mathcal {X}}\) can be modeled as a mixture of a point mass at zero and a Poisson distribution with parameter \(\lambda\). Let \({\mathcal {X}}\) be a count tensor in \({\mathbb {R}}^{I_1 \times I_2 \times \ldots \times I_N}.\) We define the index set \({\overline{I}}\) as the collection of all possible indices, i.e., \({\overline{I}}=\{ i_1i_2 \ldots i_N~:~1\le i_j \le I_j,~1\le j\le N\}.\) We say \({\mathcal {X}}\) has a Zero-inflated Poisson (ZIP) distribution if for every \(I \in {\overline{I}}:\)
$$\begin{aligned} P({\mathcal {X}}_{I}=x_{I})= p_I {\mathbb {1}_{x_{I}=0}} + (1-p_I) \frac{e^{-\lambda _I } \lambda _I ^{x_{I}} }{x_{I}!}, \end{aligned}$$
(11)
where the outcome variable \(x_I\) takes non-negative integer values, \(\lambda _I\) is the expected Poisson count, and \(p_I\) is the probability of extra zeros [17]. We abbreviate this as \({\mathcal {X}}_I \sim ZIP (\lambda _I, p_I).\) A ZIP random variable can be constructed from a Poisson random variable \({\mathcal {Y}}_{I} \sim Poisson(\lambda _I)\) and an independent Bernoulli variable \(\Phi _I \sim Bernoulli(p_I)\) as \({\mathcal {X}}_{I} = (1-\Phi _I)\,{\mathcal {Y}}_{I}\) [15]: the Bernoulli variable \(\Phi _I\) takes the value 1 when \({\mathcal {X}}_{I}\) is an excess zero generated by the Bernoulli component, and the value 0 otherwise. We consider the low rank \(R \ge 1\) decomposition of the zero-inflated count tensor \({\mathcal {X}}:\)
$$\begin{aligned} {\mathcal {X}} \approx \mathcal {\tilde{X}} = \sum _{r=1}^{R} a^{(1)}_{r}\otimes a^{(2)}_{r} \otimes \dots \otimes a^{(N)}_{r}. \end{aligned}$$
(12)
Hence, for \(I=i_1 i_2\ldots i_N\), the reconstruction \(\sum _{r=1}^{R} a^{(1)}_{i_{1}r} a^{(2)}_{i_{2}r} \ldots a^{(N)}_{i_{N}r}\) can be interpreted as the rate \(\lambda _I\) of the Poisson component of the distribution from which the observed count \({\mathcal {X}}_{I}\) is assumed to be sampled. Then we have:
$$\begin{aligned} {\mathcal {X}}_{I} \sim ZIP (\lambda _I= \sum _{r=1}^{R} a^{(1)}_{i_1r} a^{(2)}_{i_2r} \ldots a^{(N)}_{i_{N}r}, p_{I}). \end{aligned}$$
(13)
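To illustrate the generative model in Eqs. (12) and (13), the following sketch draws a synthetic zero-inflated count tensor from hypothetical Gamma-distributed factor matrices; the shared dropout probability and all sizes are assumptions made only for this example.

```python
import numpy as np

def sample_zip_tensor(factors, p_zero=0.3, seed=0):
    """Draw a 3-way count tensor from the ZIP model of Eqs. (12)-(13).

    factors : three non-negative arrays of shapes (I1, R), (I2, R), (I3, R)
    p_zero  : probability of an excess (structural) zero, shared across entries here
    """
    rng = np.random.default_rng(seed)
    A1, A2, A3 = factors
    lam = np.einsum('ir,jr,kr->ijk', A1, A2, A3)   # Poisson rates lambda_I (Eq. 13)
    y = rng.poisson(lam)                            # Poisson component Y_I
    phi = rng.random(lam.shape) < p_zero            # Bernoulli component Phi_I
    return np.where(phi, 0, y)                      # X_I = (1 - Phi_I) * Y_I

# Example: rank-3 factors for a 20 x 30 x 40 tensor; zeros arise from both dropout and low rates.
rng = np.random.default_rng(1)
factors = [rng.gamma(2.0, 1.0, size=(d, 3)) for d in (20, 30, 40)]
X = sample_zip_tensor(factors, p_zero=0.4)
print(X.shape, round((X == 0).mean(), 3))
```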
Variational inference for ZIPTF

For a given position \(I=i_1 i_2 \ldots i_N,\) we consider the rank R decomposition in Eq. (13). In Bayesian Poisson factorizations, the Gamma distribution is utilized as a sparsity-inducing prior, and each latent factor matrix \(A^{(k)} = [a^{(k)}_{1}\ldots a^{(k)}_{R}] \in {\mathbb {R}}_{+}^{I_k \times R},~1 \le k \le N\), is assumed to follow a Gamma distribution [13, 23]. Therefore, for each \(a^{(k)}_{jr}\) in Eq. (13), we have:
$$\begin{aligned} a^{(k)}_{jr} \sim Gamma(\alpha ^{(k)}, \beta ^{(k)}), \quad 1\le k \le N, \end{aligned}$$
(14)
where \(\alpha ^{(k)} >0\) and \(\beta ^{(k)}> 0\) represent the shape and rate parameters of the distribution, with expectation \(E[a^{(k)}_{jr}]=\frac{\alpha ^{(k)} }{\beta ^{(k)}}\) and variance \(Var[a^{(k)}_{jr}]=\frac{\alpha ^{(k)}}{{\beta ^{(k)}}^2}\). Additionally, for ZIP models a latent variable \(\xi\) is introduced to capture the probability of extra zeros \(p_I\) that specifies \(\Phi _I \sim Bernoulli (p_I)\) [16, 25]. Let S(.) denote the logistic sigmoid function, given by \(S(x)=\frac{1}{1 + e^{-x}}\); then:
$$\begin{aligned} \xi = S(\zeta )~~\text {where}~~\zeta \sim Normal(\mu , \sigma ). \end{aligned}$$
(15)
Let \(Z= \{A^{(1)}, A^{(2)}, \ldots , A^{(N)}, \Phi \}\) and consider the posterior distribution \(P(Z~|~{\mathcal {X}}, {\mathcal {H}})\) given the model hyperparameter set \({\mathcal {H}} = \{\alpha ^{(1)}, \beta ^{(1)}, \alpha ^{(2)}, \beta ^{(2)}, \ldots , \alpha ^{(N)}, \beta ^{(N)}, \mu , \sigma \}.\) Variational inference approximates the true posterior distribution using a family of probability distributions \({\mathcal {Q}}\) over the hidden variables [25]. This family of distributions is characterized by free parameters, and the key assumption is that each latent variable is independently distributed given these parameters. We assume a variational family of distributions \({\mathcal {Q}}\) indexed by a set of variational parameters \(V=\{\gamma ^{(1)}, \delta ^{(1)},\gamma ^{(2)}, \delta ^{(2)}, \dots , \gamma ^{(N)}, \delta ^{(N)}, {\overline{\mu }}, {\overline{\sigma }}\}\), where \((\gamma ^{(k)}, \delta ^{(k)})\) are the variational shape and rate parameters of the Gamma distribution for the latent factor along the k-th mode, and \(({\overline{\mu }}, {\overline{\sigma }})\) are the variational parameters for \(\zeta\). We use a fully factorized mean-field approximation [25], and the variational distribution factorizes as follows:
$$\begin{aligned} {\mathcal {Q}} (A^{(1)},A^{(2)},\ldots , A^{(N)}, \Phi )= {\mathcal {Q}}( \Phi ; {\overline{\mu }}, {\overline{\sigma }}) \prod _{k=1}^{N} {\mathcal {Q}}(A^{(k)};\gamma ^{(k)}, \delta ^{(k)}), \end{aligned}$$
(16)
where \(a^{(k)}_{jr} \sim Gamma( \gamma ^{(k)}_{jr},\delta ^{(k)}_{jr})\) and \(\Phi _{I} \sim Bernoulli\big(S(\zeta )\big)~\text {for}~\zeta \sim Normal({\overline{\mu }}, {\overline{\sigma }})\). The goal is to choose the member \(q^{*}\) of the variational family that minimizes the KL divergence to the exact posterior:
$$\begin{aligned} q^{*}(Z) = \arg \min _{q(Z)~\in~{\mathcal {Q}}} D_{KL}\big( q(Z)~ \big \Vert ~ P(Z~|~{\mathcal {X}}, {\mathcal {H}})\big) . \end{aligned}$$
(17)
Upon examining the KL divergence, we encounter a significant challenge: it involves the true posterior distribution \(P(Z~|~{\mathcal {X}}, {\mathcal {H}})\), which is not known. Nevertheless, we can rewrite the KL divergence as follows:
$$\begin{aligned} D_{KL} \big( q(Z)~\big \Vert ~P(Z~|~{\mathcal {X}}, {\mathcal {H}})\big)= & {} \int q(Z) \log \left( \frac{q(Z) }{P(Z~|~{\mathcal {X}}, {\mathcal {H}} )} \right) dZ \end{aligned}$$
(18)
$$\begin{aligned}= & {} \int q(Z) \log \left( \frac{q(Z) P({\mathcal {X}}, {\mathcal {H}} )}{P(Z,{\mathcal {X}}, {\mathcal {H}})} \right) dZ \end{aligned}$$
(19)
$$\begin{aligned}= & {} \log \big( P({\mathcal {X}}, {\mathcal {H}} )\big) \int q (Z) dZ- \int q (Z) \log \left( \frac{P(Z,{\mathcal {X}}, {\mathcal {H}} )}{ q(Z)}\right) dZ~~~~~~~~ \end{aligned}$$
(20)
$$\begin{aligned}= & {} \log \big( P({\mathcal {X}}, {\mathcal {H}} )\big) - \int q (Z) \log \left( \frac{P(Z,{\mathcal {X}}, {\mathcal {H}} )}{ q(Z)}\right) dZ. \end{aligned}$$
(21)
The second term in Eq. (21) is called the Evidence Lower Bound (ELBO). Since the KL divergence is non-negative, we have \(\log \big( P({\mathcal {X}}, {\mathcal {H}} )\big) \ge ~\text {ELBO}\big(q(Z)\big)=\int q (Z) \log \left( \frac{P(Z,{\mathcal {X}}, {\mathcal {H}}) }{ q (Z)}\right) dZ.\)
$$\begin{aligned} \text {ELBO}\big(q(Z)\big) &= \int q(Z) \log \big( P(Z,{\mathcal {X}}, {\mathcal {H}} )\big) dZ - \int q(Z) \log \big( q(Z)\big) dZ \end{aligned}$$
(22)
$$\begin{aligned} &= \mathbb{E}_{q(Z)} \big[\log \big( P({\mathcal {X}},Z, {\mathcal {H}})\big) \big]-\mathbb{E}_{q(Z)}\big[\log q(Z)\big]. \end{aligned}$$
(23)
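The objective in Eq. (23) is what off-the-shelf stochastic (black-box) variational inference tools maximize using Monte Carlo gradient estimates, as discussed in the next paragraph. Purely as an illustration (the paper does not prescribe this library, and all names and hyperparameters below are assumptions), a three-way ZIPTF model with the priors of Eqs. (14) and (15) and the mean-field family of Eq. (16) could be written in Pyro as:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam
from torch.distributions import constraints

def model(X, R, alpha=0.1, beta=0.1):
    # Gamma priors on the factor matrices (Eq. 14) and a logit-normal latent for the
    # extra-zero probability (Eq. 15); X is a 3-way count tensor.
    A = []
    for k, Ik in enumerate(X.shape):
        A.append(pyro.sample(f"A{k}", dist.Gamma(alpha, beta).expand([Ik, R]).to_event(2)))
    zeta = pyro.sample("zeta", dist.Normal(0.0, 1.0))
    rate = torch.einsum("ir,jr,kr->ijk", A[0], A[1], A[2])
    gate = torch.sigmoid(zeta)                        # xi = S(zeta), probability of extra zeros
    pyro.sample("X", dist.ZeroInflatedPoisson(rate, gate=gate).to_event(3), obs=X)

def guide(X, R, alpha=0.1, beta=0.1):
    # Fully factorized mean-field family of Eq. (16): Gamma factors and a Normal for zeta.
    for k, Ik in enumerate(X.shape):
        g = pyro.param(f"gamma{k}", torch.ones(Ik, R), constraint=constraints.positive)
        d = pyro.param(f"delta{k}", torch.ones(Ik, R), constraint=constraints.positive)
        pyro.sample(f"A{k}", dist.Gamma(g, d).to_event(2))
    mu = pyro.param("mu_bar", torch.tensor(0.0))
    sigma = pyro.param("sigma_bar", torch.tensor(1.0), constraint=constraints.positive)
    pyro.sample("zeta", dist.Normal(mu, sigma))

# Noisy-gradient (black-box) maximization of the ELBO via Monte Carlo samples from the guide.
svi = SVI(model, guide, Adam({"lr": 0.05}), loss=Trace_ELBO())
# for step in range(2000):
#     svi.step(X_counts, R=8)   # X_counts: a torch tensor of observed counts (hypothetical data)
```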
The evidence lower bound converts an intractable inference problem into an optimization problem that can be tackled using gradient-based methods [25]. Coordinate ascent algorithms are frequently employed to maximize the evidence lower bound [16, 25]. However, these algorithms require tedious derivations and may not scale well to very large datasets [29, 30]. Closed-form coordinate-ascent updates are applicable to conditionally conjugate exponential family models, but they necessitate analytic computation of various expectations for each new model [29, 30]. Stochastic Variational Inference (SVI) [29] offers a more efficient algorithm by incorporating stochastic optimization [31], using noisy estimates of the gradient of the objective function. To maximize the ELBO, we employ a stochastic optimization algorithm known as the Black Box Inference Algorithm [30]. This algorithm stochastically optimizes the variational objective using Monte Carlo samples from the variational distribution to compute the noisy gradient (see Sect. 2 of [30] for details), thereby alleviating the burden of analytic computations and providing a more efficient approach to ELBO maximization.

Fig. 2 Overview of the consensus meta-analysis approach discussed in Section Generic consensus-based tensor factorization for the 3-way tensor \({\mathcal {X}}\)

Generic consensus-based tensor factorization

Selecting the number of components in tensor factorization is challenging [10, 21]. The dependence on initial guesses for the latent factors can lead to substantially different factor sets across repeated runs, making the results difficult to interpret [10, 19, 21]. Traditional approaches involve selecting the minimum R in Eq. (12) for which the explained variance of the approximation, defined as follows, surpasses a predetermined threshold:
$$\begin{aligned} \text {explained variance}=1- \frac{||{\mathcal {X}}- \mathcal {\tilde{X}}||_{F}}{||{\mathcal {X}}||_{F}}. \end{aligned}$$
(24)
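A direct implementation of Eq. (24) is short; here X_hat stands for the reconstruction \(\mathcal {\tilde{X}}\) of Eq. (12):

```python
import numpy as np

def explained_variance(X, X_hat):
    """Explained variance of a CP reconstruction, as defined in Eq. (24)."""
    # With no axis argument, np.linalg.norm of an N-way array is its Frobenius norm.
    return 1.0 - np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```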
As the rank R increases, the explained variance also increases, owing to the increased flexibility to capture the intricacies of the original tensor \({\mathcal {X}}\). However, depending solely on explained variance for selecting the rank may not always yield the best outcomes. Higher ranks can lead to overfitting, capturing noise instead of meaningful data patterns and thus inflating the explained variance. Additionally, the increased computational complexity of higher ranks may not justify the marginal improvement in explained variance. Moreover, interpreting factors from high-rank tensor decompositions becomes more challenging, potentially obscuring the underlying data structure. Our goal is not solely to improve the explained variance, but also to ensure the interpretability and stability of the factors. We therefore use additional metrics to assess the stability of the factors, namely the cophenetic correlation and the silhouette score; further details are provided below.

We generalize the consensus meta-analysis approach previously used for matrix factorization [20] and include novel techniques to enhance stability. An overview of the proposed pipeline is depicted in Fig. 2. In the remainder of this section, we refer to the steps \(\textcircled {1}\)-\(\textcircled {5}\) given in the figure. Running a generic rank R factorization given in Eq. (12) for \({\mathcal {X}} \in {\mathbb {R}}^{I_1 \times I_2 \times \ldots \times I_N}\) with M different random seeds yields the sets of non-negative factor matrices \(\{(A^{(1)})_{m}, (A^{(2)})_{m},\ldots , (A^{(N)})_{m} \}, ~1\le m \le M\) (Step \(\textcircled {1}\)). For a chosen modality \(k~(1\le k\le N)\), we aggregate and normalize the factor matrices from the independent runs (Step \(\textcircled {2}\)):
$$\begin{aligned} \overline{A^{(k)}}=\Big [ \frac{(A^{(k)})_{1}}{||(A^{(k)})_{1}||_{F}}~\frac{(A^{(k)})_{2}}{||(A^{(k)})_{2}||_{F}}~\ldots \frac{(A^{(k)})_{M}}{||(A^{(k)})_{M}||_{F}} \Big ] \in {\mathbb {R}}^{I_k \times (R \cdot M)}. \end{aligned}$$
(25)
The cophenetic correlation coefficient, a commonly employed metric for selecting ranks in matrix factorizations [32], assumes a one-to-one mapping between features and factors, assigning each feature to the factor on which it has its maximum loading, and evaluates the stability of the inferred latent signals across different runs. While this approach is valuable, the underlying assumption can break down when a feature contributes significantly to multiple factors. Despite this limitation, the cophenetic correlation coefficient remains a valuable tool for rank selection in many cases.

Our method for selecting the rank and evaluating factorization stability involves clustering the column factors of the aggregated matrices and fixing the initial guess to ensure reliability. First, we perform K-means clustering [33] on the columns of the aggregated factor matrix \(\overline{A^{(k)}}\) with \(K=R\) (Step \(\textcircled {3}\)). The resulting cluster sets are given as \(C^{(k)} _{i}= \{ \text {columns of}~\overline{A^{(k)}}~\text {assigned to cluster}~ i\}, 1\le i \le R.\) The Local Outlier Factor algorithm [34] is used to remove outliers by considering the local density deviation of a data point compared to its neighbors. We evaluate the goodness of the clustering with the silhouette coefficient [35], computed as \((b-a)/\max (a,b)\), where a is the average intra-cluster distance and b is the average distance to the points of the nearest neighboring cluster. The silhouette coefficient ranges from -1 to 1, with higher values indicating more coherent clusters. After clustering, we obtain the consensus factors \(f^{(k)}_{C_i}\), \(1\le i\le R\), by computing the median of the factors in each cluster (Step \(\textcircled {4}\)) and form the consensus matrix:
$$\begin{aligned} \overline{A^{(k)}}_{C}=[ f^{(k)}_{C_1} f^{(k)}_{C_2} \ldots f^{(k)}_{C_{R}} ] \in {\mathbb {R}}^{I_k \times R}, 1\le k \le N. \end{aligned}$$
(26)
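Steps \(\textcircled {2}\) to \(\textcircled {4}\) amount to normalizing and pooling the per-run factors, clustering their columns, pruning outliers, and taking cluster medians. The sketch below renders these steps with scikit-learn; the function name, the choice of the number of neighbors, and the handling of outliers are illustrative assumptions rather than the released C-ZIPTF code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import silhouette_score

def consensus_factors(factor_runs, R, seed=0):
    """Build the mode-k consensus matrix of Eq. (26) from M independent runs (Eq. 25).

    factor_runs : list of M arrays, each of shape (I_k, R), one per random restart (Step 1)
    Returns the (I_k, R) consensus matrix and the silhouette score of the clustering.
    """
    # Step 2: normalize each run by its Frobenius norm and aggregate column-wise.
    agg = np.hstack([A / np.linalg.norm(A) for A in factor_runs])       # shape (I_k, R*M)
    cols = agg.T                                                         # one row per column factor

    # Prune outlying columns via their local density deviation (Local Outlier Factor).
    keep = LocalOutlierFactor(n_neighbors=min(20, len(cols) - 1)).fit_predict(cols) == 1
    cols = cols[keep]

    # Step 3: K-means with K = R on the pooled columns; assess coherence with the silhouette score.
    labels = KMeans(n_clusters=R, n_init=10, random_state=seed).fit_predict(cols)
    sil = silhouette_score(cols, labels)

    # Step 4: the median of each cluster defines one consensus factor f_{C_i}.
    consensus = np.column_stack([np.median(cols[labels == r], axis=0) for r in range(R)])
    return consensus, sil
```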
We perform the decomposition using \(\overline{A^{(k)}}_{C}\) as the fixed initial guess for the k-th modality to obtain the final factor matrices (Step \(\textcircled {5}\)). Notice that if ZIPTF is employed as the factorization method in Step \(\textcircled {1}\) described above, we refer to the resulting factorization as C-ZIPTF.

In summary, when selecting the rank R, we ensure that the explained variance surpasses a predefined threshold of 0.9. Additionally, we evaluate the cophenetic correlation across various ranks to validate our selection, aiming for a score higher than 0.9. Finally, at the candidate rank, we assess the silhouette score to determine whether the optimal number of clusters of latent factors from multiple runs of the factorization aligns with the chosen rank.

It is important to emphasize that achieving the mathematically “ideal” rank does not ensure optimal performance in capturing biologically relevant signals. To ensure that biological signals are adequately preserved, it is essential to complement mathematical optimization with domain knowledge, exploratory analysis, and validation against biological criteria. When assessing the sensitivity of the parameter R, it is also important to consider the implications from a biological standpoint. Opting for a rank that is too low may lead to loss of important biological information, while an excessively high rank may result in overfitting. It is crucial to recognize that factorizing at a given rank still captures the true signal, but potentially at a different resolution: lower ranks may capture higher-level shared signals, while higher ranks may reveal finer subgroups. For example, in a 2-way case with gene expression count data for different donors (donors \(\times\) genes), factorizing with \(R=1\) would produce a single gene latent factor based on the average expression of genes. Increasing the rank may reveal distinct groups, such as disease/healthy, while even higher ranks may uncover finer subgroups within these categories. At some point, we would start capturing donor-specific signals. In Section Identification of cell type identity and perturbation-specific programs using C-ZIPTF in single-cell RNA-seq data, we examine biological signals obtained through factorization at different ranks.
