Taxanorm: a novel taxa-specific normalization approach for microbiome data | BMC Bioinformatics

The overall workflow of TaxaNorm is depicted in Supplementary Figure 1. Each step of the algorithm is detailed below.Varying-dispersion zero-inflated negative binomial modelWe assume the observed taxa counts follow a ZINB distribution, which is a mixture of a negative binomial (NB) distribution of counts and a mass distribution at zero. The excess of zeros in microbiome data are handled in two ways: structural (biological) zeros through the mass distribution at zero and sampling zeros through the NB distribution. For a given taxon i \((i=1,…,p)\) in sample j \((j=1,…,n)\), let \(Y_{ij} \sim \textbf{ZINB}(\mu _{ij}, \theta _{ij}, \pi _{ij})\) denote the observed counts so that$$\begin{aligned} Y_{ij} \sim \{ \begin{array}{rcl} 0, & \text{ with } \text{ probability } \pi _{ij} \\ \textbf{NB}(\mu _{ij}, \theta _{ij}), & \text{ with } \text{ probability } 1 – \pi _{ij} \end{array}, \end{aligned}$$
(1)
where \(\mu _{ij}\) and \(\theta _{ij}\) are the mean and dispersion of the NB distribution and \(\pi _{ij}\) is the probability of zero mass, or the called zero-inflation parameter. Under this parameterization, the variance of NB distribution is \(\sigma _{ij}^2 = \mu _{ij} + \frac{\mu _{ij}^2}{\theta _{ij}}\). In particular, this NB distribution converges to a Poisson distribution when \(\theta _{ij} \rightarrow \infty\).To account for both sample- and taxon-specific effects of sequencing depth on counts, for a given taxon i, we have$$\begin{aligned}\log (\mu _{ij}) &= \beta _{i,0} + \beta _{i,1} X_j \nonumber \\\log (\theta _{ij}) &= \kappa _{i,0} + \kappa _{i,1} X_j \nonumber \\\log (\frac{\pi _{ij}}{1-\pi _{ij}}) &= \gamma _{i}, \end{aligned}$$
(2)
where \(X_j=\log (\sum _{i}y_{ij})\) is the log of sequencing depth for sample \(j=1,…,n\). This formulation allows the taxon-specific impact of sequencing depth on mean count (\(\beta _{i,1}\)) and dispersion (\(\kappa _{i,1}\)) to better capture the between-taxa variation and high heterogeneity in microbiome data compared to existing methods. Although many experimental and biological factors are linked with true zeros, we assume the zero-inflation parameter \(\pi _{ij}\) is taxon-specific only and is common across samples (j) since numerical evidence shows simpler models tend to have better model-fitting performance [42, 43]. Under the special case that \(\beta _{i,1}\) \((i=1,…,p)\) is equal to 1 for all taxa and \(\kappa _{i,1}=0\) \((i=1,…,p)\), TaxaNorm operates under a similar model assumption as most scaling normalization methods. On the other hand, when \(\kappa _{i,1}=0\) \((i=1,…,p)\), this is similar to the concept behind CSS and Wrench that dispersion does not change with sequencing depth.Parameter estimationWe fit the model and estimate the parameters for each taxon individually. For taxon i, given the observed counts \(Y_i=\{y_{ij}, j=1,…,n\}\), and the sequencing depth \(X=\{x_j, j=1,…,n\}\), the log-likelihood can be written as follows:$$\begin{aligned} \begin{aligned} l(\Lambda ; Y_i)&= \sum _{j=1}^{n} \log \{\pi _{ij} \textbf{I}\left( y_{i j}=0\right) \\&\quad + (1-\pi _{ij}) \frac{\Gamma (y_{ij}+\theta _{ij})}{\Gamma (y_{ij}+1)\Gamma (\theta _{ij})}(\frac{\mu _{ij}}{\theta _{ij}+\mu _{ij}})^{y_{ij}}(\frac{\theta _{ij}}{\theta _{ij}+\mu _{ij}})^{\theta _{ij}}\}, \end{aligned} \end{aligned}$$
(3)
where \(\Lambda = (\beta _{i,0}, \beta _{i,1}, \gamma _{i}, \kappa _{i,0}, \kappa _{i,1})\) denotes the full set of unknown parameters in (2), and \(\textbf{I}\left( \cdot \right)\) is an indicator function.In practice, directly maximizing (3) causes difficulty when distinguishing zeros from the NB part and the zero-inflation part, leading to an unreasonably low estimation of \(\pi _{ij}\) [42]. Therefore, we used the expectation-maximization (EM) algorithm to obtain the maximum likelihood estimations (MLEs) \(\hat{\Lambda } = (\hat{\beta }_{i,0}, \hat{\beta }_{i,1}, \hat{\gamma }_{i}, \hat{\kappa }_{i,0}, \hat{\kappa }_{i,1})\). We defined a latent random variable, \(Z_{ij}\), to indicate whether \(Y_{ij}\) is generated from the zero mass (\(Z_{ij}=1\)) or NB count (\(Z_{ij}=0\)). The log-likelihood now becomes$$\begin{aligned} \begin{aligned} l_{c}(\Lambda ; Y_i, Z)&= \sum _{j=1}^{n} \{z_{ij} \log \pi _{ij}(\Lambda ) +\left( 1-z_{ij}\right) \log \left( 1-\pi _{ij}(\Lambda )\right) \\&\quad +\left( 1-z_{ij}\right) \log f_{nb}\left( y_{ij}; \mu _{i j}(\Lambda ), \theta _{i j}(\Lambda )\right) \}, \end{aligned} \end{aligned}$$
(4)
where \(f_{nb}\) denotes the probability mass function (PMF) of the NB distribution.We set the starting values (\(\hat{\beta }^{(0)}_{i,0}\), \(\hat{\beta }^{(0)}_{i,1}\), \(\hat{\kappa }^{(0)}_{i,1}\)) at the estimates from a ZINB regression with \(\kappa _{i,1}=0\), using the R built-in function zeroinfl() from the pscl package. For \(\hat{\gamma }^{(0)}_{i}\), we initialized from a logistic regression with \(Y_i=0\) as the outcome to avoid the local maximum at the starting point [42].E step. For the tth iteration, the conditional expectation of the log-likelihood given the observed data \(Y_i\) and current parameter estimate \(\hat{\Lambda }^{(t)}\) is computed as:$$\begin{aligned} \begin{aligned} Q\left( \Lambda , \hat{\Lambda }^{(t)}\right)&= \text {E}\left\{ l_{c}(\Lambda ; Y_i, Z)|Y_i,\hat{\Lambda }^{(t)}\right\} \\&= \sum _{j=1}^{n} \{w_{ij}^{(t)} \log \pi _{ij}(\Lambda )+ \left( 1-w_{ij}^{(t)}\right) \log \left( 1-\pi _{ij}(\Lambda )\right) \\&\quad+ \left( 1-w_{i j}^{(t)}\right) \log f_{nb}\left( y_{ij};\mu _{ij}(\Lambda ),\theta _{ij}(\Lambda ) \right) \}, \end{aligned} \end{aligned}$$where$$\begin{aligned} w_{i j}^{(t)}&= \text {P}\left( z_{i j}= 1 \mid y_{ij},\hat{\Lambda }^{(t)} \right) \nonumber \\&= \frac{\pi _{i j}\left( \hat{\Lambda }^{(t)}\right) \textbf{I}\left( y_{i j}=0\right) }{\pi _{i j}\left( \hat{\Lambda }^{(t)}\right) \textbf{I}\left( y_{i j}=0\right) +\left( 1-\pi _{i j}\left( \hat{\Lambda }^{(t)}\right) \right) f_{n b}\left( y_{i j}; \mu _{i j}\left( \hat{\Lambda }^{(t)}\right) , \theta _{i j}\left( \hat{\Lambda }^{(t)}\right) \right) }. \end{aligned}$$
(5)
M Step. The parameter estimate is updated as \(\hat{\Lambda }^{(t+1)}\), maximizing the quantity \(Q\left( \Lambda , \hat{\Lambda }^{(t)}\right)\):$$\begin{aligned} \hat{\Lambda }^{(t+1)} = \arg \max _{\Lambda } Q(\Lambda , \hat{\Lambda }^{(t)}). \end{aligned}$$
(6)
These two steps are repeated until convergence is achieved:$$\begin{aligned} \frac{\left| Q\left( \Lambda , \hat{\Lambda }^{(t+1)}\right) – Q\left( \Lambda , \hat{\Lambda }^{(t)}\right) \right| }{\left| Q\left( \Lambda , \hat{\Lambda }^{(t)}\right) \right| } < \epsilon , \end{aligned}$$
(7)
where \(\epsilon\) is a small value threshold (here, \(1e-5\)).Finally, the estimated \(\hat{\mu }_{ij}\), \(\hat{\theta }_{ij}\), and \(\hat{\pi }_{ij}\) are calculated by plugging in \(\hat{\Lambda }\) to the above regression model (2).Taxa-specific normalization and diagnosis testingUsing the MLEs \((\hat{\mu }_{ij}\), \(\hat{\theta }_{ij}\), \(\hat{\pi }_{ij})\), we calculated the quantile residuals [44] as normalized taxa counts to remove the effects of sequencing depth:$$\begin{aligned} y^{\text {norm}}_{ij} = \Phi ^{-1}(F_{zinb}(y_{ij}-1;\hat{\mu }_{ij}, \hat{\theta }_{ij}, \hat{\pi }_{ij}) + u_{ij} \cdot f_{zinb}(y_{ij};\hat{\mu }_{ij}, \hat{\theta }_{ij}, \hat{\pi }_{ij})), \end{aligned}$$
(8)
where \(\Phi\) is the cumulative distribution function (CDF) of the standard normal distribution, \(F_{zinb}\) and \(f_{zinb}\) denote the CDF and PMF of the ZINB distribution, and \(u_{ij}\) is a random variable from a uniform distribution on (0, 1). Positive residuals for a given taxon in a given sample indicate greater abundance than expected given the taxon’s average abundance in the microbial community and sequencing depth while negative residuals indicate less abundance.Correctly specifying the model is critical to increase power for any analysis. Therefore, we propose two model diagnostic tests. First, we test the existence of a sequencing depth effect on taxa counts from two perspectives – mean and dispersion (the ‘prevalence’ test):$$\begin{aligned} \textbf{H}_0: \beta _{i,1}= \kappa _{i,1}=0, \forall i=1,…,p, \end{aligned}$$
(9)
where the alternative hypothesis (\(\textbf{H}_A\)) indicates that a sequencing depth effect exists for at least one taxon through one parameter. To do so, we use the likelihood ratio test (LRT). Specifically, under the null hypothesis, we fit a reduced intercept-only ZINB regression with fixed dispersion, where sequencing depth does not influence the taxa abundance. For this global test for all taxa, the likelihood ratio statistic is asymptotically, \(\chi ^2\), distributed with a degree of freedom of 2p.Additionally, two major improvements of TaxaNorm are that it assumes the effect of sequencing depth is taxon-specific (i.e., differential sequencing efficiency), and the dispersion parameter depends on sequencing depth, while most scaling methods basically assume \(\beta _{1,1}=…=1, \kappa _{1, 1}=…=0\). To test whether TaxaNorm better reflects the data than existing scaling methods, we conduct an ‘equivalence’ test with the following null hypothesis$$\begin{aligned} \textbf{H}_0: \beta _{1,1}&=…= \beta _{p,1} = 1, \nonumber \\ \kappa _{1,1}&=…= \kappa _{p,1} = 0. \end{aligned}$$
(10)
Again, we use LRT, where the MLE under the null hypothesis is estimated by restricting the equal effect of sequencing depth to be 1 on taxa abundance via the mean parameter, and fit a fixed dispersion ZINB regression. In this case, the likelihood ratio statistic is also asymptotically, \(\chi ^2\), distributed with a degree of freedom of 2p.Simulation studiesFor all simulations, we set the true values of regression parameters as those estimated from a subset of 147 stool samples in a real microbiota dataset from the human microbiome study detailed in the real-data application section, which ensured our simulated data have similar characteristics as actual case-study data. The dataset comprises 510 taxa with non-zero counts observed in more than 10 samples.For a given sample size n, we first randomly generated sequencing depth, \(X_{j} \sim \textbf{U}(a,b)\), \(j=1, 2,…, n\), where a and b are the minimum and maximum sequencing depths in the template data. Together with the coefficients estimated from the template data (\(\hat{\beta }_{i,0}, \hat{\beta }_{i,1}, \hat{\gamma }_{i}, \hat{\kappa }_{i,0}, \hat{\kappa }_{i,1}\)), we calculated the mean \(\hat{\mu }_{ij}\), dispersion \(\hat{\theta }_{ij}\), and zero mass \(\hat{\pi }_{ij}\) for each taxon across all samples based on model (2). We then generated taxa counts from ZINB distribution with these mean, dispersion, and zero mass parameters.Parameter estimation and diagnosis testsWe first assessed the performance of the proposed model diagnosis tests by considering scenarios with various coefficient settings. Specifically, we performed simulations under the two null hypothesis: 1) \(\beta _{i,1}=\kappa _{i,1}=0\) for all taxa, where sequencing depth effect does not exist for any taxa; and 2) \(\beta =1\), \(\kappa =0\), where the sequencing depth effect is the same across all taxa with fixed dispersion, to assess the type I error control for the tests proposed in (9) and (10). We conducted power analysis for prevalence tests by simulating various alternatives, where \(\beta _{i,1}=\kappa _{i,1}=0\) for a subset of taxa, while other parameters came from real data, so the sequencing depth effect exists for several taxa only. We also conducted power analysis for equivalence tests by restricting \(\beta = 1\) for a subset of taxa, while other parameter values varied. We set the sample size from 100 to 1000 for all scenarios and repeated each simulation scenario procedure 1000 times. See Supplementary Table 1 for details of the full simulation settings.DA taxa comparisonThe efficiency of data normalization can be evaluated by assessing the influence of a method on downstream analysis. We conducted additional simulations under various settings to examine the performance of our proposed normalization method in detecting DA taxa using post-normalization counts.To introduce DA taxa, we randomly assigned half of the samples to each group and manipulated the mean taxa count by multiplying the effect size. Specifically, we used a dummy variable, \(G_j = \{ \begin{array}{rcl} 0, & j \in {\textbf {group 1}} \\ 1, & j \in {\textbf {group 2}} \end{array}\), such that \(\log (\mu _{i,G_j}) = \beta _{i,0} + \beta _{i,1} X_j + \beta _{i,2} G_j\). Here, \(\beta _{i,2}\) can be regarded as the log fold-changes (i.e., \(\log (\frac{\mu _{i, j \in {\textbf {group 2}}}}{\mu _{i, j \in {\textbf {group 1}}}})=\beta _{i,2}\)). For example, we first allowed 50% of taxa to be DA, by randomly selecting 25% of all taxa with increased abundance in group 2 with the fold-changes (\(\exp (\beta _{i,2})\)) generated from \(\textbf{U}(4, 10)\), and another 25% of randomly selected taxa with decreased abundance in group 2 (i.e., \(\frac{1}{\exp (\beta _{i,2})} \sim \textbf{U}(4, 10)\)). The remaining 50% of taxa were not DA (i.e., \(\beta _{i,2}=0\)). We also investigated scenarios with varying percentages of DA taxa, namely 10% (5% increase and 5% decrease) and 20% (10% increase and 10% decrease), as well as smaller fold changes generated from \(\textbf{U}(2, 4)\). We varied sample sizes from 100 to 1000. The detailed simulation scenarios are outlined in Supplementary Table 2.Next, we simulated a scenario without DA taxa, where the observed difference in counts was completely due to sequencing depth. For this scenario, we generated the sequencing depth of the two groups separately at different levels. The sequencing depth for samples in the second group was, on average, three times greater than for the first group (\(X_{j} \sim \textbf{U}(3a,3b), j \in {\textbf {group 2}}\)). We forced all taxa in both groups to have identical model coefficients (i.e., \(\beta _{i, 2}=0\)).Finally, we consider performance when taxon-specific sequencing depth effects do not exist. We limited the sample size for each group to 500 with 50% DA taxa.We normalized the simulated raw counts with TaxaNorm and several other methods, namely ANCOM-BC, TSS, TMM, CSS, and Wrench (Supplementary Table 3). We conducted DA testing with the post-normalized counts for each taxon using the Wilcoxon rank-sum test, adjusted p-values using the Benjamini-Hochberg method [45] for false discovery rate (FDR) control at 0.05, and compared performance in terms of power and FDR for detecting true DA taxa. We repeated each simulation scenario procedure 1000 times.

Hot Topics

Related Articles