Shrinkage estimation of gene interaction networks in single-cell RNA sequencing data

Overview

Graphical models are graphs in which nodes are random variables and edges represent conditional dependence, or "direct links", between them [17]. Two random variables X and Y are conditionally independent given a variable Z if their conditional probability satisfies:
$$\begin{aligned} P(X,Y|Z) = P(X|Z)P(Y|Z) \end{aligned}$$
(1)
Gaussian Graphical Models (GGMs) are a popular form of undirected graphical model in which a random vector X follows a multivariate Gaussian distribution \({\mathcal {N}}(0,\Sigma )\). The covariance matrix \(\Sigma\) is unknown and nonsingular [17]. A GGM is represented by a partial correlation matrix P, as zero values in P correspond to conditional independence between variables [18]. Partial correlation coefficients measure the linear relationship between a pair of random variables (i and j) after correcting for the effects of all other variables (conditional dependence) [19]. The partial correlation coefficient \(\rho _{ij}\) can be calculated from \(\Omega\), the inverse of the covariance matrix \(\Sigma\):
$$\begin{aligned} \rho _{ij} = -\frac{\Omega _{ij}}{\sqrt{\Omega _{ii}}\sqrt{\Omega _{jj}}}. \end{aligned}$$
(2)
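To make Eq. (2) concrete, the following is a minimal R sketch, assuming more samples than features so that the sample covariance matrix is invertible; the data and names are illustrative:

```r
# Partial correlations from the precision matrix (Eq. 2).
set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200)   # n = 200 samples, p = 5 features
S <- cov(X)                               # sample covariance (invertible: n > p)
Omega <- solve(S)                         # precision (inverse covariance) matrix
P <- -cov2cor(Omega)                      # rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj)
diag(P) <- 1                              # convention for the diagonal
```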
In practice, the most common covariance estimator is the sample covariance matrix. However, when the number of features (p) exceeds the number of samples (n), the sample covariance matrix becomes singular and non-invertible [20]. In this situation, shrinkage techniques are often adopted to generate an invertible estimated covariance matrix for calculating the partial correlation matrix.

There are two common approaches for estimating sparse partial correlation matrices: \(L_1\)/lasso regularization in a maximum likelihood framework (Lasso-type), and linear shrinkage of the sample covariance matrix towards a structured target matrix (Stein-type). In Lasso-type shrinkage, let S be the sample covariance matrix and \(\Theta\) the estimated inverse covariance matrix; the penalized log-likelihood function to maximize is then [21]:
$$\begin{aligned} {\hat{\Theta }} = \underset{\Theta \in \mathbb {R}^{p \times p}}{\mathrm{argmax}} \{\log (\det (\Theta )) - \mathrm{tr}(S\Theta ) - \lambda \parallel \Theta \parallel _1\} \end{aligned}$$
(3)
where \(\lambda\) is the amount of shrinkage applied, specified by the user.

The choice of the regularization parameter \(\lambda\) is critical in Lasso-type shrinkage, as different \(\lambda\) values lead to different levels of sparsity in the resulting graphs, and therefore to different scientific conclusions. Multiple methods and algorithms have been developed to choose \(\lambda\), including the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) as standard methods for low-dimensional data. For high-dimensional data, where the number of features is large relative to the sample size, the stability approach to regularization selection (StARS) or the rotation information criterion (RIC) can be used [22, 23] (both estimation routes are sketched in code after Eq. (4)).

Stein-type shrinkage, on the other hand, comprises three components: an empirical covariance matrix S, a shrinkage target matrix T and a shrinkage constant \(\alpha\). The estimated covariance matrix is computed through a linear combination [15]:
$$\begin{aligned} {\hat{\Sigma }} = \alpha T + (1 - \alpha ) S \end{aligned}$$
(4)
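As referenced above, both estimation routes can be sketched briefly in R. The huge calls below are documented package functions; in the Stein-type part, \(\alpha\) is fixed by hand purely for illustration, whereas in practice it is derived analytically (Table 1):

```r
library(huge)

set.seed(1)
n <- 30; p <- 100
X <- matrix(rnorm(n * p), nrow = n)      # p > n: sample covariance is singular

# Lasso-type: graphical lasso over a lambda path, lambda selected by StARS.
fit <- huge(X, method = "glasso")
sel <- huge.select(fit, criterion = "stars")
adj <- sel$refit                         # adjacency matrix of the selected graph

# Stein-type: linear shrinkage towards the identity target (Eq. 4).
S <- cov(X)
alpha <- 0.2                             # illustrative; derived analytically in practice
Sigma_hat <- alpha * diag(p) + (1 - alpha) * S
Omega_hat <- solve(Sigma_hat)            # invertible despite p > n
```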
The identity matrix is often chosen as the target matrix; since it is non-zero only on the diagonal, this shrinks the off-diagonal covariance coefficients towards zero.

As in Lasso-type shrinkage, the shrinkage intensity \(\alpha\) controls the strength of shrinkage. However, the value of \(\alpha\) is not chosen by the user or selected by a grid search over possible values, but is derived directly by minimising a loss function measuring the distance between the estimated covariance matrix and the population covariance matrix for a given value of \(\alpha\) [24]. The equations for the optimal values of \(\alpha\) are given in Table 1.

In this study, Lasso-type shrinkage is implemented with the graphical lasso (glasso) [25] and Meinshausen–Buhlmann (mb) [26] algorithms, using the huge R package [27]. Stein-type shrinkage is applied using the identity matrix as the target matrix, with the shrinkage intensity \(\alpha\) calculated following the studies [16] and [28], referred to as the GeneNet and Fisher2011 algorithms, respectively (Table 1, Supplementary Figures S1, S2, S3 & S4).

Table 1 Shrinkage intensity formulae for Stein-type shrinkage

Data transformation and p-value calculation for Stein-type shrinkage workflows

In Gaussian graphical models, normality is a standard assumption about the distribution of the data [29]. However, RNAseq data have specific properties, such as skewness, extreme values and mean-variance dependency, which deviate from the normality assumption [30, 31]. Furthermore, scRNAseq data can deviate from a normal distribution for additional reasons, such as mixtures of cell subpopulations or ambient contamination. To alleviate these effects during scRNAseq data analysis, methods such as cell clustering [32], which separates cells into roughly homogeneous groups to be analysed individually, and cell outlier detection can be applied in a pre-processing step to ensure the data are amenable to graphical modelling techniques.

To improve the performance of shrinkage methods, data transformation approaches are recommended [30]. Three data transformation methods are compared in terms of how well they enhance the performance of Stein-type shrinkage (all three are sketched in code below). We consider the log transformation \(\log (count+1)\), which is widely used in biological data analysis to make skewed distributions closer to normal [30]. We also apply a semiparametric Gaussian copula, or nonparanormal, transformation, which was developed for high-dimensional graphical modelling and uses a Gaussian copula with Winsorized truncation [33]; it is implemented in the huge.npn function from the huge R package [27]. Lastly, we consider the empirical copula \(\Phi ^{-1}(\,ECDF(\,count)\,)\,\), which is similar to the nonparanormal approach but applies a Gaussian copula based on the empirical cumulative distribution of the data without truncation.

Unlike Lasso-type shrinkage methods, Stein-type shrinkage returns an estimated partial correlation matrix that requires significance testing to identify edges and generate a final graph. Without significance testing there is no way to identify the non-zero entries of the estimated partial correlation matrix, other than using heuristics based on the resulting graph structure or setting arbitrary thresholds, both of which could lead to large numbers of false positive predictions, especially for noisy data.
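The three transformations compared above can be sketched compactly in R. This is a minimal illustration: `counts` is an assumed name for a cells × genes UMI count matrix, huge.npn is the documented package function, and the empirical-copula line follows the \(\Phi ^{-1}(ECDF(\cdot ))\) definition, with ranks divided by \(n+1\) so that qnorm never receives 0 or 1:

```r
library(huge)

# counts: cells x genes matrix of UMI counts (illustrative name)
log_data <- log(counts + 1)              # log transformation
npn_data <- huge.npn(counts)             # nonparanormal (Winsorized truncation)

# Empirical copula: Gaussian quantiles of the per-gene empirical CDF.
n <- nrow(counts)
ecdf_data <- apply(counts, 2, function(g) qnorm(rank(g) / (n + 1)))
```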
Generally, partial correlation coefficients are treated as Pearson's correlation coefficients and tested using t-statistics [34, 35]; this is the approach taken in the GeneNet R package for Stein-type shrinkage. To take into account the effect of shrinkage and its intensity on the partial correlation matrix, Bernal et al. proposed two new probability densities for the significance test, referred to as shrunkv1 and shrunkv2 in this study [19, 35].

Modifications in shrinkage frameworks for zero-inflated counts

To fully exploit the potential of scRNAseq data, one main challenge to overcome is the presence of excessive zeros in the count data due to dropout events. Dropout events in scRNAseq data are defined as the situation where the expression of a gene is detected in some cells but absent in other cells of the same cell type [9]. Therefore, to reduce estimation errors in the sample covariance matrix and to stratify biological zeros from dropout zero counts, we propose a zero-inflated z-score calculation based on zero-inflated negative binomial (ZINB) modelling of the count data within the shrinkage framework. The process is illustrated in Supplementary Figure S5.

In the first step, a ZINB model is fitted to the expression counts of each gene j in cell i (\(Y_{ij}\)) using the L-BFGS-B algorithm via the optim R function, to derive maximum likelihood estimates of the model parameters:
$$\begin{aligned} Y_{ij} \sim w_j\delta + (1-w_j)NegativeBinomial(\mu s_i,r) \end{aligned}$$
(5)
where \(\mu\) is the mean, s is a vector of constant scaling factors (see Supplementary Figure S5) for each cell, and r is the size parameter of the negative binomial distribution. The parameter \(w_j\) balances the negative binomial component, which accounts for true expression counts, against the delta function. Importantly, the mean of the negative binomial distribution in our ZINB modelling step is the product of the parameter \(\mu\) and the scaling factor \(s_i\). Scaling factors are commonly used in the normalization of sequencing data to account for differences in sequencing depth between samples; they are calculated here using the scran R package [36].

For the count of gene j in cell i, the delta function returns 0 for all non-zero counts:
$$\begin{aligned} \delta (y_{ij}) = {\left\{ \begin{array}{ll} 0 & \hbox { if } y_{ij} \ne 0 \\ 1 & \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)
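As an illustration of this fitting step, the following is a minimal R sketch under stated assumptions: the data are simulated, and the negative log-likelihood and starting values are illustrative rather than the exact implementation:

```r
# Hedged sketch of fitting the ZINB mixture in Eq. (5) for a single gene.
set.seed(1)
s <- runif(500, 0.5, 2)                  # illustrative per-cell scaling factors
y <- rnbinom(500, size = 2, mu = 3 * s)  # simulated counts for one gene
y[sample(500, 100)] <- 0                 # inject extra, dropout-like zeros

# Negative log-likelihood: mixture of a point mass at zero and a negative
# binomial whose mean is scaled per cell (mu * s_i).
zinb_negloglik <- function(par, y, s) {
  w <- par[1]; mu <- par[2]; r <- par[3]
  nb <- dnbinom(y, size = r, mu = mu * s)
  -sum(log(w * (y == 0) + (1 - w) * nb))
}

# Box-constrained maximum likelihood via L-BFGS-B, as described in the text.
fit <- optim(par = c(w = 0.1, mu = mean(y / s), r = 1),
             fn = zinb_negloglik, y = y, s = s,
             method = "L-BFGS-B",
             lower = c(1e-6, 1e-6, 1e-6),
             upper = c(1 - 1e-6, Inf, Inf))
```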
Next, a non-detection rate (\(d_{ij}\)) is calculated from the optimized ZINB model of each gene. For each count of gene j in cell i:
$$\begin{aligned} d_{ij} = \frac{w_j\delta (y_{ij})}{w_j\delta (y_{ij}) + (1 - w_j)NB(y_{ij};\mu ,r)} \end{aligned}$$
(7)
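Continuing the sketch above, the non-detection rate of Eq. (7) is the posterior probability that a count arises from the dropout component. The lines below reuse the illustrative fitted parameters; we keep the scaled mean \(\mu s_i\) from Eq. (5) in the negative binomial density:

```r
# Non-detection rate (Eq. 7): posterior weight of the dropout component.
w <- fit$par["w"]; mu <- fit$par["mu"]; r <- fit$par["r"]
d <- w * (y == 0) /
     (w * (y == 0) + (1 - w) * dnbinom(y, size = r, mu = mu * s))
```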
To identify zero counts in the data resulting from dropout events, a threshold t with default value 0.5 is applied to \(d_{ij}\). After the potential dropout zeros are identified, a nonparanormal transformation is applied to the count data [33].

Zero-inflated z-scores are then calculated, in which the z-scores of zero counts identified as dropout events are set to 0. For the nonparanormal-transformed data X of gene j in cell i:
$$\begin{aligned} Z_{ij} = {\left\{ \begin{array}{ll} \frac{X_{ij} - {\bar{X}}_j}{\sigma _j} & d_{ij} < t \\ 0 & d_{ij} \ge t \end{array}\right. } \end{aligned}$$
(8)
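A minimal sketch of this masking step, continuing the sketches above: `X` is assumed to be the nonparanormal-transformed cells × genes matrix and `D` the matching matrix of non-detection rates (one column per gene, assembled from the per-gene `d` vectors):

```r
# Zero-inflated z-scores (Eq. 8).
t <- 0.5
Z <- scale(X)        # per-gene z-scores: (X_ij - mean_j) / sd_j
Z[D >= t] <- 0       # set scores of likely dropout zeros to 0
```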
Our proposed methodology, referred to below as ZIGeneNet, fits the ZINB mixture model given above to each gene, and then uses the nonparanormal transformation of the data together with the derived non-detection rates to calculate the zero-inflated z-scores \(Z_{ij}\) of each count. These z-scores are then used to compute the empirical covariance matrix S of a Stein-type shrinkage approach. The shrinkage coefficient \(\alpha\) is chosen using the methodology of [16]. Significant edges are detected using correlation test statistics from the fdrtool R package [34]. The workflow is outlined in Fig. 1.

Fig. 1 Zero-inflated covariance matrix shrinkage workflow. To account for the presence of excessive zero counts in scRNAseq data, UMI count data are transformed into zero-inflated z-scores for partial correlation matrix estimation. The zero-inflated z-score calculation step can be integrated into all shrinkage workflows: a zero-inflated negative binomial model is fitted to stratify "biological" zero counts from dropout events. After the fitting, a non-detection rate (\(d_{ij}\)) is estimated for each gene in each cell. Thresholding (default t=0.5) is applied, in which counts with \(d_{ij}\ge t\) are considered missing or zero-inflated values. Z-scores are calculated for all counts and the scores of zero-inflated values are set to 0.

Performance evaluation metrics

As gene regulatory networks are sparse relative to the total possible number of edges, the task of gene regulatory network inference has a large class imbalance, with the negative class (non-edges) being much larger than the positive class (edges). For the evaluation of the network inference schemes on simulated data we therefore use the Matthews correlation coefficient (MCC), a contingency-matrix method that calculates the Pearson product-moment correlation coefficient between predicted and reference values [37]. The MCC takes the class imbalance into account and so gives a better measure of performance than metrics such as accuracy or ROC curves. It is calculated as follows:
$$\begin{aligned} MCC = \frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \end{aligned}$$
(9)
where TP is the count of true positive predictions, TN the count of true negatives, FP the count of false positives, and FN the count of false negatives. The MCC ranges from \(-1\) to 1, corresponding to the worst and to perfect classification, respectively. The MCC is commonly applied when evaluating binary classification on imbalanced data [37].

When benchmarking gene network inference methods it is common to rely only on synthetic data, as there is almost never a known ground-truth network against which to benchmark networks inferred from experimental data. However, to illustrate that the approach we propose is able to predict edges that have some supporting experimental evidence, we report the numbers of predicted edges that exist in the databases listed below, both as raw counts and as positive predictive value (PPV).

In the experimental data analysis, the gene interaction databases IntAct, STRING and BioGRID are used as references [38,39,40]. For gene regulatory databases, information on interactions between transcription factors (TFs) and their targets is collected from PomBase (Schizosaccharomyces pombe) [41], YEASTRACT (Saccharomyces cerevisiae) [42], and TFLink (Mus musculus) [43]. An integrated database of TFs and their targets for Plasmodium falciparum is unavailable, so this organism is not included in the experimental data analysis step.

It is important to note, however, that the interactions in these databases are collected from multiple different experimental conditions that are unlikely to occur simultaneously. For instance, transcription factor interactions that initiate the transcription of stress-response genes cannot be expected to be present in cells cultured in growth medium with no stress factors. Therefore, as it is not possible to reliably identify false negative edge predictions in this scenario, we only consider precision, or PPV. This measures the fraction of predicted edges in the inferred network that have corresponding experimental evidence of their existence, and does not consider false negatives. PPV is calculated as follows:
$$\begin{aligned} PPV = \frac{TP}{TP+FP} \end{aligned}$$
(10)
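For completeness, both metrics are simple functions of the confusion-matrix counts; the example values below are illustrative only:

```r
# MCC (Eq. 9) and PPV (Eq. 10) from confusion-matrix counts.
mcc <- function(TP, TN, FP, FN) {
  num <- TP * TN - FP * FN
  den <- sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  if (den == 0) return(0)   # common convention when any marginal count is zero
  num / den
}
ppv <- function(TP, FP) TP / (TP + FP)

# Illustrative counts for a sparse-network prediction:
mcc(TP = 40, TN = 900, FP = 30, FN = 30)
ppv(TP = 40, FP = 30)
```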
