SAE-Impute: imputation for single-cell data via subspace regression and auto-encoders

The overview of the SAE-Impute algorithm

In recent years, the exponential growth of biological data has driven the adoption of deep learning in biology and biomedicine. Subspace regression has proven effective at preserving the correlation structure of single-cell data during clustering, and the autoencoder is widely regarded as a powerful tool for learning the intricate structure of data for reconstruction [34]. Building on these insights, we introduce SAE-Impute, a model that combines a subspace regression model with an autoencoder. To preserve the correlations between cells, SAE-Impute first uses the subspace regression model to predict potential dropout values; an autoencoder constrained by these predictions then imputes the dropout events in the scRNA-seq data (Fig. 8). Our methodology involves (A) acquiring and organizing single-cell sequencing data, (B) making predictions with subspace regression models, (C) identifying dropout values and imputing them with an autoencoder model, and (D) generating the imputation results and comprehensively comparing, evaluating, and analyzing the experiments. A detailed description of the proposed method is given below.

Fig. 8 Overview of SAE-Impute. A Acquiring and organizing single-cell sequencing data. B Making predictions with subspace regression models. C Identifying dropout values and imputing them with an autoencoder model. D Generating the imputation results and comprehensively comparing, evaluating, and analyzing the experiments

Data processing

To generate a reference dataset from real scRNA-seq data, we first selected high-quality cells and highly expressed genes from the original dataset and treated them as the true expression matrix \(I_R\). We then generated the observed dataset \(I_O\) by downsampling, drawing counts from a Poisson distribution with a given mean parameter.
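As an illustration of this downsampling step, the following minimal sketch assumes each observed count is drawn from a Poisson distribution whose mean is the true count scaled by a downsampling rate; the rate value and matrix sizes are placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample_counts(I_R, rate=0.1):
    """Simulate dropouts by Poisson downsampling a reference matrix I_R.

    Each observed count is drawn from Poisson(rate * true_count), so lowly
    expressed genes are more likely to become zeros (dropouts). The rate is
    a hypothetical downsampling efficiency, not a value from the paper.
    """
    return rng.poisson(rate * I_R)

# toy reference matrix: genes x cells
I_R = rng.integers(0, 50, size=(100, 30))
I_O = downsample_counts(I_R, rate=0.1)
print("zero fraction before/after:", np.mean(I_R == 0), np.mean(I_O == 0))
```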
The specific details of the data collection process are as follows. For the human islet data from Barron et al. [25], we filtered out genes with mean expression less than 0.001 and non-zero expression in fewer than three cells. From the filtered dataset of 14,729 genes and 1,937 cells, we then selected genes with non-zero expression in 25% of cells and cells with a library size greater than 5,000, resulting in 2,284 genes and 1,076 cells. For the mouse hypothalamus data from Chen et al. [26], we filtered out cells with a library size greater than 15,000, as well as genes with mean expression less than 0.0002 and non-zero expression in fewer than five cells. We then selected genes with non-zero expression in 20% of cells and cells with a library size greater than 2,000, resulting in 2,159 genes and 7,712 cells. For the human ventral midbrain data from Ramanno et al. [27], we filtered out genes with mean expression less than 0.001 and non-zero expression in fewer than three cells. We then selected genes with non-zero expression in 30% of cells and cells with a library size greater than 5,000, resulting in 2,059 genes and 947 cells. For the Zeisel et al. [28] mouse cortex and hippocampus data, we selected genes with non-zero expression in 40% of cells and cells with a library size greater than 10,000 UMIs, resulting in 3,529 genes and 1,800 cells.

Subspace regression models to make predictions

This section presents the subspace regression model used as the prior model. First, we determine whether each zero value is the result of a dropout. For a given zero-valued entry, let \(p_1\) and \(p_2\) denote the probability of observing a zero value in the corresponding gene and cell, respectively. The numbers of zero values in the gene and in the cell are binomially distributed as \(X\sim Bin(n,p_1)\) and \(Y\sim Bin(m,p_2)\), where n is the number of gene measurements and m is the number of cell measurements; for zero values, \(p = p_1 = p_2\). If X and Y are independent, then \(X+Y\sim Bin(n+m,p)\), and the conditional distribution of X given \(X+Y=r\), \(P(X=x\mid X+Y=r)\), is hypergeometric, where x is the number of zero values observed in the gene and r is the total number of zero values observed in the selected gene and cell pair. The probability function of the hypergeometric distribution can be expressed as follows:$$\begin{aligned} P(X=x-1\mid X+Y=r-1)=\frac{\binom{n-1}{x-1} \binom{m}{r-x}}{\binom{n+m-1}{r-1}} \end{aligned}$$
(3)
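As a concrete illustration, the probability in Eq. (3) and the over- and under-representation tests described in the next paragraph can be evaluated with SciPy. The counts in this minimal sketch are hypothetical, and it is not the authors' implementation.

```python
from scipy.stats import hypergeom

def dropout_tests(x, r, n, m, alpha=0.01):
    """Hypergeometric tests for one zero-valued entry, following Eq. (3).

    x: number of zero values observed in the gene
    r: total number of zero values in the selected gene/cell pair
    n: number of gene measurements, m: number of cell measurements
    Returns the over- and under-representation p-values and whether each
    is significant at the given threshold.
    """
    # Eq. (3): X | X+Y = r-1 follows Hypergeom(population n+m-1,
    #          successes n-1, draws r-1), evaluated at k = x-1
    dist = hypergeom(n + m - 1, n - 1, r - 1)
    k = x - 1
    over_p = dist.sf(k - 1)   # P(X >= x-1): more zeros than expected -> implausible (dropout candidate)
    under_p = dist.cdf(k)     # P(X <= x-1): fewer zeros than expected -> reliable entry
    return over_p, under_p, over_p < alpha, under_p < alpha

# hypothetical counts for a single zero-valued entry
print(dropout_tests(x=40, r=60, n=100, m=80))
```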
Note that Eq. (3) accounts for the overlapping entry shared by X and Y in each gene and cell pair. To address this, we adopt the following strategy: (i) we use \((n+m-1)\) instead of \((n+m)\) as the total number of observations in the selected gene and cell pair, (ii) we use \((n-1)\) instead of n as the number of gene measurements, and (iii) we use \((x-1)\) instead of x as the number of zero values observed in the gene, because the overlapping entry does not contribute to the hypergeometric probability calculation. We then compute a p-value for each zero value and perform two tests, an under-representation analysis and an over-representation analysis, with a significance threshold of 0.01. Entries with significant p-values in the over-representation analysis are considered implausible (dropout candidates), whereas entries with significant p-values in the under-representation analysis are considered reliable. Entries that fall into neither category are disregarded: they are neither imputed nor used for imputation. Based on this hypothesis-testing process, we obtain a set of genes that can be used for training (the training set T) and a set of genes that need to be imputed (the imputable set). A gene is assigned to the training set if all of its entries are reliable, and to the imputable set if at least one of its entries requires imputation.

To infer dropout values accurately, it is crucial to use related genes with similar expression patterns. This module identifies subspaces of genes within the training data that share similar patterns, and a generalized linear regression model is then fitted on each gene subspace to estimate dropout values in groups. To assign a gene from the imputable set \(g\in I_O\) to a subspace, we compute the Euclidean distance \(d(A,B)=\sqrt{\sum _{i=1}^N (x_i - y_i)^2}\) from that gene to the centroid of each gene subspace and assign the gene to the closest subspace (smallest distance). To estimate the dropout values in gene g, we train a generalized linear model using only highly correlated genes within the assigned subspace in T. The regression process consists of two steps: selecting highly correlated genes from the training set and training a linear model on these genes to estimate the dropout values. To prevent genes with high expression values from dominating the regression, we always apply a logarithmic transformation (base 2) to rescale the data to an acceptable range ([0, 100] by default). This yields the prediction matrix \(I_P\) for the observed matrix \(I_O\).
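A minimal sketch of this subspace assignment and regression step is shown below. The function interface, the number of correlated genes retained, and the use of scikit-learn's LinearRegression are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def impute_gene(gene_expr, subspace_centroids, subspace_members, top_k=20):
    """Assign an imputable gene to the nearest subspace by Euclidean distance,
    then regress it on its most correlated training genes to predict dropouts.

    gene_expr:          log2-scaled expression of the imputable gene, shape (cells,)
    subspace_centroids: list of centroid vectors, one per gene subspace
    subspace_members:   list of training-gene matrices (genes x cells), one per subspace
    """
    # 1. nearest subspace by Euclidean distance to each centroid
    dists = [np.linalg.norm(gene_expr - c) for c in subspace_centroids]
    train_genes = subspace_members[int(np.argmin(dists))]

    # 2. keep the training genes most correlated with the target gene
    observed = gene_expr > 0                     # non-dropout cells
    corr = [abs(np.corrcoef(g[observed], gene_expr[observed])[0, 1])
            for g in train_genes]
    top = np.argsort(corr)[::-1][:top_k]
    X = train_genes[top].T                       # (cells, top_k)

    # 3. fit a linear model on observed cells, predict the dropout cells
    model = LinearRegression().fit(X[observed], gene_expr[observed])
    imputed = gene_expr.copy()
    imputed[~observed] = model.predict(X[~observed])
    return imputed
```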
Autoencoder models

In recent years, the accuracy of collaborative filtering has improved significantly with the emergence of representation learning, and autoencoder-based models have played a prominent role in this improvement [35,36,37]. Unlike matrix factorization or nuclear norm minimization techniques, autoencoders require estimating only \(2 \times m \times r\) independent parameters. This reduction in the number of parameters is advantageous in data-constrained scenarios like ours, where models are susceptible to overfitting: fewer parameters decrease the model's propensity for overfitting and enhance its generalization, improving overall performance. As a self-supervised learning approach, autoencoders inherently learn the structure of the data, making them well suited to the analysis of single-cell sequencing data.

An autoencoder consists of two main components: an encoder E and a decoder D. The encoder transforms the input matrix X into a latent representation \(H=\phi (EX)\), where \(\phi \) is the activation function. The decoder D then maps the latent representation back to the input space, yielding the reconstruction \(\hat{X}=DH=D\phi (EX)\). During training, the encoder and decoder are jointly optimized to minimize the Euclidean cost function \(\mathop {\hbox {argmin}}\limits _{D,E}\parallel X-D\phi (EX)\parallel _F^2 \).

In our approach, we exploit the analogy with collaborative filtering by using the observed matrix \(I_O\) and the matrix \(I_P\) predicted by the subspace regression model as the input data. Both matrices are mapped to the latent space H while the encoder and decoder are trained. The effectiveness of the autoencoder also depends on the model architecture and the hyperparameters, in particular the regularization coefficient \(\lambda \), and on how the output of the subspace regression model is integrated into the autoencoder framework; these choices are described below. Ultimately, we aim to regenerate the estimated expression matrix \(I_R\) by minimizing the following cost function:$$\begin{aligned} \mathop {\hbox {argmin}}\limits _{E,D}\parallel R-D\sigma (E(R))\parallel _O^2 + \frac{\lambda }{2} \left( \parallel E\parallel _F^2 + \parallel D\parallel _F^2 \right) \end{aligned}$$
(4)
Here, R is the Hadamard product \(R=M\circ X\), where M is a binary mask indicating the observed entries; E and D are the encoder and decoder weight matrices, and \(\lambda \) is the regularization coefficient. The subscript O indicates that the loss is computed only over the non-zero entries of the sparse expression matrix \(M\circ X\). The sigmoid activation function applied by the encoder layer is denoted by \(\sigma \). To avoid overfitting to the non-zero values of the count matrix, the encoder and decoder matrices are regularized during training. After training, the learned matrices are used to estimate the expression matrix according to Eq. (5). The estimated matrix \(\tilde{X}\) contains predicted count values for all positions in the matrix:$$\begin{aligned} \tilde{X} =D\sigma (E(R)) \end{aligned}$$
(5)
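For concreteness, a minimal PyTorch sketch of an autoencoder with this masked, regularized objective (Eqs. 4-5) is shown below. The latent dimension, the regularization coefficient, and the averaging over observed entries are illustrative choices rather than values reported in the paper.

```python
import torch
import torch.nn as nn

class SAEAutoencoder(nn.Module):
    """Single-hidden-layer autoencoder sketch corresponding to Eqs. (4)-(5).

    The layer sizes are illustrative, not values from the paper.
    """
    def __init__(self, n_genes, latent_dim=128):
        super().__init__()
        self.encoder = nn.Linear(n_genes, latent_dim)   # E
        self.decoder = nn.Linear(latent_dim, n_genes)   # D

    def forward(self, r):
        h = torch.sigmoid(self.encoder(r))              # sigma(E(R))
        return self.decoder(h)                          # X_tilde = D sigma(E(R))

def masked_loss(model, x, mask, lam=1e-3):
    """Objective of Eq. (4): reconstruction error over observed (non-zero)
    entries only, plus Frobenius regularization of E and D."""
    r = mask * x                                        # R = M o X
    x_tilde = model(r)
    recon = ((mask * (x_tilde - r)) ** 2).sum() / mask.sum()
    reg = (lam / 2) * (model.encoder.weight.norm() ** 2
                       + model.decoder.weight.norm() ** 2)
    return recon + reg
```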
The raw gene expression matrix undergoes a series of preprocessing steps, including filtering of low-quality genes, normalization by library size, trimming by gene selection, and log transformation. The resulting processed matrix and the prediction matrix obtained from the subspace regression model are then input into the AutoImpute model. The SAE-Impute model comprises a fully connected multi-layer perceptron with three layers: an input layer, a hidden layer, and an output layer. The model leverages imputation weight labels, built from the values predicted by the subspace regression model \(I_P\), to improve the filling of missing values. Gradients are computed by backpropagation of errors, and the model is trained with gradient descent to minimize the cost function.
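The following sketch illustrates one plausible way to wire these pieces together, reusing the SAEAutoencoder and masked_loss sketches above: dropout positions are pre-filled with the subspace regression predictions \(I_P\) and included in the loss mask. The variable names, the way \(I_P\) constrains the loss, and the optimizer settings are assumptions for illustration; the paper does not specify these details.

```python
import torch

def train_sae_impute(I_O_proc, I_P, epochs=500, lr=1e-3, lam=1e-3):
    """Hypothetical training loop: I_O_proc is the preprocessed observed matrix
    and I_P the subspace-regression predictions, both (cells x genes) tensors."""
    mask_obs = (I_O_proc > 0).float()                    # observed entries of I_O
    mask_pred = ((I_O_proc == 0) & (I_P > 0)).float()    # dropouts with a prior prediction
    x = I_O_proc + mask_pred * I_P                       # fill dropouts with I_P values
    mask = mask_obs + mask_pred                          # entries constraining the loss

    model = SAEAutoencoder(n_genes=x.shape[1])
    opt = torch.optim.SGD(model.parameters(), lr=lr)     # plain gradient descent
    for _ in range(epochs):
        opt.zero_grad()
        loss = masked_loss(model, x, mask, lam=lam)
        loss.backward()                                  # backpropagation of errors
        opt.step()

    with torch.no_grad():
        imputed = model(mask * x)                        # X_tilde for all entries
    # keep observed values, replace the rest with the autoencoder estimates
    return torch.where(mask_obs.bool(), I_O_proc, imputed)
```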
