Inference of single-cell network using mutual information for scRNA-seq data analysis | BMC Bioinformatics

Construction of single-cell network using mutual information (SINUM)To determine the gene‒gene association of genes X and Y in a specific cell, we applied MI to quantify the mutual dependency between their expression values, \(G_{X}\) and \(G_{Y}\). MI score for \({ }G_{X}\) and \(G_{Y}\) is defined as$$I\left( {G_{X} ;G_{Y} } \right) = H\left( {G_{X} } \right) + H\left( {G_{Y} } \right) – H\left( {G_{X} ,G_{Y} } \right)$$
(1)
where \(H\left( {G_{X} } \right)\), \(H\left( {G_{Y} } \right)\), and \(H\left( {G_{X} ,G_{Y} } \right)\) are the entropy of \(G_{X}\), entropy of \(G_{Y}\), and joint entropy of \(G_{X}\) and \(G_{Y}\), respectively. In information theory, MI as a similarity metric provides symmetric, i.e., \(I\left( {G_{X} ;G_{Y} } \right) = I\left( {G_{Y} ;G_{X} } \right)\), and non-negative, i.e., \(I\left( {G_{X} ;G_{Y} } \right) \ge 0\), measurement to determine the statistical dependency between two random variables (i.e., two genes in this work) [27]. In other words, MI quantifies how much a random variable reveals the other and could be interpreted as reducing uncertainty about one when given the knowledge of the other. When assessing the dependency, the higher the MI measure for a given gene pair, the stronger the coordinated expression between these two genes [18].To estimate MI for determining an edge between two genes on scRNA-seq data, we first generated a scatter diagram for every two genes from the GEM, i.e., m genes lead to \(m\left( {m – 1} \right)/2\) scatter diagrams, where each data point denotes a cell; x- and y-axes represent the expression values of these two genes in the n cells (Fig. 1A). The scatter diagram was further equally split into \(GR\) grids according to the total number of cells (here is n) and the distribution of cells; in other words, the minimum and maximum expression values of each two genes in the n cells were used to determine the overall boundaries and adjust the grid size. Thus, \(GR\) is the number of grids and given as$$GR = \lfloor\sqrt n + \frac{1}{2}\rfloor$$
(2)
Fig. 1Overview of SINUM (SIngle-cell Network Using Mutual information) method. A Generation of scatter diagrams for every two genes from the gene expression matrix (GEM), where the x- and y-axes are the expression values of every two genes within the n cells. Each point denotes a cell. B The statistical model of SINUM for estimating the association between genes X and Y. First, each scatter diagram containing n cells was split into \(\lfloor\sqrt n + \frac{1}{2}\rfloor\) grids. Next, SINUM produces two boxes \(G_{X}^{\left( c \right)}\) (light blue) and \(G_{Y}^{\left( c \right)}\) (medium blue) close to cell c to represent its neighborhoods of expression values for genes X and Y, respectively; thus, the intersection region can directly produce the third box \(G_{XY}^{\left( c \right)}\) (dark blue). The entropies \(H\left( {G_{X} } \right)^{\left( c \right)}\), \(H\left( {G_{Y} } \right)^{\left( c \right)}\), and \(H\left( {G_{X} ,G_{Y} } \right)^{\left( c \right)}\) of the boxes \(G_{X}^{\left( c \right)}\), \(G_{Y}^{\left( c \right)}\), and \(G_{XY}^{\left( c \right)}\), respectively, were then evaluated for calculating mutual information, \(I\left( {G_{X} ;G_{Y} } \right)^{\left( c \right)}\). The mutual information is used to evaluate whether any given two genes X and Y, are a dependent or independent gene pair in cell c among all cells. If the value of \(I\left( {G_{X} ;G_{Y} } \right)^{\left( c \right)}\) is larger than the threshold, it suggests that genes X and Y are dependent on each other in cell c and will be represented by an edge in a network. C Construction of n single-cell networks (SCNs) for n cells. For m genes, a total of \(m\left( {m – 1} \right)/2\) scatter diagrams were generated to measure all possible associations between two genes. In each SCN, the red solid line represents that there’s an edge between two genes for a specific cell inferred by our SINUM method; otherwise, there’s no edge. D Generation of degree matrix (DM) by counting the number of edges connected to every gene in each SCN. Note that the size of DM is the same as GEM with m rows and n columnsIf the cells are uniformly randomly distributed in the scatter diagram for two genes, we can expect the number of cells in each grid to be nearly equal; accordingly, the two genes are independent of each other. Then, we defined the tentative neighborhoods for the target cell c based on the given box size. The box size was used to determine the neighborhood size; for example, the box size of 0.2 means that there are 20% of n cells within the tentative neighborhood of cell c. To reduce the intrinsic fluctuation of gene expression values in scRNA-seq data, the tentative neighborhood was expanded outward to the closest grid boundaries and formed the final neighborhoods, \(G_{X}^{\left( c \right)}\) (light blue) and \(G_{Y}^{\left( c \right)}\)(medium blue), for genes X and Y, respectively (Fig. 1B). Within the final neighborhoods \(G_{X}^{\left( c \right)}\) and \(G_{Y}^{\left( c \right)}\), several sub-regions \(x^{\left( c \right)}\) and \(y^{\left( c \right)}\), could be identified based on the corresponding grids, respectively; in other words, \(x^{\left( c \right)}\) or \(y^{\left( c \right)}\) is separately a column or row of grids belonging to \(G_{X}^{\left( c \right)}\) or \(G_{Y}^{\left( c \right)}\). The intersection of the two final neighborhoods \(G_{X}^{\left( c \right)}\) and \(G_{Y}^{\left( c \right)}\) can directly produce the third box, called \(G_{XY}^{\left( c \right)}\) (dark blue), and is also divided into several sub-regions \(xy^{\left( c \right)}\) (Fig. 1B).In this study, we derived a local measurement from Eq. (1) to evaluate whether any given two genes X and Y, are dependent or independent in cell c. Thus, the MI score, \(I\left( {G_{X} ;G_{Y} } \right)^{\left( c \right)}\), is given as \(I\left( {G_{X} ;G_{Y} } \right)^{\left( c \right)} = H\left( {G_{X} } \right)^{\left( c \right)} + H\left( {G_{Y} } \right)^{\left( c \right)} – H\left( {G_{X} ,G_{Y} } \right)^{\left( c \right)}\). Then, the uncertainty and randomness of a random variable (i.e., a gene) in the probability distribution could be quantified by a measure of entropy. The entropy of \(G_{X}\) in cell c, \(H\left( {G_{X} } \right)^{\left( c \right)}\), is calculated as$$H\left( {G_{X} } \right)^{\left( c \right)} = – \mathop \sum \limits_{{x^{\left( c \right)} \in G_{X}^{\left( c \right)} }} p\left( {x^{\left( c \right)} } \right) \times \log p\left( {x^{\left( c \right)} } \right)$$
(3)
where \(p\left( {x^{\left( c \right)} } \right)\) is the probability for \(x^{\left( c \right)}\). Let \(n_{x}^{\left( c \right)}\), \(n_{y}^{\left( c \right)}\), and \(n_{xy}^{\left( c \right)}\) denote the number of cells inside \(x^{\left( c \right)}\), \(y^{\left( c \right)}\) and \(xy^{\left( c \right)}\), respectively, and the probability \(p\left( {x^{\left( c \right)} } \right)\) can be substituted by the frequency numerically:$$p\left( {x^{\left( c \right)} } \right) = \frac{{n_{x}^{\left( c \right)} }}{n}$$
(4)
\(H\left( {G_{Y} } \right)^{\left( c \right)}\) and \(H\left( {G_{X} ,G_{Y} } \right)^{\left( c \right)}\) were defined by following the same manners.Finally, we transformed the MI score obtained from entropy for every gene pair to a z score, \(z_{XY}^{\left( c \right)}\). \(z_{XY}^{\left( c \right)}\) is defined as$$z_{XY}^{\left( c \right)} = \frac{{I\left( {G_{X} ;G_{Y} } \right)^{\left( c \right)} – \mu_{XY} }}{{\sigma_{XY} }}$$
(5)
where \(z_{XY}^{\left( c \right)}\) denotes the significant level of the MI score between genes X and Y in cell c; \(\mu_{XY}\) and \(\sigma_{XY}\) are the mean and standard deviation of MI scores, respectively, between genes X and Y across all cells. Here, the z score is used to determine the presence of an edge (association) between any given two genes in a single-cell network (SCN). There is an edge if its z score reaches above the threshold (Fig. 1C). For all gene pairs and cells, we could eventually construct n SCNs for n cells after repeating this procedure. Like the GEM, the DM was constructed by counting and normalizing the number of edges connected to each gene in each SCN (Additional file 1: Note S1); thus, DM has the same column and row numbers as the GEM (Fig. 1D).Network-based clustering with dimension-reduction for identifying cell typesRecently, the DM has been used to provide new insights from network science perspectives in applying the existing scRNA-seq technique, including dimensionality-reduction, clustering, and visualization [19]. Here, we first applied the principal component analysis (PCA) [28] to reduce the DM (or GEM) to 20 dimensions and further reduce it to two dimensions for visualization using the t-distributed stochastic neighbor embedding (t-SNE) [29]. PCA and t-SNE represent linear and non-linear methods of dimensionality reduction, respectively. Second, we implemented k-means and hierarchical algorithms to perform clustering analysis. The k-means algorithm clusters data by separating samples into k groups of equal variance via minimizing the sum of the distances between the centroid and all member objects of the group. This algorithm requires the number of clusters to be specified. On the other hand, the hierarchical clustering builds nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a dendrogram, in which the root is the unique cluster that gathers all the samples and the leaves are the clusters with only one sample. Finally, we evaluated the clustering performances of different methods mainly based on eight performance indexes, including adjusted rand index (ARI), F-measure index (FMI), adjusted mutual information (AMI), completeness scores (CPT), Fowlkes-Mallows scores (FMS), homogeneity scores (HMG), and normalized mutual information (NMI). Note that we set the same parameters when performing dimension-reduction, clustering, and performance evaluation for SINUM DMs, CSN DMs, and GEMs.Pre-processing of scRNA-seq dataThe SCNs were constructed by the GEM of each scRNA-seq dataset and then transformed to the DM. Due to a large number of dropout events (i.e., zero values) in scRNA-seq data, we filtered out the genes expressed in less than ten cells and performed log2 transformation with a pseudo count of one on the raw matrix, i.e., raw matrix → GEM → SCNs → DM. Therefore, the DM has the same number of columns and rows as the GEM.The SINUM method possesses a certain degree of unavoidable time complexity because of computing \(m\left( {m – 1} \right)/2\) pair of genes when given a total of m genes. Thus, we first applied the FEAture SelecTion (FEAST) tool [30] to select the top 1,000 significant features from the raw matrix as representative genes for determining the suggested setting of two adjustable parameters in the SINUM method (i.e., box size and z-score threshold). The FEAST tool, designed for feature selection of scRNA-seq, can find clusters with high confidence by consensus clustering method, retains cells with high correlation with clusters, and finally identifies the significant features through F-statistics and ranking. The suggested parameters were determined by the ranking and re-ranking scores of the clustering performances based on the selected 1,000 genes (Fig. 2A). Specifically, we measured the ranking score for each dataset by sorting their F-measure scores for SINUM DMs in descending order using 25 different parameter combinations, including five different box sizes (0.05, 0.1, 0.15, 0.2, 0.25) and five different z-score thresholds (− 2,− 1, 0, 1, 2). Next, the re-ranking score for each parameter combination using k-means (or hierarchical clustering) was evaluated based on the mean of ranking scores across seven datasets. Finally, the re-ranking scores from two clustering methods were averaged to determine the overall performance. Since the smaller mean of re-ranking scores represents better performance, we used the box size = 0.2 and z-score > 0 as default parameters to build SINUM SCNs in this study.Fig. 2Clustering performances of DMs and GEMs on seven scRNA-seq datasets. A Distributions of ranking/re-ranking scores of clustering performances of SINUM DMs generated at different parameter combinations (on the x-axis, sorted by decreasing average re-ranking scores). In this analysis, we utilized the FEAST algorithm to select the top 1000 representative genes for each dataset and further performed the SINUM method to build SCNs and DMs. The k-means (green) and hierarchical (orange) clustering performances of these SINUM DMs on seven datasets (dots) were shown in the boxplot. The red dashed line represents the mean of the re-ranking scores evaluated by the ranking scores across seven datasets using two clustering methods. For each dataset, the ranking scores were measured by sorting the clustering performances at different parameter combinations between various box sizes and z score thresholds (for details, see Additional file 1: Tables S3 and S4). B Comparisons of k-means clustering performances of SINUM DMs (red), CSN DMs (blue), and GEMs (gray) on seven scRNA-seq datasets, evaluated by eight distinct performance indexes. Note that the SCNs were generated using whole GEMs. Each dot presents a dataset. ARI, adjusted rand index; FMI, F-measure index; AMI, adjusted mutual information; CPT, completeness scores; FMS, fowlkes-mallows scores; HMG, homogeneity scores; NMI, normalized mutual information; VMS, V-measure scoresScRNA-seq datasets for validationTo validate and compare SINUM SCNs, CSN SCNs, and GEM, we collected and downloaded seven ScRNA-seq datasets from the Gene Expression Omnibus (GEO) database [31]. The annotations of corresponding cell types provided by original works were assembled for each dataset. The brief introductions and summaries of these seven datasets are described as follows and listed in Additional file 1: Table S2.Chu-type dataset [32], including seven cell types and 1,018 cells, comprises the cells of human embryonic stem cell-derived lineage-specific progenitors. The cell types contain H1 embryonic stem cells, H9 embryonic stem cells, human foreskin fibroblasts, neuronal progenitor cells, definitive endoderm cells, endothelial cells, and trophoblast-like cells.Chu-time dataset [32], including six cell types and 758 cells, is composed of cells from human embryonic stem cells to produce definitive endoderm cells at different time points. The cell types contain six different time points, including 0 h, 12 h, 24 h, 36 h, 72 h, and 96 h of differentiation.Haring dataset [33], including 30 cell types and 1,545 cells, is composed of mouse brain dorsal horn cells. The cell types contain 15 inhibitory and 15 excitatory neuronal types, including the Gaba 1–15 and Glut 1–15, revealed by clustering cells with similar characteristics.Baron dataset [34] has six sub-datasets, including four human and two mice samples, the human sample 1 was collected and used in this study. The human sample 1 sub-dataset, containing 14 cell types and 1,937 cells, comprises individual pancreatic cells from one of four human donors. The cell types are composed of alpha, beta, delta, gamma, epsilon, acinar, ductal, quiescent stellate, activated stellate, endothelial, macrophage, mast, cytotoxic T, and Schwann cells.Romanov dataset [35], including seven cell types and 2,881 cells, is composed of the mouse neuron cells of the central column of the medial-ventral diencephalon. The cell types contain oligodendrocytes, astrocytes, ependymal cells, microglial cells, endothelial cells, vascular and smooth muscle lineage cells, and neurons.Yan dataset [4, 36], including five cell types and 124 cells, comprises the cells from human pre-implantation embryos and human embryonic stem cells. The cell types contain human embryonic stem cells and different stages of human preimplantation blastomere, containing oocyte, zygote, 2-cell, late developmental cells.Darmanis dataset [37], including nine cell types and 466 cells, comprises the cortex cells from adult and fetal human brain samples. The cell types contain astrocytes, neurons, oligodendrocytes, endothelial cells, oligodendrocyte precursor cells (OPCs), replicating neuronal progenitors (fetal- replicating), quiescent newly born neurons (fetal-quiescent).

Hot Topics

Related Articles