Detecting significant expression patterns in single-cell and spatial transcriptomics with a flexible computational approach

The SPIRAL protocolWe start by observing an expression matrix A with M rows (genes) and N columns (cells for scRNA-seq or spots for spatial transcriptomics). We assume that there is a set of genes acting in concordance with one another to execute some biological process on a set of cells. Some of these genes may be driving the process and some associated with it in other ways. The coordinated expression pattern can be inferred from the data, providing an indication of the biological process.We aim to find this process by finding a collection of pairs of columns (cells\(\backslash\)spots) in which the expression of the relevant genes changes significantly and concordantly.Pre-processing the dataTo eliminate the effect of outliers and technical artifacts, we filter out cells\(\backslash\)spots (scRNA-seq\(\backslash\)spatial transcriptomics respectively) with abnormally high or low number of expressed genes (which may be doublets and empty cells)1. Then, for the species homo sapiens, mus musculus, rattus norvegicus, drosophila melanogaster, danio rerio or saccharomyces cerevisiae, we identify mitochondrial genes by querying a local copy of the Ensembl database48 (version 110), and remove cells\(\backslash\)spots with a large percent of mitochondrial RNA, as this could suggest cell stress49,50. As a final pre-processing step, we normalize each cell\(\backslash\)spot to the median number of counts per cell\(\backslash\)spot.Computing representing cellsIn order to avoid dropouts and inaccuracies which stem from technical issues, we cluster the cells\(\backslash\)spots into small low variance clusters. Then, for each cluster, we consider its member cells\(\backslash\)spots to be biological replicates and compute their average expression profile. This average expression profile is the “representing cell” of that cluster. This idea is inspired by Iacono et al.51’s iCells. Specifically, the clustering is executed here by employing hierarchical agglomerative clustering with the ’ward’ linkage. The default number of clusters is set to a 100; However, users can change this value in the running process according to need and available computing power. After clustering, for datasets with 2000 cells\(\backslash\)spots or more, clusters with less than 10 cells\(\backslash\)spots are discarded. For datasets with less than 2000 cells\(\backslash\)spots, clusters are discarded is they have less than half the expected number of cells\(\backslash\)spots. For example, for a dataset with a 1000 cells, the expected number of cells per cluster is \(1000/100=10\), so clusters with less than 5 cells are discarded. For convenience, we refer to the representing cells as “repcells”.The expression profiles of all repcells are then stored in a new expression matrix \(A_c\) with size M by \(N_c\), whose rows are genes and columns are the repcells.This step improves accuracy by averaging biological replicates, eliminates dropouts, and reduces the number of samples while preserving all genes. The reduction in data size shortens the execution time of the SPIRAL algorithm without removing significant information from the data (as demonstrated on a synthetic dataset, see Fig. 2).Registering changes in expression valuesWe first standardize each gene separately. In this step, the (g, j)th entry in the normalized matrix \(\hat{A_c}\) is computed as \(\hat{A_c}(g,j)=\frac{A_c(g,j)-m_g}{\sigma _g}\), where \(m_g\) is the gene average expression and \(\sigma _g\) is its standard deviation over all \(N_c\) repcells. Explicitly: \(m_g=\frac{\sum _{j\in \{1,…,N_c\}} {A_c(g,j)}}{N_c}\), \(\sigma _g=\sqrt{\frac{\sum _{j\in \{1,…,N_c\}} {(A(g,j)-m_g)^2}}{N_c-1}}\).We compute the difference in expression values between every two repcells, for all genes, by multiplying \(\hat{A_c}\) on the right by a pairwise subtraction matrix B:$$\begin{aligned} S=\hat{A_c}\cdot B \end{aligned}$$
(1)
B is constructed so that all repcell pairs (in both directions) will be evaluated. It has \(N_c\) rows and \(N_c\cdot (N_c-1)\) columns. For example, for 4 repcells (\(N_c=4\)), B would be:$$\begin{aligned} B= \left[ {\begin{array}{cccccccccccc} 1 & 1 & 1 & -1 & 0 & 0 & -1 & 0 & 0 & -1 & 0 & 0\\ -1 & 0 & 0 & 1 & 1 & 1 & 0 & -1 & 0 & 0 & -1 & 0\\ 0 & -1 & 0 & 0 & -1 & 0 & 1 & 1 & 1 & 0 & 0 & -1\\ 0 & 0 & -1 & 0 & 0 & -1 & 0 & 0 & -1 & 1 & 1 & 1\\ \end{array} } \right] \end{aligned}$$The resulting matrix S represents the repcells’ differences. It has rows that correspond to genes, and columns that correspond to ordered repcell-pairs. An entry S(g, (i, j)) is the difference in expression levels of gene g between a repcell i and a repcell j, i.e:$$\begin{aligned} S(g,(i,j))=\hat{A_c}(g,j)-\hat{A_c}(g,i) \end{aligned}$$
(2)
Seeking dense sub-matricesWe now wish to find a set of genes (rows in S) and a set of repcell-pairs (columns in S) that would constitute a sub-matrix that is populated by almost exclusively high values (a formal definition is given later in this section). For this purpose we first binarize the matrix S such that$$\begin{aligned} S_{bin}(g,(i,j))= {\left\{ \begin{array}{ll} 1, & \text {if}\ S(g,(i,j))\ge \alpha \\ 0, & \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)
where \(\alpha\) is a parameter determining the minimal number of standard deviations (of a gene) to be considered a significant change in expression (since we can also write: \(S_{bin}(g,(i,j))=1 \iff {A_c}(g,j)-{A_c}(g,i)\ge \alpha \sigma _g\)). The results in this paper were obtained with \(\alpha \in \{0.5, 0.75, 1, 1.25, 1.5\}\). Users have the option to use this default set of values for \(\alpha\), or to adjust it to fit the data better.We now wish to find a dense sub-matrix in the binary matrix \(S_{bin}\). For that purpose, we introduce the following definition:
Definition 1
a \(\mu\)-dense sub-matrix is a collection of rows \(\Gamma\) and columns \(\Pi\) of a binary matrix such that every row in \(\Gamma\) has at least \(\mu |\Pi |\) non-zero values within the columns \(\Pi\) and every column in \(\Pi\) has at least \(\mu |\Gamma |\) non-zero values within the rows \(\Gamma\).
Formally: \(S_{bin}(\Gamma ,\Pi )\) is \(\mu\)-dense if:$$\begin{aligned} \begin{aligned} \sum S_{bin}(g,\Pi ) \ge \mu |\Pi | \;\;\; \forall g \in \Gamma \;\;\;\;\;\;\;\;\;\;\; \\ \sum S_{bin}(\Gamma ,(i,j)) \ge \mu |\Gamma | \;\;\; \forall (i,j) \in \Pi \end{aligned} \end{aligned}$$
(4)
where \(|\Gamma |\) is the number of rows in \(\Gamma\), \(|\Pi |\) is the number of columns in \(\Pi\), and \(\mu\) is a density parameter in the range \(( \, 0,1] \,\), which, in practice, is typically close to 1.The iterative dense sub-matrix algorithm Inspired by both Koyuturk et al.43 and Uitert et al.44, we aim to find \(\mu\)-dense sub-matrices by using an iterative algorithm (see Algorithm 1). The algorithm initiates each of its T iterations with a random path of repcells with length L, and then repeatedly improves its choice of genes (rows) and repcell-pairs (columns), until convergence to a single structure (when the structure remains unchanged after executing the while loop). The results in this paper were obtained with \(\mu \in \{0.9, 0.95\}\), \(L=3\) and \(T=10000\). Users have the option to change the set of values for \(\mu\).Algorithm 1The iterative dense sub-matrix algorithmEvaluating the statistical significance of a given sub-matrixWe now wish to calculate a score representing the statistical significance of a \(\mu\)-dense sub-matrix of the matrix S, which consists of the set of columns (repcell-pairs) \(\Pi\) and the set of rows (genes) \(\Gamma\). We clarify that although the sub-matrix was found using a binarized version of S, the significance evaluation is done on the original S which contains continuous values.This sub-matrix in S was originally produced by$$\begin{aligned} S(\Gamma ,\Pi )=\hat{A_c}(\Gamma ,:) \cdot B(:,\Pi ) \end{aligned}$$
(5)
Computing the statistical significance of a single gene and a set of repcell-pairs. For the null model, we assume that a standardized expression value of a gene g in a repcell i is distributed as \(\mathscr {N}(0,1)\). Moreover, we assume independence and thus the set of standardized expression values \(\varvec{X_g}\) of a gene g in all \(N_c\) repcells is distributed as$$\begin{aligned} \varvec{X_g}\sim \mathscr {N}_{N_c}(0,I) \end{aligned}$$
(6)
and any row g of \(\hat{A_c}\) is, under the null model, a single instance drawn from that distribution.For a set of repcell-pairs \(\Pi\) and a single gene g, we now define \(\varvec{Y_g^{\Pi }}\) to be an affine transformation of \(\varvec{X_g}\) (both \(\varvec{Y_g^{\Pi }}\) and \(\varvec{X_g}\) are row vectors):$$\begin{aligned} \varvec{Y_g^{\Pi }}=\varvec{X_g} \cdot B(:,\Pi ) \end{aligned}$$
(7)
In the null case, the set of values \(S(g,\Pi )\) is a single draw of \(\varvec{Y_g^{\Pi }}\).At this point we wish to clarify that although the standardized expression values of a gene g (denoted by the vector \(\varvec{X_g}\)) are assumed, under the null model, to be collectively independent, such assumption regarding their pairwise differences (represented by \(\varvec{Y_g^{\Pi }}\)) is not valid. Also, if \(B(:,\Pi )\) is not full rank, \(\varvec{Y_g^{\Pi }}\)’s distribution is singular.We now define:$$\begin{aligned} \overline{\varvec{Y_g^{\Pi }}}=\varvec{Y_g^{\Pi }} \cdot u^{|\Pi |} \end{aligned}$$
(8)
where \(u^{|\Pi |}\) is a column vector with all elements equal to \(\frac{1}{|\Pi |}\), i.e. an averaging vector.We can conclude from Eqs. (6), (7) and (8) that \(\overline{\varvec{Y_g^{\Pi }}}\) is a real valued normal variable with mean 0 and variance \(\widetilde{\sigma }^2 = {u^{|\Pi |}}^T \cdot B(:,\Pi )^T\cdot B(:,\Pi ) \cdot u^{|\Pi |}\). That is,$$\begin{aligned} \overline{\varvec{Y_g^{\Pi }}}\sim \mathscr {N}(0,\widetilde{\sigma }^2={u^{|\Pi |}}^T \cdot B(:,\Pi )^T\cdot B(:,\Pi ) \cdot u^{|\Pi |}) \end{aligned}$$
(9)
(see Fig. 8).Therefore, the significance of the observed average change in expression values of gene g over the set of repcell-pairs \(\Pi\) can be computed as$$\begin{aligned} pVal(g,\Pi ) = 1-\Phi \Big (\frac{S(g,\Pi )\cdot u^{|\Pi |}}{\sqrt{{u^{|\Pi |}}^T \cdot B(:,\Pi )^T\cdot B(:,\Pi ) \cdot u^{|\Pi |}}} \Big ) \end{aligned}$$
(10)
where \(\Phi\) is the cumulative distribution function of the standard normal distribution, the numerator is the observed average of values in \(S(g,\Pi )\) (the average of the observed expression differences of g throughout \(\Pi\)), and the denominator is \(\overline{\varvec{Y_g^{\Pi }}}\)’s standard deviation, as developed above.We note that this test is one-sided, since we observe the value of the average change in expression of a gene g over the repcell-pairs \(\Pi\) (this value is: \(S(g,\Pi )\cdot u^{|\Pi |}\)), and then compute its p-value \(pVal(g,\Pi )\) as the probability of observing a value this large (or larger).Figure 8A flow chart of the null assumption of our model.Computing the significance of a set of genes and a set of repcell-pairs. For the null model we also assume that the genes are collectively independent. We can then compute the significance of a given sub-matrix with columns (repcell-pairs) \(\Pi\) and rows (genes) \(\Gamma\) by multiplying the significance scores of the individual genes:$$\begin{aligned} pVal_{init}(\Gamma ,\Pi ) = \prod _{g \in \Gamma } {pVal(g,\Pi )} \end{aligned}$$
(11)
Correcting for multiple testing. We correct the initial p-value for multiple testing as follows$$\begin{aligned} pVal(\Gamma ,\Pi ) = \left( {\begin{array}{c}M\\ |\Gamma |\end{array}}\right) {\left( {\begin{array}{c}N_c\cdot (N_c-1)\\ |\Pi |\end{array}}\right) }\cdot pVal_{init}(\Gamma ,\Pi ) \end{aligned}$$
(12)
where M is the total number of genes and \(N_c\) is the total number of repcells. Note that \(N_c\cdot (N_c-1)\) is then the total number of repcell-pairs.Evaluating the structure of repcells consisting of the sub-matrix columnsThe significance score indicates how statistically rare a discovered sub-matrix would be under the null hypothesis. However, we argue that due to the properties of real-life data, there is a need for a second measure to assess biological significance. For that purpose, we examine the structure consisted of the set of repcell-pairs.We may think of a set of repcell-pairs \(\Pi =\{(i,j),(t,s),…\}\) as a network in which the nodes are repcells, and an arrow is drawn from repcell i to repcell j only if the repcell-pair (i, j) is in \(\Pi\).Using three toy examples, we will now demonstrate that different sub-matrices may have similar significance scores (and also, similar number of repcell-pairs), but very different contributions to the understanding of biological processes in the data. We will then offer a second measure that addresses biological significance.Consider the structures in Fig. 9: All three have 33 repcell-pairs (arrows), and their numbers of participating genes are 100, 30 and 50 for structures (a),(b),(c) respectively. We assume \(\alpha =1\), \(\mu =1\), and that the difference in expression values along each of the arrows in structures (a) and (b) is exactly 1 for each of the genes (i.e. \(S(g,(i,j))=1 \;\; \forall g\in \Gamma , (i,j)\in \Pi\)), and for structure (c) this is also true for all arrows but the ones connecting the first and the last layers (i.e. \(S(g,(i,j))=1 \;\; \text {if} \;\; \big [(i\in \{0,1,2\} \; \& \; j\in \{3,4,5,6\}) \;||\; (i\in \{3,4,5,6\} \; \& \; j\in \{7,8,9\})\big ]\), otherwise \(S(g,(i,j))=2\)). In this case their initial significance scores (computed as in Eq. (11)) are 1e-79.0, 1e-68.5 and 1e-64.4 respectively. Despite the similarity in significance scores, it is clear that while structures (b) and (c) might suggest evidence to a biological process in the data, structure (a) might as well be a result of a technical issue involving one outlier.As mentioned before (see Eq. (9)), the average of differences along all repcell-pairs in the structure is distributed in the null case as a normal variable. Consider the standard deviation of this variable, denoted by \(\widetilde{\sigma }\). When it is lower, the structure is more complex and might offer more biological insights. For structures (a), (b) and (c), \(\widetilde{\sigma }\) is 1.02, 0.58 and 0.52 respectively, allowing us to easily determine which of the structures is more informative.Figure 9Toy examples of structures.Filtering sub-matricesAfter evaluating the statistical significance of the resulting sub-matrices, we filter out non-significant ones (p-value above 1e-15). Note that in random data, we get no structures with p-value below this threshold. Then, we perform a disjointification-like process to avoid having very similar sub-matrices in the algorithm output list. The process begins by sorting the sub-matrices in an ascending order of their structures’ \(\widetilde{\sigma }\)s (see former section). Then, we go through the list of sub-matrices and remove every sub-matrix that is similar to any of the former sub-matrices in the list, where two sub-matrices are considered similar if the Jaccard index of their gene lists is above 0.75 or if the Jaccard index of their repcell-pair lists is above 0.5. At this point, different datasets may result in very different numbers of structures. To assure that the resulted number of structures is within a certain range, we cut the structure list (still, sorted by structures’ \(\widetilde{\sigma }\)’s) at \(\min (\max (3, n_{\widetilde{\sigma }\le 0.8}), 50)\), where \(n_{\widetilde{\sigma }\le 0.8}\) is the number of structures with \(\widetilde{\sigma }\le 0.8\).Post-filtering of structures: At SPIRAL’s results panel, a user may narrow down the list of structures by adjusting the Jaccard index threshold (JaccThr) of the gene lists.SPIRAL visualizations and gene enrichmentProducing the network layout and partitioning the repcells to layersFor structures that can be drawn as a bipartite graph (such as the toy examples in Figs. 9a and 9b), we can label the repcells on the left (layer 1 in the network) as “low expression” and the repcells on the right (layer 2) as “high expression”. However, for other structures, there are several options of partitioning into network layers. We chose to draw the network from right to left, so that the arrows are short as possible (see Algorithm 2). After the partition is established, we assume that in general the expression levels of the structure genes are increasing from left to right: the repcells on layer i would likely have lower expression values of the structure genes than the repcells on layer \(i+1\). In the simple network visualization, the repcells are colored based on their assigned layer. However, to test the former assumption, SPIRAL also produces a network visualization in which the repcells are colored based on their average expression level of the structure genes (see layout E.1 in Fig. 2d and network layouts in Fig. 3a). The sizes of nodes in the network layout correspond to their degree: the number of edges connecting them to other repcells. This is a visual indication of the level of connectivity of each repcell in the structure.Algorithm 2The network drawing processVisualizing significant sub-matrices on a PCA\(\backslash\)UMAP layout of repcellsSPIRAL visualizes structures over a PCA \(\backslash\)UMAP layout by coloring the repcells based on their assigned layers (see Methods, “Evaluating the structure of repcells consisting of the sub-matrix columns”; layout E.2 in Fig. 2d). As with the network layout, SPIRAL also produces a more informative visualization by coloring the repcells based on their average expression level of the structure genes.Visualizing significant sub-matrices on a PCA\(\backslash\)UMAP\(\backslash\)spatial layout of cells\(\backslash\)spotsSPIRAL visualizes structures on a PCA\(\backslash\)UMAP layout of the original cells\(\backslash\)spots, and on the spatial coordinates of the spots, by assigning each cell\(\backslash\)spot with a layer based on its repcell (see Methods, “Evaluating the structure of repcells consisting of the sub-matrix columns”; PCA\(\backslash\)UMAP layouts: layout E.3 in Fig. 2d and middle-left pane of each of structures in Fig. 3a; spatial layouts: Fig. 6a). SPIRAL also offers a similar visualization, in which the assigned layers are ignored, and instead- the cells\(\backslash\)spots are colored based on their average expression level of the structure genes (PCA\(\backslash\)UMAP layouts: layouts A-F in Fig. 2d, middle-right pane of each of structures in Fig. 3a, and Fig. 4b; spatial layouts: Figs. 5a and 7a).Evaluating the set of genes \(\Gamma\)
In order to assess whether the structure genes are enriched for known biological processes or functions, SPIRAL performs a GO enrichment query through GOrilla52,53, using the “two unranked lists of genes” mode.The structure gene list is used as the target list, while all the genes in the data set are used as the background list. All three GO ontologies are queried: process, function and component. Then, GO terms with p-values of less than 0.01 are saved to the results file.Producing synthetic dataWe used Splatter22 to synthesize scRNA-seq data with a branching lineage of 7000 cells and 10000 genes. The lineage was constructed such that each cell belongs to one of three paths (with probabilities 0.2, 0.2, 0.6 to belong to Path1, Path2, Path3 respectively). Path2 and Path3 begin where Path1 ends. Also- the location of each cell along its path (i.e. step) is also known (overall 1000, 1000, 3000 steps in Path1, Path2, Path3 respectively). We defined the differential expression factor “de.facLoc” to be 0.01, which is considered small (which translates to small differences between cells in different paths)54. We then used the lineage rowdata file to construct a list of “true gene modules” by selecting the genes with differential expression factor (DEFac) above 1.1 or below 0.9 (separately), for each path, creating overall six true gene modules for the three paths.Implementation of other methodsSPIRAL outputs structures, each contains a set of genes working in concordance with one another. Thus, we sought to compare SPIRAL to other methods that output multiple sets of genes.Seurat55,56 is a commonly used pipeline which performs clustering of the cells (for single cell data) or spots (for spatial data), and outputs differentially expressed genes for each cluster with the “FindAllMarkers” function.Slingshot57 is a well-established package for pseudo-time inference3, which may be accompanied by the tradeSeq58 pipeline to find sets of genes with similar expression patterns with the “clusterExpressionPatterns” function. As there is no default value for the “npoints” parameter required for this function, we ran it with “npoints” equals 20 and 200. We then filtered out sets with less than 20 genes.Hotspot59 is a graph-based procedure to identify informative genes and gene modules in single cell and spatial datasets. In Hotspot, the user has to define a model and the “n_neighbors” parameter. We ran Hotspot with the ’danb’ model and “n_neighbors” equals 30 and 300 for all datasets but the Zebrafish differentiation dataset. Several attempts to run Hotspot on the this dataset ended in error (probably due to the relatively large number of cells: 10, 500). We also tried to run Hotspot with the ’bernoulli’ model but unfortunately we encountered errors.Nonsmooth nonnegative matrix factorization (nsNMF) is a procedure that can be used to find gene modules in single cell and spatial datasets, as done in Moncada et al.60, Carmona-Saez et al.61. We applied this procedure by using the R NMF package62.SingleCellHaystack2 is a clustering-independent method for finding differentially expressed genes in single-cell and spatial data, which takes as input the expression matrix as well as dimensionality reduction coordinates for single cell data, or spatial coordinates for spatial data. After detecting differentially expressed genes, it clusters them to k modules, where k is a user defined parameter. We ran SingleCellHaystack with k equals 3, 5 and 6 for the synthetic dataset, and with k equals 3 and 5 for the real datasets.SpatialDE16 is a method to identify spatially variable genes in spatial datasets. These genes may later be clustered to C clusters, where C is a user-defined parameter. We ran SpatialDE with C equals 3 and 5 for all spatial datasets.We ran these methods on a standard Linux machine (64GB RAM, 10 cores).Generating bulk RNA-seq dataThe following pipeline was used to generate the bulk RNA-seq dataset of human embryonic stem cells (hESCs) differentiation.Spontaneous differentiation of hESCsSuspension cultures of TE03 cells were routinely maintained in hESC-medium with 100 ng/ml bFGF. To induce spontaneous differentiation, cells were transferred to differentiation medium consisting of DMEM/F-12 supplemented with 10% FBS (BI, Cat\(\#\) 04-002-1A), 10% KnockOut Serum Replacement (KSR), 1 mM L-glutamine, 1% non-essential amino acids, and 0.1 mM \(\beta\)-mercaptoethanol, media changed every 1-2 days. After one week media was replaced with FBS media, consisting of DMEM supplemented with 20% FBS 1 mM L-glutamine, 1% non-essential amino acids, and 0.1 mM \(\beta\)-mercaptoethanol, media changed every 2-4 days for 2 weeks. Samples collected on the indicated days, RNA extracted by TRI-reagent according to protocol.Library preparation and sequencing2 ng RNA of each sample was taken for library preparation using the CEL-Seq2 protocol63. Briefly, The CEL-Seq primer selects for polyA RNA via an anchored polyT stretch, and adds a sample specific barcode, a UMI, the Illumina adaptor and a T7 promoter sequence to the resulting dsDNA (2 different CEL-Seq primers added to each sample as technical replicates). At this point, multiple samples can be pooled together, and an In-Vitro Transcription (IVT) reaction is performed to linearly amplify the RNA. The second Illumina adaptor is introduced at a second Reverse Transcription step (RT) as an overhang to a random hexamer primer. A short PCR reaction selects for the 3’ most fragments and add the full Illumina adaptors needed for sequencing. Sample and molecule origin are identified by read 1, and gene of origin is identified by read 2. Library was sequenced on Nextseq550, 25 bases for read 1 and 60 bases for read 2.Bioinformatic analysisDemultiplexing was performed using Je-Demultiplex64 in Galaxy platform. R2 reads were split into their original samples using the CEL-Seq barcode from R1. Reads from two technical replicates were joint for further processing. The reads were cleaned using Cutadapt65 for removal of adaptors, polyA, low-quality sequences (Phred\(< 20\)) and short reads (\(<30\)bp after trimming). Reads were mapped to the GRCh38 genome (Ensembl) using STAR package66 to create Bam files. We used SAMtools67 to convert BAM to SAM file and index the files. Then the reads were annotated and counted using HTseq-count package68.

Hot Topics

Related Articles