Interpretable spatially aware dimension reduction of spatial transcriptomics with STAMP

STAMP model

Section 1: model composition

STAMP is a Bayesian model that decomposes the expression xng of gene g in cell n into the topic proportion znk (the proportion of the k-th topic in cell n) and the gene embedding βkg of gene g in the k-th topic. We model the gene expression xng as a sample drawn from a Gamma-Poisson distribution, where μng is the mean expression of gene g in cell n and αg is a gene-specific dispersion term:$${x}_{{ng}} \sim {\rm{GammaPoisson}}\left({\mu }_{{ng}},{\alpha }_{g}\right)$$
Mean term μng

We model the mean term μng as a sum over the K topics of the product of the library size ln, the topic proportion znk and the gene embedding βkg:$${\mu }_{{ng}}={l}_{n}\sum _{k}{z}_{{nk}}{\beta }_{{kg}}$$where k indexes the topics.

Library size ln

We model the library size as the observed total gene counts for each cell:$${l}_{n}=\sum _{g}{x}_{{ng}}$$
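For concreteness, the three pieces defined so far (library size, mean term and Gamma-Poisson likelihood) can be sketched as follows. This is a minimal illustration with made-up dimensions; the mapping from (μng, αg) to the (concentration, rate) parametrization of Pyro's GammaPoisson distribution is one standard choice and not necessarily the authors' exact implementation.

```python
import torch
import pyro.distributions as dist

# Hypothetical inputs: counts X (cells x genes), topic proportions Z
# (cells x topics), gene embeddings B (topics x genes), dispersions alpha.
N, K, G = 100, 10, 500
X = torch.randint(0, 20, (N, G)).float()
Z = torch.softmax(torch.randn(N, K), dim=-1)   # rows sum to 1 over topics
B = torch.softmax(torch.randn(K, G), dim=-1)   # rows sum to 1 over genes
alpha = torch.rand(G) + 0.1                    # gene-wise dispersion

l = X.sum(dim=-1, keepdim=True)                # library size l_n
mu = l * (Z @ B)                               # mu_ng = l_n * sum_k z_nk * beta_kg

# Gamma-Poisson with mean mu and dispersion alpha: under this mapping the
# compound GammaPoisson(concentration, rate) has mean mu and variance
# mu + alpha * mu^2 (a standard negative-binomial parametrization).
x_sample = dist.GammaPoisson(concentration=1.0 / alpha,
                             rate=1.0 / (alpha * mu)).sample()
```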

Topic proportion znk

We model the topic proportion with a logistic-normal prior: a normal draw that is softmaxed over the topic dimension so that the topic proportions of each cell sum to 1. We model the covariance across topics with the low-rank-plus-diagonal decomposition UUᵀ + σI, where I is the identity matrix. The rank of the covariance matrix (the dimensionality of uk) is a hyperparameter that we set equal to the number of topics.$${u}_{k}\sim {\rm{Normal}}\left(0,\bf{I}\,\right)$$$$\sigma \sim {\rm{HalfCauchy}}\left(1\right)$$$${\widetilde{z}}_{n}\sim {\rm{Normal}}\left(0,\bf{UU}^{\rm{T}}+\sigma \bf{I}\right)$$$${z}_{{nk}}={\rm{softmax}}_{k}\left({\widetilde{z}}_{{nk}}\right)$$
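A minimal Pyro sketch of this prior, under the same illustrative conventions (variable names are ours):

```python
import torch
import pyro
import pyro.distributions as dist

K = 10  # number of topics; rank of U set equal to K, as in the text

def topic_proportion_prior(n_cells):
    # u_k ~ Normal(0, I): low-rank factor of the topic covariance
    U = pyro.sample("U", dist.Normal(torch.zeros(K, K), 1.0).to_event(2))
    sigma = pyro.sample("sigma", dist.HalfCauchy(torch.tensor(1.0)))
    cov = U @ U.T + sigma * torch.eye(K)   # UU^T + sigma * I
    with pyro.plate("cells", n_cells):
        z_tilde = pyro.sample(
            "z_tilde", dist.MultivariateNormal(torch.zeros(K), cov))
    return torch.softmax(z_tilde, dim=-1)  # z_nk sums to 1 over topics
```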

Gene embedding βkg

Scenario 1: single sample
We model the gene embedding βkg as the sum of a background residual term rg and a gene module score wkg. The residual term rg captures the gene signal shared across all topics, downweighting genes that are highly expressed in every topic. We model it with a normal prior centered on the log of the observed log-normalized mean \({\bar{x}}_{g}\), with a small \(\epsilon\) set to 1 × 10−8 to ensure positivity. The gene module score wkg is modeled with a structured horseshoe prior that introduces shrinkage, selectively loading a gene onto a topic only if the data carry sufficient information to overcome the prior, thereby yielding more robust gene modules.$${r}_{g}\sim {\rm{Normal}}\left(\log (\bar{x}_{{\boldsymbol{g}}}+\epsilon ),1\right)$$$${\delta }_{g}\sim {\rm{HalfCauchy}}\left(1\right)$$$${\tau }_{k}\sim {\rm{HalfCauchy}}\left(1\right)$$$${\lambda }_{{kg}}\sim {\rm{HalfCauchy}}\left(1\right)$$$$c\sim {\rm{InverseGamma}}\left(0.5,0.5\right)$$$${\widetilde{\lambda }}_{{kg}}={\delta }_{g}{\tau }_{k}{\lambda }_{{kg}}$$$${w}_{{kg}}\sim {\rm{Normal}}\left(0,\frac{{c}^{2}{{\widetilde{\lambda }}^{2}}_{{kg}}}{{c}^{2}+{{\widetilde{\lambda }}^{2}}_{{kg}}}\right)$$$${\beta }_{{kg}}={\rm{softmax}}_{g}\left({w}_{{kg}} + {r}_{g}\right)$$Each level of the structured horseshoe hierarchy contributes to the structured sparsity. Specifically, δg controls gene-wise sparsity: when δg approaches 0, the prior on wkg approaches a spike at zero for gene g, deeming the gene irrelevant for all topics. The same principle applies to τk, which encourages topic-wise shrinkage. λkg controls element-wise shrinkage, allowing each gene module to turn off specific genes that are irrelevant to its associated topic. When \({c}^{2}\ll {{\widetilde{\lambda }}^{2}}_{{kg}}\), the prior on wkg approaches Normal(0, c2), regularizing wkg even when the parameters are weakly identified.
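A hypothetical Pyro sketch of the structured horseshoe hierarchy, where x_mean is the observed log-normalized mean expression per gene (length-G vector); we read the second argument of the Normal prior on wkg as a variance, hence the square root when converting to Pyro's scale parameter:

```python
import torch
import pyro
import pyro.distributions as dist

def gene_module_prior(K, G, x_mean, eps=1e-8):
    # Background residual r_g, centred on the log observed mean expression
    r = pyro.sample("r", dist.Normal(torch.log(x_mean + eps),
                                     torch.ones(G)).to_event(1))
    # Three levels of shrinkage: gene-wise, topic-wise and element-wise
    delta = pyro.sample("delta", dist.HalfCauchy(torch.ones(G)).to_event(1))
    tau = pyro.sample("tau", dist.HalfCauchy(torch.ones(K)).to_event(1))
    lam = pyro.sample("lam", dist.HalfCauchy(torch.ones(K, G)).to_event(2))
    c = pyro.sample("c", dist.InverseGamma(torch.tensor(0.5),
                                           torch.tensor(0.5)))
    lam_tilde = delta * tau.unsqueeze(-1) * lam   # delta_g * tau_k * lambda_kg
    # Regularized horseshoe scale: ~lam_tilde when small, capped near c when large
    sd = torch.sqrt(c**2 * lam_tilde**2 / (c**2 + lam_tilde**2))
    w = pyro.sample("w", dist.Normal(torch.zeros(K, G), sd).to_event(2))
    return torch.softmax(w + r, dim=-1)           # beta_kg, sums to 1 over genes
```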
Scenario 2: multiple samples
For each batch s, we model the gene-wise batch effect of gene g, \({\delta }_{{sg}}^{{batch}}\), with a zero-mean Student's t-distribution with a small s.d. of 0.01 and 10 degrees of freedom, as we expect most genes to be unaffected by batch effects. The Student's t-distribution has heavier tails than the normal distribution, allowing it to accommodate the large batch effects present in the data. The topic-wise batch effect of topic k, \({\tau }_{k}^{{batch}}\), is modeled with a Beta prior, controlling the amount of batch effect present in each topic; this prior puts most of its mass near 0 and 1. The batch-effect contribution \({w}_{{skg}}^{{batch}}\) is then given by the outer product of the topic-wise and gene-wise batch effect terms. Finally, the batch-specific gene embedding \({\beta }_{{skg}}^{{batch}}\) is the sum of \({w}_{{skg}}^{{batch}}\), the gene module score wkg and the background residual rg; the latter two terms are retained from the previous scenario.$${\delta }_{{sg}}^{{batch}}\sim {\rm{StudentT}}\left(10,0,0.01\right)$$$${\tau }_{k}^{\,{batch}}\sim {\rm{Beta}}\left(0.5,0.5\right)$$$${w}_{{skg}}^{{batch}}={\tau }_{k}^{\,{batch}}\otimes {\delta }_{{sg}}^{{batch}}$$$${\beta }_{{skg}}^{{batch}}={\rm{softmax}}_{g}\left({w}_{{kg}}+{r}_{g}+{w}_{{skg}}^{{batch}}\right)$$
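Under the same illustrative conventions, the batch-effect construction might look as follows (shapes assumed to be S batches × K topics × G genes; a sketch, not the authors' code):

```python
import torch
import pyro
import pyro.distributions as dist

def batch_embedding(w, r, n_batches):
    # w: (K, G) gene module scores; r: (G,) background residual
    K, G = w.shape
    delta_b = pyro.sample("delta_batch", dist.StudentT(
        df=10.0, loc=torch.zeros(n_batches, G), scale=0.01).to_event(2))
    tau_b = pyro.sample("tau_batch",
                        dist.Beta(torch.full((K,), 0.5),
                                  torch.full((K,), 0.5)).to_event(1))
    # Outer product: batch effect for batch s, topic k, gene g
    w_batch = tau_b[None, :, None] * delta_b[:, None, :]   # (S, K, G)
    return torch.softmax(w[None] + r + w_batch, dim=-1)    # beta_skg, (S, K, G)
```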
Scenario 3: time series
In the context of time-series data, we expect the gene module wkg to vary across time points t. Here we model each wkg as an independent Gaussian process with a Matern 3/2 kernel. The Matern 3/2 kernel κ(t, t′) has two hyperparameters: the output variance \({\widetilde{\sigma}}_{{kg}}^{2}\), which controls the average distance of the function from its mean, and the length scale l, which determines the smoothness of the function. We fix l to 1, as we want to recover gene modules that are coherent across time, and give \({\widetilde{\sigma }}_{{kg}}^{2}\) the regularized horseshoe prior described in the previous section. Finally, we place a normal prior on the background residual term rtg at each time point for each gene.$${\delta }_{g}\sim {\rm{HalfCauchy}}\left(1\right)$$$${\tau }_{k}\sim {\rm{HalfCauchy}}\left(1\right)$$$${\lambda }_{{kg}}\sim {\rm{HalfCauchy}}\left(1\right)$$$$c\sim {\rm{InverseGamma}}\left(0.5,0.5\right)$$$${\widetilde{\lambda }}_{{kg}}={\delta }_{g}{\tau }_{k}{\lambda }_{{kg}}$$$${\widetilde{\sigma }}_{{kg}}^{2}=\frac{{c}^{2}{{\widetilde{\lambda }}^{2}}_{{kg}}}{{c}^{2}+{{\widetilde{\lambda }}^{2}}_{{kg}}}$$$${\kappa }_{{kg}}\left(t,{t}^{{\prime} }\right)={\widetilde{\sigma }}_{{kg}}^{2}\left(1+\frac{\sqrt{3}\left|t-{t}^{{\prime} }\right|}{l}\right){e}^{-\frac{\sqrt{3}\left|t-{t}^{{\prime} }\right|}{l}}$$$${w}_{{kg}}\sim {GP}\left(0,{\bf{K}}_{{kg}}\right)$$$${r}_{{tg}}\sim {\rm{Normal}}\left(\log (\bar{x}_{{tg}}+\epsilon ),1\right)$$$${\beta }_{{tkg}}={\rm{softmax}}_{g}\left({w}_{{tkg}}+{r}_{{tg}}\right)$$
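For concreteness, a small function computing the Matern 3/2 kernel matrix (with the negative exponent) over a vector of time points; names are illustrative:

```python
import torch

def matern32_kernel(t, sigma2, lengthscale=1.0):
    # t: (T,) vector of time points; sigma2: output variance for one
    # gene-topic pair; lengthscale fixed to 1, as in the text.
    d = torch.abs(t[:, None] - t[None, :])        # |t - t'|
    scaled = (3.0 ** 0.5) * d / lengthscale
    return sigma2 * (1.0 + scaled) * torch.exp(-scaled)

# Example: the covariance matrix over 8 equally spaced time points
t = torch.arange(8.0)
Kmat = matern32_kernel(t, sigma2=torch.tensor(1.0))
```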

Dispersion term αg

We model the square root of the dispersion with a half-Cauchy prior. This prior places most of its mass near zero, reflecting that most genes do not exhibit overdispersion.$$\sqrt{{\alpha }_{g}} \sim {\rm{HalfCauchy}}\left(1\right)$$
Section 2: inference

Black-box variational inference is used to approximate the posterior, building on the automatic differentiation variational inference (ADVI) framework of Pyro71. The joint probability for the single-sample scenario is given by$$\begin{array}{l}p\left(\bf{X},\bf{\Theta} ,{\bf{Z}}\right)=p\left(\bf{\Delta} \right)p\left(\bf{\Lambda} \right)p\left({\bf{T}}\right)p\left(c\right)p\left({\bf{W}}|{\bf{T}},\bf{\Lambda} ,\bf{\Delta} ,{c}\right)p\left(\bf{U}\right)\\p\left(\sigma \right)p\left(\bf{Z}{\rm{|}}\bf{U},\sigma \right)p\left(\bf{A}\right)p\left(\bf{R}\right)p\left(\bf{X}|{\bf{W}},{\bf{Z}},{\bf{A}},{\bf{R}}\right),\end{array}$$where Θ = Δ, Λ, T, c, W, U, σ, A, R and$$\begin{array}{l}\bf{X}=\left\{{x}_{ng}\right\},{\bf{T}}=\left\{{\tau }_{k}\right\},\bf{\varDelta} =\{{\delta}_{g}\},\bf{\varLambda} =\left\{{\lambda }_{{kg}}\right\},{\bf{A}}=\left\{{\alpha }_{g}\right\},\bf{W}\\\quad\,=\left\{{w}_{{kg}}\right\},\bf{Z}=\left\{{z}_{{nk}}\right\},\bf{R}=\left\{{r}_{g}\right\},\bf{U}=\left\{{u}_{k}\right\}.\end{array}$$

To approximate the posterior p(Θ|X), we choose the mean-field variational family \({q}_{\phi }\left(\bf{\Theta} \right)=\prod {q}_{\phi }\left(\Theta \right)\), which factorizes into an independent distribution for each parameter Θ. Specifically, we use normal distributions for the real-valued parameters W, U, R and lognormal distributions for the positive parameters Δ, Λ, T, c, σ and A, ensuring that each has the same support as its prior.

To approximate the posterior p(Z|X), we use a simplified graph convolutional network that also takes in spatial information. The network takes the gene expression counts and the spatial adjacency graph as input, and outputs the mean Zu and variance Zσ of the variational distribution qϕ(Z|X, A).$$\widetilde{\bf{A}}=\bf{A}+\bf{I}$$$$\bf{S}={\widetilde{\bf{D}}}^{-\frac{1}{2}}\widetilde{\bf{A}}{\widetilde{\bf{D}}}^{-\frac{1}{2}}$$$${\bf{Z}}_{u},{\bf{Z}}_{\sigma }={NN}\left(\left[\bf{X},\bf{SX},\ldots ,{\bf{S}}^{l}\bf{X}\right]\right)$$$${q}_{\phi }\left(\bf{Z}|\bf{X},\bf{A}\right)={\rm{softmax}}_{k}\left({\rm{Normal}}\left({\bf{Z}}_{u},{\bf{Z}}_{\sigma }\right)\right),$$where A is the adjacency matrix, I is the identity matrix, \(\widetilde{\bf{D}}\) is the degree matrix of \(\widetilde{\bf{A}}\), X is the gene expression matrix and NN is the neural network used. The ELBO is therefore given by$$L={E}_{{q}_{\phi \left(\bf{\Theta} ,{\bf{Z}}\right)}}\left[\log p\left(\bf{X},\bf{\Theta} ,{\bf{Z}}\right)-\log {q}_{\phi }\left(\bf{\Theta} ,{\bf{Z}}\right)\right],$$where ϕ denotes the learnable parameters of both the graph neural network and the variational posteriors. The first term is the expectation of the joint log density with respect to the variational distribution q; the second is the entropy of the variational distribution. The gradient of the ELBO is given by$$\nabla L={\nabla }_{\phi }{E}_{{q}_{\phi \left(\bf{\Theta} ,{\bf{Z}}\right)}}\left[\log p\left(\bf{X},\bf{\Theta} ,{\bf{Z}}\right)-\log {q}_{\phi }\left(\bf{\Theta} ,{\bf{Z}}\right)\right].$$We compute a noisy, unbiased gradient estimate using Monte-Carlo estimates that rely on reparametrized gradients16. This reduces variance, stabilizing the optimization, and allows the use of stochastic optimization techniques such as Adam72.
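A minimal sketch of the feature-propagation step of the simplified graph convolution (function and variable names are hypothetical; the encoder architecture itself is described in Supplementary Note 1):

```python
import numpy as np
import scipy.sparse as sp

def sgc_features(X, A, n_hops=2):
    """Propagated features [X, SX, ..., S^l X] for the simplified graph
    convolution, with S the symmetrically normalized adjacency."""
    A_tilde = A + sp.eye(A.shape[0])            # add self-loops: A + I
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt       # D^-1/2 (A + I) D^-1/2
    feats, cur = [X], X
    for _ in range(n_hops):
        cur = S @ cur                           # one further hop of smoothing
        feats.append(cur)
    return np.concatenate(feats, axis=1)        # input to the encoder NN
```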
We provide more details of the training procedure and the simplified graph convolutional network in Supplementary Note 1.

Outputs and post-processing

The output topic proportions znk and gene module scores wkg are taken to be the means of the variational posteriors q(znk|X, A) and q(wkg), respectively. We further post-process the gene module scores by downweighting lowly expressed genes. The new gene module score is given by$${w}_{{kg}{\rm{\_}}{new}}={w}_{{kg}}-\log \left({r}_{g}+\epsilon \right)+\log \left({r}_{g}\right),$$where rg is the mean of the variational posterior q(rg). We set \(\epsilon\) to the tenth quantile of rg.

Binarization of topics

Each cell was assigned to the topic with the largest value in its topic proportion.

Method comparison

We compared STAMP with several existing methods that can identify topics and their corresponding gene modules. We limited the competing methods to those whose latent topics are non-negative, thereby omitting methods such as PCA and related extensions such as SpatialPCA and MEFISTO12,26. We therefore compared STAMP to NMF, the linearly decoded variational autoencoder (LDVAE), LDA, non-negative spatial factorization hybrid (NSFH) and SpiceMix.

NMF

NMF factorizes a given gene expression matrix into two non-negative matrices, which can be interpreted as sets of underlying 'metagenes' and 'metasamples'. We ran NMF from the scikit-learn package with the Kullback–Leibler divergence loss and set max_iter to 1,000 to avoid convergence errors. NMF with the Kullback–Leibler loss is equivalent to NMF under a Poisson likelihood.

LDA

LDA is a generative statistical model from the family of topic models, originally developed for text count data. Compared with NMF, it places a Dirichlet prior on the generative process. We ran LDA from the scikit-learn package with its default parameters.

LDVAE

LDVAE is a deep latent factor model that replaces the neural network decoder of scVI73 with a linear decoder. We ran LDVAE with the default parameters, except that we used a logistic-normal distribution for the latent topics to enforce positivity.

NSFH

Non-negative spatial factorization is a spatially aware probabilistic dimension reduction model based on transformed Gaussian processes and NMF. We ran NSFH with the default parameters, assigning half of the factors as spatial factors and the other half as nonspatial factors, as suggested. We used 3,000 inducing points.

SpiceMix

SpiceMix is a spatially aware probabilistic dimension reduction model that uses NMF and a hidden Markov random field to model spatial dependencies. We ran SpiceMix with the default parameters, except that we used k-means as the initialization scheme, as we failed to obtain the right number of topics with the default initialization scheme.

DeepST

DeepST is a tool that aligns and integrates multiple spatial transcriptomics datasets. It uses data augmentation for preprocessing, graph neural networks to incorporate spatial information and a denoising autoencoder for integration. We ran DeepST with its default parameters.

GraphST

GraphST is a tool that aligns and integrates multiple spatial transcriptomics datasets. It uses graph neural networks with a contrastive loss to incorporate spatial information. We ran GraphST with its default parameters.

PRECAST

PRECAST is a probabilistic algorithm that aligns and integrates multiple spatial datasets.
PRECAST combines factor analysis with an intrinsic autoregressive component to integrate data and incorporate spatial information. We ran PRECAST with its default parameters.

STAligner

STAligner is a tool that aligns and integrates multiple spatial transcriptomics datasets. It uses graph attention neural networks to incorporate spatial information and a triplet loss for integration. We ran STAligner with its default parameters.

Evaluation metrics

To evaluate the gene modules found, we used two metrics that are popular in topic modeling. The first is module coherence, which we measure with normalized pointwise mutual information (NPMI). NPMI quantifies the co-occurrence of genes, and module coherence is calculated as the mean NPMI over the 190 pairs formed by the top 20 ranked genes of each topic's gene module.$${\mathrm{Module}}\; {\mathrm{coherence}}=\frac{1}{K}\mathop{\sum }\limits_{k=1}^{K}\frac{1}{190}\mathop{\sum }\limits_{i=1}^{20}\mathop{\sum }\limits_{j=i+1}^{20}\frac{{\log }_{2}\frac{P({g}_{i},{g}_{j})}{P\left({g}_{i}\right)P({g}_{j})}}{-{\log }_{2}P\left({g}_{i},{g}_{j}\right)},$$where P(gi, gj) is the joint probability of two genes occurring in the same cell, P(gi) is the probability of observing gene i and K is the number of topics. We use the observed counts of each gene to estimate these probabilities. To prevent housekeeping or background genes from dominating the metric, we only consider a gene to be present in a cell if its expression falls within the top 25th percentile. The unnormalized version of this metric has been used in single-cell studies74,75.

The second is module diversity, which measures how unique the gene modules are using rank-biased overlap (RBO)30. RBO compares two gene modules Mi and Mj of equal size and returns a number between 0 and 1, where 1 means the two gene modules are identical and 0 means they are completely disjoint. We calculate RBO between each pair of gene modules using the top 20 ranked genes. The module diversity score is then the mean over topics of the minimum (1 − RBO) per topic, so that more diverse modules obtain a higher score.$${\mathrm{Module}}\; {\mathrm{diversity}}=\frac{1}{K}\mathop{\sum }\limits_{i=1}^{K}\mathop{\min }\limits_{j\ne i}\left(1-{\mathrm{RBO}}\left({M}_{i},{M}_{j}\right)\right)$$

To evaluate the batch-correction capabilities of the different methods, we used the scIB metrics, computed with the scib-metrics package from https://github.com/YosefLab/scib-metrics.
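As an illustration of the module coherence metric, a minimal NumPy sketch under our reading of the presence rule (a gene is 'present' in a cell when its count exceeds that gene's 75th-percentile value; the exact thresholding in STAMP may differ):

```python
import numpy as np

def module_coherence(X, modules, presence_q=0.75):
    """NPMI-based module coherence. X: raw counts (cells x genes);
    modules: list of index arrays, the top 20 genes of each topic."""
    # Gene-wise presence threshold ("top 25th percentile" of expression)
    present = X > np.quantile(X, presence_q, axis=0)
    p = present.mean(axis=0)                            # P(g_i)
    topic_scores = []
    for genes in modules:                               # 20 genes -> 190 pairs
        npmi = []
        for a in range(len(genes)):
            for b in range(a + 1, len(genes)):
                i, j = genes[a], genes[b]
                p_ij = (present[:, i] & present[:, j]).mean()
                if p_ij == 0.0:
                    npmi.append(-1.0)                   # non-co-occurring pair
                    continue
                npmi.append(np.log2(p_ij / (p[i] * p[j])) / -np.log2(p_ij))
        topic_scores.append(np.mean(npmi))              # mean over the 190 pairs
    return float(np.mean(topic_scores))                 # mean over the K topics
```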
Data resources and preprocessing

Spatial adjacency graphs were built with the function squidpy.pp.spatial_neighbors, using six nearest neighbors (one hexagonal ring) for structured data (10x Genomics Visium) or a number of neighbors equal to 1/1,000 of the dataset size otherwise. Highly variable genes were selected using Scanpy's highly_variable_genes function with flavor = 'seurat_v3', which follows Seurat's pipeline.

Mouse hippocampus data, Slide-seq V2

We obtained the Slide-seq V2 data from the SeuratData package (https://github.com/satijalab/seurat-data) and converted it to an h5ad object for further analysis. We filtered out genes expressed in less than 1% of cells and cells with fewer than 100 genes. We then selected 6,000 highly variable genes.

Human non-small cell lung cancer data, CosMx SMI

We obtained the processed Giotto76 object from https://nanostring.com/products/cosmx-spatial-molecular-imager/nsclc-ffpe-dataset and transformed the data into an h5ad object. We filtered out genes expressed in less than 1% of cells and cells with fewer than 50 counts. We then selected 600 highly variable genes.

Mouse brain anterior and posterior data, 10x Genomics Visium

We obtained the 10x Genomics Visium mouse brain data from the 10x Genomics Data Repository (https://www.10xgenomics.com/resources/datasets). The two files are Mouse Brain Serial Section 2 (sagittal–anterior) and Mouse Brain Serial Section 2 (sagittal–posterior). We filtered out genes expressed in less than 3% of cells or with a mean nonzero expression of less than 1. We then selected 2,000 highly variable genes.

DLPFC, 10x Genomics Visium

The dataset of 12 human DLPFC tissue sections was obtained from https://github.com/LieberInstitute/spatialDLPFC. We filtered out genes expressed in less than 3% of cells. We then selected 4,000 highly variable genes.

Mouse olfactory bulb data, Stereo-seq/Slide-seq V2/10x Genomics Visium

We obtained the Stereo-seq data from https://db.cngb.org/stomics/mosta/download/. The Slide-seq V2 data were obtained from https://singlecell.broadinstitute.org/single_cell/study/SCP815/highly-sensitive-spatial-transcriptomics-at-near-cellular-resolution-with-slide-seqv2#study-download. The 10x Genomics Visium data were obtained from https://www.10xgenomics.com/datasets/adult-mouse-olfactory-bulb-1-standard. We first filtered out cells with fewer than 50 genes, and filtered out genes expressed in less than 3%, 1% and 1% of cells for 10x Genomics Visium, Slide-seq V2 and Stereo-seq, respectively. We concatenated the data and then selected a total of 6,000 highly variable genes.

Mouse embryo data, Stereo-seq

We obtained the 50-binned mouse embryo file (Mouse_embryo_all_stage.h5ad) from https://db.cngb.org/stomics/mosta/download/, which includes time points from E9.5 to E16.5. We replaced the time point E10.5 with E10.5_E2S1.MOSTA.h5ad. We first filtered out cells with fewer than 50 genes and genes expressed in less than 1% of cells at each time point. We next selected 6,000 highly variable genes, which we further filtered down to 2,000 spatially variable genes using Moran's I.

Over-representation and gene set enrichment analysis of gene modules

We performed over-representation analysis with the top 20 genes of each topic to identify the cell types associated with each topic, using the enrich function from gseapy77 with our gene module scores as input. For the human CosMx SMI lung data, we used the scEnrichment function from DISCOtoolkit, available at https://github.com/JinmiaoChenLab/DISCOtoolkit_py. For the Stereo-seq mouse embryo data, we used the gsea function from gseapy to conduct GSEA with our gene module scores as input, with the Gene Ontology Biological Process gene sets from MSigDB78 as the input gene sets.

Differential gene expression analysis of gene modules

We performed differential expression analysis to identify marker genes of each topic.
We used the Gamma-Poisson generalized linear models provided by glmGamPoi79 to fit coefficients for each topic, followed by a quasi-likelihood ratio test with empirical Bayesian shrinkage to identify differentially expressed genes.

Computation server employed

All analyses were conducted on a server equipped with an Intel Core i7-8665U CPU and an NVIDIA Titan V GPU with 12 GB of memory.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
