Detecting anomalous anatomic regions in spatial transcriptomics with STANDS

Anomalous tissue domain detection
As illustrated in the C1 part of Fig. 1b, module I consists of a generator and a discriminator. The generator itself is composed of four subcomponents: an encoder, a decoder, a transformer fusion (TF) block, and a memory bank. The encoder employs a GAT and a ResNet-GAT hybrid network to generate spot embeddings based on spatial gene expression data and the associated histology image, respectively:$${z}_{i}^{g}=\, {f}_{{GAT}}\left({x}_{i}^{g},\, G\,,\,{{{\bf{W}}}}_{1}\right),$$
(1)
$${z}_{i}^{p}=\,{f}_{{rn}-{GAT}}\left({x}_{i}^{p},\, G\,,\, {{{\bf{W}}}}_{2}\right),$$
(2)
where \({x}_{i}^{g}\) denotes the gene expression vector of spot \(i\), \({x}_{i}^{p}\) its segmented histology image patch, \(G=\left(V,E\right)\) the graph representation of all spatial spots, and \({z}_{i}^{g}\) and \({z}_{i}^{p}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{d}\) the transcriptomic and image patch embeddings for spot \(i\), respectively. These embeddings are then fused using a TF block:$${z}_{i}={TF}\left(\left[{z}_{i}^{g}{{\rm{||}}}{z}_{i}^{p}\right],{{{\bf{W}}}}_{3}\right)\in {{\mathbb{R}}}^{2d}.$$
(3)
The memory bank is essentially an embedding queue \({{\bf{Q}}}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{{N}_{{mem}}{{\boldsymbol{\times }}}2d}\) filled with \(z\), where \({N}_{{mem}}\) denotes the number of in-memory embeddings. It provides an attention-based means to reconstruct \(z\) as \(\widetilde{z}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{2d}\):$${\widetilde{z}}_{i}={{{\bf{Q}}}}^{T}{{\rm{softmax}}}\left(\frac{{{\bf{Q}}}{z}_{i}}{\tau }\right)=\left[{\widetilde{z}}_{i}^{g}\parallel {\widetilde{z}}_{i}^{p}\right],$$
(4)
where \(\tau\) is a temperature hyperparameter. \({{\bf{Q}}}\) is continuously updated during training by enqueuing recent \(\widetilde{z}\) and dequeuing the oldest to maintain a balance between preserving previously learnt features and adapting to new spots, thereby mitigating the mode collapse risk. The decoder consists of a Multi-Layer Perceptron (MLP) network and a ResNet-based deconvolution network for reconstructing the gene expression vector \({\hat{x}}_{i}^{g}\) and image patch \({\hat{x}}_{i}^{p}\) from their respective \({\widetilde{z}}_{i}^{g}\) and \({\widetilde{z}}_{i}^{p}\). The discriminator \(D\) comprises an encoder, similar to the generator’s encoder, and an MLP-based classifier. \(D\) is trained to distinguish between \(x=({x}^{g},\,{x}^{p})\) and \(\hat{x}=({\hat{x}}^{g},{\hat{x}}^{p})\). The total loss functions for the generator (\({{{\mathcal{L}}}}_{{Gen}}\)) and the discriminator (\({{{\mathcal{L}}}}_{D}\)) are defined as:$${{{\mathcal{L}}}}_{{Gen}}=\alpha {{{\mathcal{L}}}}_{{rec}}+\beta {{{\mathcal{L}}}}_{{adv}}=\alpha {\mathbb{E}}\left[\left(1-\gamma \right){\|{x}^{g}-{\hat{x}}^{g}\|}_{1}+\gamma {\|{x}^{p}-{\hat{x}}^{p}\|}_{1}\right]-\beta {\mathbb{E}}\left[D\left(\hat{x}\right)\right],$$
(5)
$${{{\mathcal{L}}}}_{D}={\mathbb{E}}\left[D\left(\hat{x}\right)\right]-{\mathbb{E}}\left[D\left(x\right)\right]+\lambda {\mathbb{E}}\left[{\left({\|\nabla D\left(\xi \right)\|}_{2}-1\right)}^{2}\right],$$
(6)
where \(\xi =\epsilon \hat{x}+\left(1-\epsilon \right)x,\,\epsilon \in \left({\mathrm{0,1}}\right)\). Here, \({{{\mathcal{L}}}}_{{rec}}\) denotes the data reconstruction loss, while \({{{\mathcal{L}}}}_{{adv}}\) denotes the adversarial loss. \(\alpha\), \(\beta\), and \(\lambda \ge 0\) represent the weights of each loss term, and \(\gamma \in \left[{\mathrm{0,1}}\right]\) controls the relative importance of the gene expression versus imagery data. \(D\left({\hat{x}}_{i}\right){{\boldsymbol{\in }}}{{\mathbb{R}}}^{h}\) is the discriminator’s output for \({\hat{x}}_{i}\), and \({\mathbb{E}}\left[D\left({\hat{x}}_{i}\right)\right]\in \left[0,\,1\right]\) represents the probability that \({\hat{x}}_{i}\) is classified as real by \(D\). Additionally, a gradient penalty term applied to \(\xi\) ensures the Lipschitz continuity of the discriminator and is critical for maintaining the stability of the adversarial training process (ref. 39). When only transcriptomic data (scRNA-seq or ST) is available for referencing, GAN module I undergoes specific modifications. Specifically, in the case of cross-referencing scRNA-seq data, the GAT-based encoder of the generator is replaced by a two-layer MLP to generate \({z}_{i}^{g}\) from \({x}_{i}^{g}\):$${z}_{i}^{g}={f}_{{MLP}}\left({x}_{i}^{g},{{{\bf{W}}}}_{1}^{*}\right)\in {{\mathbb{R}}}^{d}.$$
(7)
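As a concrete illustration, the MLP replacement in Eq. (7) is just a feed-forward map from the gene expression vector to a \(d\)-dimensional embedding. The following is a minimal NumPy sketch; the toy layer widths, the ReLU activation, and the bias terms are illustrative assumptions (the paper specifies a 3000-512-256 architecture but does not name the nonlinearity here):

```python
import numpy as np

def mlp_encoder(x_g, W1, b1, W2, b2):
    # two-layer MLP standing in for the GAT encoder (Eq. 7): x_i^g -> z_i^g
    h = np.maximum(0.0, x_g @ W1 + b1)   # hidden layer (ReLU assumed)
    return h @ W2 + b2                   # d-dimensional embedding z_i^g

rng = np.random.default_rng(0)
n_genes, hidden, d = 30, 8, 4            # toy stand-ins for 3000-512-256
x = rng.normal(size=n_genes)             # one spot's gene expression vector
z = mlp_encoder(x,
                rng.normal(size=(n_genes, hidden)), np.zeros(hidden),
                rng.normal(size=(hidden, d)), np.zeros(d))
```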
Similarly, the GAT-based encoder of the discriminator is replaced with a two-layer MLP. Moreover, without the image modality, the ResNet-based image encoder/decoder and the TF block are omitted, so that \({z}_{i}={z}_{i}^{g}\) and the memory bank \({{\bf{Q}}}\) has a size of \({{\mathbb{R}}}^{{N}_{{mem}}{{\boldsymbol{\times }}}d}\). All other components remain unchanged. Upon completion of training, STANDS is utilized to reconstruct spots in the target datasets. The reconstruction fidelity for a given spot \(j\) is quantified using an anomaly score (\({{{\mathcal{d}}}}_{j}\)), computed as the cosine dissimilarity between \(D\left({x}_{j}\right)\) and \(D\left({\hat{x}}_{j}\right)\):$${{{\mathcal{d}}}}_{j} \,=\, 1-\frac{{D\left({x}_{j}\right)}^{T}D\left({\hat{x}}_{j}\right)}{\|D\left({x}_{j}\right)\|\|D\left({\hat{x}}_{j}\right)\|}.$$
(8)
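The memory-bank reconstruction (Eq. (4)) and the anomaly score (Eq. (8)) each reduce to a few lines of linear algebra. A minimal NumPy sketch under toy dimensions (the memory size, embedding size, and temperature are arbitrary here, not the trained values):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def memory_reconstruct(Q, z, tau=1.0):
    # attention-based reconstruction over the memory bank (Eq. 4):
    # z_tilde = Q^T softmax(Q z / tau)
    attn = softmax(Q @ z / tau)          # (N_mem,) attention weights
    return Q.T @ attn                    # (2d,) reconstructed embedding

def anomaly_score(Dx, Dx_hat):
    # cosine dissimilarity between discriminator outputs (Eq. 8)
    cos = (Dx @ Dx_hat) / (np.linalg.norm(Dx) * np.linalg.norm(Dx_hat))
    return 1.0 - cos

rng = np.random.default_rng(1)
Q = rng.normal(size=(16, 6))             # N_mem = 16 embeddings of size 2d = 6
z = rng.normal(size=6)
z_tilde = memory_reconstruct(Q, z)       # lies in the span of the memory bank
```

Because the reconstruction is a convex combination of memorized normal embeddings, a spot that deviates from the reference distribution cannot be reconstructed faithfully, which is what the score in Eq. (8) measures.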
A higher value of \({{{\mathcal{d}}}}_{j}\) indicates a less accurate reconstruction, implying spot \(j\) is more likely to be an anomaly. As such, we model the anomaly scores’ distribution as a univariate Gaussian Mixture Model (GMM) with two components: one for anomalous spots (component 1) and the other for normal ones (component 2). We specify the prior for anomaly abundance as a beta distribution and the priors for the mean and variance of the two Gaussian components as a Normal Inverse Chi-squared (NIX) distribution. Utilizing the Maximum A Posteriori (MAP)-EM algorithm, we infer the parameters for both Gaussian components and then assign spots into either normal or anomalous groups based on their probabilities within each component. Specifically, let \(\Theta=\left\{\pi,{\mu }_{k},{\sigma }_{k}^{2},\forall k\in \left\{{\mathrm{1,2}}\right\}\right\}\) represent the GMM parameters, where \(\pi \in \left[{\mathrm{0,1}}\right]\) represents the proportion of anomalies, and \({\mu }_{k},{\sigma }_{k}^{2}\) represent the mean and variance for the \(k\)-th component, respectively, with the constraint that \({\mu }_{1} > {\mu }_{2}\). Then, the probability density function of \({{{\mathcal{d}}}}_{i}\) can be formulated as:$${{\rm{P}}}\left({{{\mathcal{d}}}}_{i}|\varTheta \right) \,=\, \pi {{\mathcal{N}}}\left({{{\mathcal{d}}}}_{i}|{\mu }_{1},{\sigma }_{1}^{2}\right)+\left(1-\pi \right){{\mathcal{N}}}\left({{{\mathcal{d}}}}_{i}|{\mu }_{2},{\sigma }_{2}^{2}\right),$$
(9)
$$\pi \sim {{\rm{Beta}}}\left(\pi |a,b\right),$$
(10)
$${\mu }_{k},{\sigma }_{k}^{2} \sim {{\rm{NIX}}}\left({\mu }_{k},{\sigma }_{k}^{2}|{m}_{0},{\kappa }_{0},{s}_{0}^{2},{\nu }_{0}\right).$$
(11)
Parameters for the priors in the GMM are empirically set based on the reference dataset’s anomaly scores \({\delta }_{i},\forall i\in \left[1,\, {N}_{{ref}}\right]\):$${m}_{0}=\frac{{\sum }_{i=1}^{{N}_{{ref}}}\, {\delta }_{i}}{{N}_{{ref}}},\,{\kappa }_{0}=0.01,\, {\nu }_{0}=3,\, {s}_{0}^{2}=\frac{{\sum }_{i=1}^{{N}_{{ref}}}{\left({\delta }_{i}-{m}_{0}\right)}^{2}}{{N}_{{ref}}}.$$
(12)
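The empirical prior settings of Eq. (12) and the evaluation of the two-component mixture of Eq. (9) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the E-step responsibilities shown are the standard mixture posteriors, the square in \(s_0^2\) is the usual sample variance, and the toy scores and parameter values are made up:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def empirical_priors(delta):
    # empirical NIX hyperparameters from reference anomaly scores (Eq. 12)
    m0 = delta.mean()
    s0_sq = ((delta - m0) ** 2).mean()   # sample variance
    return m0, 0.01, 3, s0_sq            # m0, kappa0, nu0, s0^2

def responsibilities(d, pi, mu, var):
    # E-step posteriors for the two-component mixture of Eq. (9):
    # component 1 = anomalous, component 2 = normal
    w = np.stack([pi * gaussian_pdf(d, mu[0], var[0]),
                  (1.0 - pi) * gaussian_pdf(d, mu[1], var[1])])
    return w / w.sum(axis=0)

scores = np.array([0.05, 0.07, 0.06, 0.90, 0.95])   # toy anomaly scores
m0, kappa0, nu0, s0_sq = empirical_priors(scores)
resp = responsibilities(scores, pi=0.3, mu=(0.9, 0.06), var=(0.01, 0.01))
# high-scoring spots receive near-unit posterior for the anomalous component
```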
The values of \(a\) and \(b\) can be adjusted if prior knowledge about anomaly abundance is available. The complete-data log likelihood, augmented with the log-priors and denoted as \({{\ell}}_{{{\mathcal{c}}}}\left(\Theta \right)\), is expressed as:$${{\ell}}_{{{\mathcal{c}}}}\left(\Theta \right)= \, \mathop {\sum }\limits_{i}\left[{\mathbb{I}}\left({z}_{i}=1\right)\left(\log \pi+\log {{\mathcal{N}}}\left({{{\mathcal{d}}}}_{i}|{\mu }_{1},{\sigma }_{1}^{2}\right)\right)\right. \\ +\left.{\mathbb{I}}\left({z}_{i}=2\right)\left(\log \left(1-\pi \right)+\log {{\mathcal{N}}}\left({{{\mathcal{d}}}}_{i}|{\mu }_{2},{\sigma }_{2}^{2}\right)\right)\right]\\ +\log {{\rm{Beta}}}\left(\pi |a,b\right)+\mathop {\sum }\limits_{k=1}^{2}\log {{\rm{NIX}}}\left({\mu }_{k},{\sigma }_{k}^{2}|{m}_{0},{\kappa }_{0},{s}_{0}^{2},{\nu }_{0}\right),$$
(14)
where \({z}_{i}\) denotes the component membership of spot \(i\). In the \(t\)-th iteration of the E-step, the expected sufficient statistics \({\bar{z}}_{i}^{(t)}\) are derived from \({\Theta }^{(t-1)}\). In the subsequent M-step, \({\Theta }^{(t-1)}\) is updated to \({\Theta }^{(t)}\) by maximizing the auxiliary function \(Q\left(\Theta,{\Theta }^{(t-1)}\right)={\mathbb{E}}\left[{{\ell}}_{{{\mathcal{c}}}}\left(\Theta \right)|{{\mathcal{D}}},{\Theta }^{(t-1)}\right]\). Refer to Supplementary Note 1.4 for details about the model inference.
Multimodal learning of spatial gene expression data and histology image
To effectively integrate and harness spatial gene expression data and the associated histology images, STANDS generates spot embeddings from both data types, which are subsequently fused into multimodal embeddings for each spot using a TF block. A histology image is first segmented into patches centered around each spatial spot, adhering to the methodology outlined by Pang et al. (ref. 40). The image patch for spot \(i\), represented as \({x}_{i}^{p}\in {{\mathbb{R}}}^{3\times W\times H}\), is processed through a pre-trained ResNet to yield initial embeddings \({\bar{x}}_{i}^{p}\in {{\mathbb{R}}}^{v}\), where 3 indicates the number of channels (RGB), and \(W\) and \(H\) denote the patch width and height, respectively. Meanwhile, we convert the locations of spatial spots into an undirected neighborhood graph \(G=\left(V,E\right)\) with a pre-defined neighbor number \(k\), where \(V\) and \(E\) represent the spot and edge sets, respectively. In our implementation, \(k\) is set to 6. The adjacency matrix \({{\bf{A}}}\in {{\mathbb{R}}}^{{N}_{{spot}}\times {N}_{{spot}}}\) of \(G\) is defined as:$${{{\bf{A}}}}_{i,j}=\left\{\begin{array}{c}1,\,{{\rm{if}}}\; j\in {N}_{k}\left(i\right)\\ 0,\, {{\rm{otherwise}}}\end{array}\right.,$$
(15)
$$\widetilde{{{\bf{A}}}}={{\bf{I}}}+{{\bf{A}}},$$
(16)
where \({N}_{k}\left(i\right)\) represents the set of proximity-based \(k\)-nearest neighbors of spot \(i\). Here, \(\widetilde{{{\bf{A}}}}\) extends \({{\bf{A}}}\) by adding self-loops and is utilized by GATv2 (ref. 41), a two-layer graph attention auto-encoder, in the generation of transcriptomic and image embeddings. For instance, we calculate an attention score \({{{\rm{\alpha }}}}_{i,j}^{\left(l\right)}\) between a given spot \(i\) and its neighbor \(j\) on the \(l\)-th encoder layer using the formula:$${{{\rm{\alpha }}}}_{i,j}^{\left(l\right)}=\frac{\exp \left({{{\bf{W}}}}_{{att}}^{\left(l\right)}{{\rm{LeakyReLU}}}\, \left({{{\bf{W}}}}^{\left(l\right)}\left[{z}_{i}^{\left(l-1\right)}{{\rm{||}}}{z}_{j}^{\left(l-1\right)}\right]\right)\right)}{{\sum }_{{j}^{{\prime} }\in N\left(i\right)}\exp \left({{{\bf{W}}}}_{{att}}^{\left(l\right)}{{\rm{LeakyReLU}}}\,\left({{{\bf{W}}}}^{\left(l\right)}\left[{z}_{i}^{\left(l-1\right)}{{\rm{||}}}{z}_{{j}^{{\prime} }}^{\left(l-1\right)}\right]\right)\right)},$$
(17)
where, \({z}_{i}^{\left(l\right)}\in {{\mathbb{R}}}^{d}\) is spot \(i\)’s embedding generated by the \(l\)-th encoder layer. The matrix \({{{\bf{Z}}}}^{\left(l\right)}\in {{\mathbb{R}}}^{{N}_{{spot}}\times d}\), which compiles all \({z}_{i}^{\left(l\right)},\forall i\in \left[1,{N}_{{spot}}\right]\), is formulated as:$${{{\bf{Z}}}}^{\left(l\right)}={{\rm{\sigma }}}\left({\widetilde{{{\bf{A}}}}}^{T}{{{\bf{W}}}}_{a}^{\left(l\right)}{{{\bf{Z}}}}^{\left(l-1\right)}{\left({{{\bf{W}}}}^{\left(l\right)}\right)}^{T}\right),\, l\in \left\{1,\, 2\right\},$$
(18)
where, \({{{\bf{W}}}}_{a}^{\left(l\right)}\) stores all \({{{\rm{\alpha }}}}_{i,j}^{\left(l\right)}\), and \({{\rm{\sigma }}}\) denotes a nonlinear activation function. Note that spot \(i\)’s initial embedding for the \(0\)-th layer, \({z}_{i}^{\left(0\right)}\), depends on the data type: \({x}_{i}^{g}\) for transcriptomic data and \({\bar{x}}_{i}^{p}\) for histology image. The transcriptomic and image embeddings outputted by the GATv2 encoder, denoted as \({{{\bf{Z}}}}^{g}\in {{\mathbb{R}}}^{{N}_{{spot}}\times d}\) and \({{{\bf{Z}}}}^{p}\in {{\mathbb{R}}}^{{N}_{{spot}}\times d}\) respectively, are concatenated into \({{{\bf{Z}}}}^{{concat}}\in {{\mathbb{R}}}^{{N}_{{spot}}\times 2d}\). This concatenated matrix serves as the input to a multi-head transformer block for data fusion. To elaborate, let \(m\) denote the number of attention heads such that \(2d\) is divisible by \(m\). \({{{\bf{Z}}}}^{{concat}}\) is split into \(m\) sub-embeddings \({{{\bf{Z}}}}_{1},\cdots,{{{\bf{Z}}}}_{m}\in {{\mathbb{R}}}^{{N}_{{spot}}\times 2d/m}\) followed by their mappings to the corresponding query, key and value matrices (\({{{\bf{Q}}}}_{t},{{{\bf{K}}}}_{t},{{{\bf{V}}}}_{t}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{{N}_{{spot}}\times d}\)) as:$${{{\bf{Q}}}}_{t} \,=\,{{{{\bf{Z}}}}_{t}{{\bf{W}}}}_{t}^{Q},\, {{{\bf{K}}}}_{t}=\,{{{{\bf{Z}}}}_{t}{{\bf{W}}}}_{t}^{K},\, {{{\bf{V}}}}_{t} \,=\,{{{{\bf{Z}}}}_{t}{{\bf{W}}}}_{t}^{V},\, \forall t\in \left[1,\,m\right],$$
(19)
where, \({{{\bf{W}}}}_{t}^{Q},\,{{{\bf{W}}}}_{t}^{K},\,{{{\bf{W}}}}_{t}^{V}\in {{\mathbb{R}}}^{\left(2d/m\right)\times d}\) are trainable weight matrices. The output embeddings \({{{\bf{H}}}}_{t}\in {{\mathbb{R}}}^{{N}_{{spot}}\times d}\) from attention head \(t\) are calculated as:$${{{\bf{H}}}}_{t}={{\rm{softmax}}}\left(\frac{{{{\bf{Q}}}}_{t}{{{\bf{K}}}}_{t}^{T}}{\sqrt{d}}\right){{{\bf{V}}}}_{t},$$
(20)
which essentially is an enhanced representation of a subspace of \({{{\bf{Z}}}}^{{concat}}\). Finally, the output embeddings from all attention heads are fused into the final spot embeddings \({{\bf{Z}}}\in {{\mathbb{R}}}^{{N}_{{spot}}\times 2d}\) as:$${{\bf{Z}}}=\left[{{{\bf{H}}}}_{1}\parallel {{{\bf{H}}}}_{2}\parallel \cdots \parallel {{{\bf{H}}}}_{m}\right]{{{\bf{W}}}}^{o},$$
(21)
where \({{{\bf{W}}}}^{o}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{{md}\times 2d}\) represents the trainable weight matrix for fusing \({{{\bf{H}}}}_{t},\forall t\in \left[1,\, m\right]\).
Multi-sample ST data alignment
As illustrated in the C2 part of Fig. 1b, this task proceeds in two steps. Initially, each normal spot in the target datasets is paired with its most similar spot in the reference dataset, forming a “kin” pair indicative of shared biological contents. Subsequently, based on these “kin” pairs, STANDS learns a “style-divergence” matrix, which encodes the batch divergences between the target and reference datasets in its rows. This matrix allows the mapping of target datasets to the reference data space in a “style-transfer” manner.
In the first step, the generator of module II learns to reconstruct the embeddings of target spots using those of reference spots, while the discriminator learns to distinguish between the authentic and generated spots. Specifically, let \({{{\bf{Z}}}}_{T}\in {{\mathbb{R}}}^{{N}_{T}\times d}\) and \({{{\bf{Z}}}}_{R}\in {{\mathbb{R}}}^{{N}_{R}\times d}\) denote the transcriptomic embeddings of target and reference spots, respectively, where \({N}_{T}\) and \({N}_{R}\) denote the numbers of target and reference spots. \({{\bf{M}}}\in {{\mathbb{R}}}^{{N}_{T}\times {N}_{R}}\) denotes a trainable non-negative mapping matrix. The generator reconstructs \({\hat{{{\bf{Z}}}}}_{T}\) as:$${\hat{{{\bf{Z}}}}}_{T} \,=\, {{\rm{ReLU}}}\left({{\bf{M}}}{\odot}\bar{{{\bf{A}}}}\right){{{\bf{Z}}}}_{R},$$
(22)
$${\bar{{{\bf{A}}}}}_{i,j} \,=\left\{\begin{array}{c}1,\, {{\rm{if}}}\; j\in {N}_{k}\left(i\right)\\ 0,\, {{\rm{otherwise}}}\end{array}\right.,$$
(23)
where the ReLU function imposes a non-negative constraint on \({{\bf{M}}}\). \(\bar{{{\bf{A}}}}{{\boldsymbol{\in }}}{{\mathbb{R}}}^{{N}_{T}\times {N}_{R}}\) is a kNN adjacency matrix that integrates spatial neighborhood information into the calculation. \({N}_{k}\left(i\right)\) represents the set of spot \(i\)’s \(k\)-nearest neighboring reference spots. The loss functions of the generator and discriminator in module II are given by:$${{{\mathcal{L}}}}_{{{\rm{G}}}}=\alpha {\mathbb{E}}{\|{{{\bf{Z}}}}_{T}-{\hat{{{\bf{Z}}}}}_{T}\|}_{1}-\beta {\mathbb{E}}\left[D\left({\hat{{{\bf{Z}}}}}_{T}\right)\right],$$
(24)
$${{{\mathcal{L}}}}_{D}={\mathbb{E}}\left[D\left({\hat{{{\bf{Z}}}}}_{T}\right)\right]-{\mathbb{E}}\left[D\left({{{\bf{Z}}}}_{T}\right)\right]+\lambda {\mathbb{E}}\left[{\left({\|\nabla D\left({\widetilde{{{\bf{Z}}}}}_{T}\right)\|}_{2}-1\right)}^{2}\right],$$
(25)
where \({\widetilde{{{\bf{Z}}}}}_{T}=\epsilon {\hat{{{\bf{Z}}}}}_{T}+\left(1-\epsilon \right){{{\bf{Z}}}}_{T},\,\epsilon \in \left({\mathrm{0,1}}\right)\), and \(\alpha,\beta,\lambda \ge 0\) represent the weights of the loss terms. After training, the column index of the maximum value in the \(i\)-th row of \({{\bf{M}}}\) points to the reference spot that is “kin” to the \(i\)-th target spot. The two spots of a “kin” pair are presumed to share similar biological contents, so that the reference spot can be approximated by removing the “style-divergence” (batch variations) from the target spot. Therefore, in the second step, a “style”-transfer GAN (module III) is employed to learn the “style-divergences” between target and reference datasets as a matrix \({{\bf{S}}}\in {{\mathbb{R}}}^{{N}_{{batch}}\times d}\). Specifically, for each target spot \(i\), the encoder within the generator of module III maps the gene expression vector \({x}_{i}\) to a latent embedding \({z}_{i}\in {{\mathbb{R}}}^{d}\). This encoder and the one within module I share the same network architectures but are trained independently, with the former initialized using the latter’s trained weights. Here, \({\hat{z}}_{i}\) approximates the embedding of spot \(i\)’s “kin” reference spot \(j\) as follows:$${z}_{i} \,=\, {f}_{{GAT}}\left({x}_{i}^{g},{G}_{i},\,{{\bf{W}}}\right),$$
(26)
$${\hat{z}}_{i} \,=\, {z}_{i}-{{{\bf{S}}}}^{T}{b}_{i},$$
(27)
where \({b}_{i}\in {{\mathbb{R}}}^{{N}_{{batch}}}\) denotes spot \(i\)’s one-hot batch identity vector, and \({G}_{i}\) is the graph representation of the dataset containing spot \(i\). The generator’s decoder then reconstructs \({\hat{x}}_{i}^{g}\) from \({\hat{z}}_{i}\), while the discriminator of module III learns to distinguish between \({x}_{i}^{g}\) and \({\hat{x}}_{i}^{g}\). The loss functions for the generator and discriminator are:$${{{\mathcal{L}}}}_{G} \,=\, \alpha {\mathbb{E}}{\|{x}_{R}^{g}-{\hat{x}}_{R}^{g}\|}_{1}-\beta {\mathbb{E}}\left[D\left({\hat{x}}_{R}^{g}\right)\right],$$
(28)
$${{{\mathcal{L}}}}_{D} \,=\,{\mathbb{E}}\left[D\left({\hat{x}}_{R}^{g}\right)\right]-{\mathbb{E}}\left[D\left({x}_{R}^{g}\right)\right]+\lambda {\mathbb{E}}\left[{\left({\|\nabla D\left({\widetilde{x}}_{R}^{g}\right)\|}_{2}-1\right)}^{2}\right],$$
(29)
where \(\alpha\), \(\beta\), \(\lambda\) and \({\widetilde{x}}_{R}\) mirror their counterparts in module II. By passing through the trained generator of module III, spots across multiple target datasets can be collectively aligned in the common reference data space.
Subtyping anomalous tissue domains across multiple datasets
Initially, identified anomalous spots across multiple target datasets are aligned by module III in the common reference space, effectively reducing the confounding batch variations in anomaly subtyping. Then, as illustrated in the C3 part of Fig. 1b, the embedding and reconstruction residual of each aligned anomalous spot are fused into a comprehensive embedding that is informative on anomaly subtypes. Specifically, for a given anomalous spot \(i\), let \({x}_{i}^{g}\) and \({x}_{i}^{p}\) denote its aligned gene expression and image patch vectors, respectively; \({\hat{x}}_{i}^{g}\) and \({\hat{x}}_{i}^{p}\) the reconstructed vectors from \({x}_{i}^{g}\) and \({x}_{i}^{p}\); \({r}_{i}^{g}\) and \({r}_{i}^{p}\) the reconstruction residuals of the gene expression and image patch vectors; and \({z}_{i}\) and \({{{\rm{\zeta }}}}_{i}\) the module I-generated embeddings of \({x}_{i}\) and \({r}_{i}\), respectively. Then, we have:$${r}_{i} \,=\, \left[{r}_{i}^{g}{{\rm{||}}}{r}_{i}^{p}\right]=\left[\left({x}_{i}^{g}-{\hat{x}}_{i}^{g}\right){{\rm{||}}}\left({x}_{i}^{p}-{\hat{x}}_{i}^{p}\right)\right],$$
(30)
$${z}_{i} \,=\, \left[{f}_{{GAT}}\left({x}_{i}^{g},G,\,{{{\bf{W}}}}_{1}\right){{\rm{||}}}{f}_{{rn}-{GAT}}\left({x}_{i}^{p},G,\,{{{\bf{W}}}}_{2}\right)\right],$$
(31)
$${{{\rm{\zeta }}}}_{i} \,=\, \left[{f}_{{GAT}}\left({r}_{i}^{g},\, G,\,{{{\bf{W}}}}_{1}\right){{\rm{||}}}{f}_{{rn}-{GAT}}\left({r}_{i}^{p},\, G,\,{{{\bf{W}}}}_{2}\right)\right],$$
(32)
$${z}_{i}^{*} \,=\,{TF}\left(\left[{z}_{i}{{\rm{||}}}{{{\rm{\zeta }}}}_{i}\right],\,{{{\bf{W}}}}_{{tf}}\right),$$
(33)
where \({z}_{i}^{*}\) represents the fused embedding of anomaly \(i\). DEC (ref. 42), a discriminatively boosted clustering algorithm, groups anomalies into clusters based on their \({z}^{*}\). It applies a Cauchy kernel to \({z}_{i}^{*}\) to calculate the soft assignment score (\({q}_{i,j}\)) of anomaly \(i\) to a cluster \(j\) as:$${q}_{i,j} \,=\frac{{\left(1+\frac{{\|{z}_{i}^{*}-{\mu }_{j}\|}^{2}}{v}\right)}^{-1}}{{\sum }_{{j}^{{\prime} }}{\left(1+\frac{{\|{z}_{i}^{*}-{\mu }_{{j}^{{\prime} }}\|}^{2}}{v}\right)}^{-1}},$$
(34)
where, \({\mu }_{j}\) denotes the centroid of cluster \(j\), \(v\) the degree of freedom of the Cauchy kernel. The clustering loss function \({{\mathcal{L}}}\) is based on the KL-divergence between \(q\) and an auxiliary target distribution \(p\), defined as:$${p}_{i,j} \,=\frac{\frac{{q}_{i,j}^{2}}{{\sum }_{i}{q}_{i,j}}}{{\sum }_{j}\left(\frac{{q}_{i,j}^{2}}{{\sum }_{i}{q}_{i,j}}\right)},$$
(35)
$${{\mathcal{L}}}={\sum}_{i}{\sum}_{j}{p}_{i,j}\log \left(\frac{{p}_{i,j}}{{q}_{i,j}}\right).$$
(36)
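The DEC-style soft assignment, target distribution, and clustering loss (Eqs. (34)–(36)) can be sketched in a few vectorized lines. This is a minimal NumPy illustration with toy embeddings and centroids; in STANDS the embeddings \(z^{*}\) come from the trainable TF block and the centroids are updated during training:

```python
import numpy as np

def soft_assign(Z, mu, v=1.0):
    # Cauchy-kernel soft assignment q_ij (Eq. 34)
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / v) ** -1
    return q / q.sum(axis=1, keepdims=True)

def target_dist(q):
    # auxiliary target distribution p_ij (Eq. 35)
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(p, q):
    # clustering loss KL(p || q) (Eq. 36)
    return float((p * np.log(p / q)).sum())

rng = np.random.default_rng(2)
Z = rng.normal(size=(10, 3))             # toy fused anomaly embeddings z*
mu = Z[:2].copy()                        # two toy cluster centroids
q = soft_assign(Z, mu)
p = target_dist(q)
loss = kl_loss(p, q)                     # non-negative; zero iff p == q
```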
Essentially, anomalies with high-confidence assignments are overweighted in the distribution \(p\). In practice, the iterative updating of \({{{\bf{W}}}}_{{tf}}\) and \(\mu\), aiming to minimize \({{\mathcal{L}}}\), nudges \(q\) toward \(p\) and incrementally transforms harder-to-cluster embeddings \({z}^{*}\) into easier ones. This self-paced clustering continues until the changes in anomalies’ hard assignments fall below a threshold or a predetermined number of iterations is reached. The resultant hard cluster assignments of anomalous spots correspond to their subtype labels. The number of clusters is assumed to be known or automatically inferred as described in Supplementary Note 1.5.
Model architecture and training
Anomalous tissue domain detection
Here, GAN module I is first trained on the reference dataset and then applied to the target data, generating reconstruction errors as anomaly scores for each target spot. GAN module I comprises a generator and a discriminator. During the training of GAN module I, we set a mini-batch size of 128 and utilize the Adam optimizer with a learning rate of 3e-4.
The generator is further divided into an encoder, a memory bank, and a decoder. When using an ST dataset as reference, the encoder is a two-layer GAT with an architecture of 3000-512-256, with four 128-dimensional attention heads in the first layer and a single 256-dimensional attention head in the second layer. When histology data is available (e.g., for 10x Visium datasets), a pretrained ResNet-34 (ref. 43) is used to extract 256-dimensional visual features from 112 × 112-pixel image patches that are segmented from the histology image and centered around each spatial spot. These visual features are further encoded by another two-layer GAT, with the same architecture as used for encoding ST data, to capture the spatial relationships among neighboring patches.
The 256-dimensional image and gene expression embeddings are then fused using a TF block comprising three transformer encoder layers, each with four 128-dimensional attention heads, to output 512-dimensional fused embeddings. Note that the encoder branch for visual features and the TF block are omitted in the absence of a histology image. When cross-referencing an scRNA-seq dataset, a two-layer MLP network with an architecture of 3000-512-256 replaces the GAT in the encoder to generate gene expression embeddings at each spot. The batch (128) of embeddings output from the encoder is subsequently enqueued into the memory bank, which has a size of 512 × 512 for multimodal embeddings and 512 × 256 for single-modal embeddings, while an equal number of the oldest embeddings in the bank are dequeued. After memory bank-mediated embedding reconstruction, 256-dimensional single-modal gene expression embeddings are input to the decoder, while 512-dimensional multimodal embeddings are split into 256-dimensional image and gene expression embeddings before being fed into their respective decoders. The decoder for ST data is a two-layer MLP with an architecture of 256-512-3000, and the decoder for image data is a ResNet-34 decoder symmetric to the ResNet encoder, comprising transposed convolutional layers.
The discriminator, comprising an encoder and a four-layer MLP-based classifier, accepts pairs of original and reconstructed data. The architecture of its encoder mirrors that of the generator’s encoder, and the classifier has an architecture of 512-256(×3)-16 in the presence of image data or 256(×4)-16 otherwise. The discriminator is trained to maximize the L1-norm difference between the 16-dimensional output embeddings of the original and reconstructed data.
Multi-sample ST data alignment
Initially, plausible anomalous spots identified by GAN module I are excluded from the target datasets to minimize their confounding effects during alignment.
GAN module II’s generator processes gene expression embeddings of both reference and target spots generated by GAN module I’s encoder, training a non-negative mapping matrix to reconstruct the target embeddings from the reference embeddings. The discriminator is a four-layer MLP with an architecture of 512-256(×3)-16, aimed at maximizing the L1-norm difference between the original and reconstructed target embeddings. Once trained, the non-negative mapping matrix is utilized to identify kin pairs of reference and target spots, whose raw data are then input into GAN module III. The encoders and decoders of the generator within GAN module III and module I share the same architectures but are trained independently, with the former initialized with the latter’s trained weights. GAN module III trains a matrix \({{\bf{S}}}\in {{\mathbb{R}}}^{{n}_{b}\times 256}\) that encompasses \({n}_{b}\) style embeddings representing various batch effects. These style embeddings are subtracted from their corresponding target spots’ embeddings to map the target spots to the common embedded reference space, so as to transfer the target datasets’ styles to the reference dataset’s. For each target spot, the generator’s decoder uses its “style-transferred” embedding to reconstruct its kin reference spot, which is then paired with the original data as inputs to the discriminator, whose architecture mirrors that of the discriminator in GAN module I. Lastly, the training of this module adopts a batch size of 128 and the Adam optimizer with a learning rate of 3e-4.
Anomaly subtyping
Anomalous spots identified by GAN module I are aligned in the reference data space using the trained GAN module III. Then, the frozen encoder and decoder from GAN module I are used to generate post-alignment embeddings of identified anomalous spots and their reconstruction errors, respectively.
Using a specific encoder that mirrors the encoder of GAN module I’s generator but is trained independently, reconstruction errors are further converted into embeddings with the same dimensions as the spot embeddings. The spot and reconstruction error embeddings are fused into 128-dimensional embeddings using a trainable TF block consisting of three transformer layers, each with multiple 128-dimensional attention heads. Specifically, there are eight attention heads in a transformer layer when using multimodal data, or four heads otherwise. The outputs are subsequently subjected to self-paced discriminatively boosted clustering. The training process iterates between clustering and fused embedding generation until the changes in anomalies’ hard assignments fall below a threshold (0.001) or a predetermined number of iterations (2e4) is reached.
Data preprocessing
In this study, we follow the standard pipeline of data preprocessing provided by the Scanpy (ref. 44), SpatialDE (ref. 45) and GeneClust (ref. 46) packages. Specifically, mitochondrial and External RNA Controls Consortium (ERCC) spike-in genes are removed. Genes detected in fewer than 10 spots are excluded. We do not perform filtering on spatial spots, to maintain spatial data integrity. The gene expression count matrix is normalized by library size and then log-transformed. Finally, we select the top 3000 spatially variable genes (SVGs), identified using SpatialDE, as inputs to STANDS.
Evaluation metrics
Anomalous tissue domain detection
Spatial grouping discrepancy (SGD)
We propose the SGD, a novel metric to assess both the accuracy of labels and the consistency of spatial structures. Specifically, spatial locations are represented as nodes in an undirected graph. Normal spots are isolated, while anomalous spots are connected to their k-nearest anomalous neighbors. Note that in the anomaly detection results, spots incorrectly identified as anomalies (false positives) become connected and missed anomalies (false negatives) become isolated, which leads to a deviation from the local structures of the ground truth graph. Spots are divided into two regions: one includes true-positive plus false-positive (TP + FP) anomalies, and the other includes true-positive plus false-negative (TP + FN) anomalies. We perform a bootstrap sampling of \(m\) sets of spots from these two regions, generating a collection \(S=\left\{{s}_{i}:\left\{{s}_{i}^{\left(1\right)},{s}_{i}^{\left(2\right)}\right\},\forall i\in \left[1,m\right]\right\}\). Subsequently, both cluster coefficients and degrees for spots within \({s}_{i}\) are calculated as follows:$$c{c}_{i,j}^{\left(r,\, l\right)}=\frac{2{E}_{j}}{{k}_{j}\left({k}_{j}-1\right)},$$
(37)
$${d}_{j}^{\left(r,\, l\right)}={\sum}_{n}^{{N}_{k}\left(j\right)}{\mathbb{I}}\left({e}_{j,n}=1\right),$$
(38)
for all \(i\in \left[1,m\right]\), every spot \(j\) in \({s}_{i}^{\left(r\right)}\), region \(r\in \left\{1:{TP}+{FP},2:{TP}+{FN}\right\}\), and label type \(l\in \left\{1:{{\rm{ground\; truth}}},2:{{\rm{anomaly\; detection\; outcomes}}}\right\}\). Here, \(c{c}_{i,j}^{\left(r,l\right)}\) and \({d}_{j}^{\left(r,l\right)}\) represent the cluster coefficient and degree of spot \(j\) within region \(r\) from bootstrap sample \(i\), based on either the ground truth (\(l=1\)) or anomaly detection outcomes (\(l=2\)). \({k}_{j}\) denotes the number of neighbors connected to spot \(j\), \({E}_{j}\) the number of edges among these neighbors, \({e}_{j,n}\) the edge between spots \(j\) and \(n\), and \({N}_{k}\left(j\right)\) the set of k-nearest neighbors of spot \(j\). We adopt both degree and cluster coefficient metrics because they reflect the centrality and neighborhood connectivity of spots, respectively.
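For a symmetric 0/1 adjacency matrix without self-loops, the per-spot degree (Eq. (38)) and cluster coefficient (Eq. (37)) can be computed directly; a minimal NumPy sketch on a toy four-node graph (the triangle-counting identity via \(\mathrm{diag}(A^3)\) is a standard equivalence, not taken from the paper):

```python
import numpy as np

def degree_and_cc(A):
    # per-node degree (Eq. 38) and local cluster coefficient (Eq. 37)
    # from a symmetric 0/1 adjacency matrix without self-loops
    deg = A.sum(axis=1)
    # edges among each node's neighbors = triangles through the node
    triangles = np.diag(A @ A @ A) / 2.0
    with np.errstate(divide="ignore", invalid="ignore"):
        cc = np.where(deg > 1, 2.0 * triangles / (deg * (deg - 1)), 0.0)
    return deg, cc

# toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 0
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], float)
deg, cc = degree_and_cc(A)   # deg = [3, 2, 2, 1]; cc = [1/3, 1, 1, 0]
```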
Next, for any two bootstrap samples \(i\) and \(j\), we quantify the discrepancy in the distribution of their cluster coefficients or degrees using the Wasserstein distance, \(W({p}_{i,t},{p}_{j,t})\), defined as:$$W\left({p}_{i,t}^{\left(r\right)},\, {p}_{j,t}^{\left(r\right)}\right)={\inf }_{{{\rm{\gamma }}}\in \Pi \left({p}_{i,t}^{\left(r\right)},\, {p}_{j,t}^{\left(r\right)}\right)}{{\mathbb{E}}}_{\left(x,y\right) \sim {{\rm{\gamma }}}}\left[\|x-y\|\right],$$
(39)
$$W\left({p}_{i,t},\, {p}_{j,t}\right)={TPR}\times W\left({p}_{i,t}^{\left(1\right)},\, {p}_{j,t}^{\left(1\right)}\right)+\left(1-{TPR}\right)\times W\left({p}_{i,t}^{\left(2\right)},\, {p}_{j,t}^{\left(2\right)}\right),$$
(40)
$${TPR}=\frac{{TP}}{{TP}+{FN}},$$
(41)
for every \(t\in \left\{1:{{\rm{degree}}},\, 2:{{\rm{cluster\; coefficient}}}\right\}\) and \(r\in \left\{1:{TP}+{FP},\, 2:{TP}+{FN}\right\}\). Here, \({p}_{i,t}^{\left(r\right)}\) denotes the distribution of \(t\) for region \(r\) in the \(i\)-th bootstrap sample, and \(\Pi \left(p,q\right)\) represents the set of all joint distributions with marginals \(p\) and \(q\). \(\gamma\) denotes a valid transport plan between these distributions. To capture high-order moments of the distributional discrepancy, we apply a Gaussian-like kernel to the Wasserstein distance:$${{\mathcal{S}}}\left({p}_{i,t},\, {p}_{j,t}\right) \,=\, \exp \left(-\frac{W\left({p}_{i,t},\, {p}_{j,t}\right)}{\tau }\right),$$
(42)
where \(\tau\) is a positive temperature hyperparameter. The Moore-Aronszajn theorem guarantees that this symmetric, positive-definite kernel induces a unique Reproducing Kernel Hilbert Space (RKHS)47. Finally, we define the SGD score for metric \(t\) as the Maximum Mean Discrepancy (MMD) in this RKHS:$${{\rm{SGD}}}\left({p}_{{true},t}{{\rm{||}}}{p}_{{detect},t}\right)= \, {{\mathbb{E}}}_{{p}_{i,t},{p}_{{i}^{{\prime} },t} \sim {p}_{{true},t}}\left[{{\mathcal{S}}}\left({p}_{i,t},{p}_{{i}^{{\prime} },t}\right)\right]\\ +{{\mathbb{E}}}_{{p}_{j,t},{p}_{{j}^{{\prime} },t} \sim {p}_{{detect},t}}\left[{{\mathcal{S}}}\left({p}_{j,t},{p}_{{j}^{{\prime} },t}\right)\right] \\ -2{{\mathbb{E}}}_{{p}_{i,t} \sim {p}_{{true},t},{p}_{j,t} \sim {p}_{{detect},t}}\left[{{\mathcal{S}}}\left({p}_{i,t},{p}_{j,t}\right)\right],$$
(43)
where \({p}_{{true},t}\) and \({p}_{{detect},t}\) represent the sets of distributions for metric \(t\) (degree or cluster coefficient) derived from bootstrap samples under the ground truth and the anomaly detection outcomes, respectively. Based on SGD, we further propose multi-SGD to measure the spatial discrepancy between spatial clustering results and a ground truth involving multiple domain types (see the “Multi-type spatial grouping discrepancy” section below).
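The kernelized Wasserstein pipeline of Eqs. (39), (42), and (43) can be sketched as follows. The bootstrap samples and \(\tau\) are synthetic assumptions, and for brevity a single region is used, omitting the TPR weighting of Eq. (40); for equal-sized 1-D samples, the 1-Wasserstein distance reduces to the mean gap between sorted values:

```python
import numpy as np

# Synthetic bootstrap degree distributions under "ground truth" and
# "detection outcomes" (toy data, not from the study).
rng = np.random.default_rng(1)
true_boots = [rng.poisson(3.0, 50) for _ in range(8)]
detect_boots = [rng.poisson(4.0, 50) for _ in range(8)]
tau = 1.0  # assumed temperature hyperparameter

def w1(a, b):
    # 1-D Wasserstein distance for equal-sized samples (Eq. 39)
    return np.abs(np.sort(a) - np.sort(b)).mean()

def kernel(a, b):
    # Gaussian-like kernel on the Wasserstein distance (Eq. 42)
    return np.exp(-w1(a, b) / tau)

def mean_k(A, B):
    # average kernel value over all pairs from the two sets
    return np.mean([kernel(a, b) for a in A for b in B])

# Biased MMD estimate in the induced RKHS (Eq. 43)
sgd = (mean_k(true_boots, true_boots)
       + mean_k(detect_boots, detect_boots)
       - 2 * mean_k(true_boots, detect_boots))
```

Since the biased MMD estimator is a squared RKHS norm, `sgd` is nonnegative and grows as the two sets of bootstrap distributions diverge.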
Multi-sample ST data alignment
The performance of multi-sample alignment is assessed using multiple metrics calculated on reduced t-SNE embeddings of the aligned datasets. These metrics include the integration local inverse Simpson’s index (iLISI)33, BatchKL25, and ASW_batch48 for evaluating batch-mixing effects, and ASW_type48 for evaluating cross-batch domain (or spot) type alignment. Additionally, ARI is used to evaluate spatial clustering performed on the aligned datasets.
iLISI. This metric measures the effective number of batches present in the local neighborhoods of spots across aligned datasets by calculating a score that represents the degree of local batch mixing. The score ranges from 1 to \({N}_{{batch}}\), with a higher value indicating more effective batch mixing. To elaborate, a neighboring-spot probability matrix is calculated as: $${p}_{i,j}=\left\{\begin{array}{c}0,\quad{{\rm{if}}}\; j \, \notin \, {N}_{k}\left(i\right)\hfill \\ \frac{\exp \left(-\beta {\|{x}_{i}-{x}_{j}\|}_{2}^{2}\right)}{{\sum }_{j}\exp \left(-\beta {\|{x}_{i}-{x}_{j}\|}_{2}^{2}\right)},\quad{{\rm{if}}}\; j\in {N}_{k}\left(i\right)\end{array}\right.$$
(44)
where \(i,j\in [1,{N}_{{spot}}]\), and \({x}_{i}\) denotes the reduced t-SNE embeddings of spot \(i\), \({N}_{k}\left(i\right)\) the set of proximity-based \(k\)-nearest neighbors of spot \(i\), and \({p}_{i,j}\) the probability that spot \(i\) is aligned to spot \(j\). Then, the iLISI score is calculated as:$${{\rm{iLISI}}} \,=\, \frac{1}{{N}_{{spot}}}{\sum }_{i=1}^{{N}_{{spot}}}{\left({p}_{i}^{T}{{\bf{B}}}{{{\bf{B}}}}^{T}{p}_{i}\right)}^{-1},$$
(45)
where \({{\bf{B}}}={\left({b}_{1},\, {b}_{2},\cdots,\, {b}_{{N}_{{spot}}}\right)}^{T}\in {{\mathbb{R}}}^{{N}_{{spot}}\times {N}_{{batch}}}\) represents the batch-identity matrix and \({b}_{i}\) represents the one-hot batch-identity vector of spot \(i\).
BatchKL. This metric assesses the effectiveness of batch correction by calculating mixing Kullback-Leibler (KL) divergences. It reflects the batch diversity across aligned datasets, with a lower value indicating more effective batch mixing. Initially, 100 spots are randomly sampled from all batches, followed by the calculation of the regional mixing KL divergence as:$${{\rm{BatchKL}}} \,={\sum }_{b=1}^{B}{p}_{b}\log \frac{{p}_{b}}{{q}_{b}},$$
(46)
where \({q}_{b}\) represents the proportion of spots from batch \(b\) in the entire sample, while \({p}_{b}\) represents the average proportion of spots from batch \(b\) within the \(k\)-nearest neighborhood of each sampled spot in the reduced t-SNE space.
ASW_batch & ASW_type. The two metrics represent the average silhouette width of aligned spots based on their batch identities (ASW_batch) and domain types (ASW_type). A higher silhouette coefficient implies that observations within identical groups form compact clusters, while those belonging to different groups are well separated. Therefore, a lower ASW_batch score indicates more effective batch mixing, while a higher ASW_type score indicates more accurate cross-batch domain (or spot) type alignment.
Adjusted Rand Index (ARI). This metric assesses the spatial clustering results, with a higher value indicating clustering more consistent with the ground truth. Let \(n\) denote the total number of spots, \({n}_{{ij}}\) the number of spots of type \(i\) within cluster \(j\), \({a}_{i}\) the total number of spots of type \(i\), and \({b}_{j}\) the total number of spots within cluster \(j\). Then ARI is calculated as:$${{\rm{ARI}}} \,=\, \frac{{\sum }_{{ij}}\left(\begin{array}{c}{n}_{{ij}}\\ 2\end{array}\right)-\left[{\sum }_{i}\left(\begin{array}{c}{a}_{i}\\ 2\end{array}\right){\sum }_{j}\left(\begin{array}{c}{b}_{j}\\ 2\end{array}\right)\right]/\left(\begin{array}{c}n\\ 2\end{array}\right)}{\frac{1}{2}\left[{\sum }_{i}\left(\begin{array}{c}{a}_{i}\\ 2\end{array}\right)+{\sum }_{j}\left(\begin{array}{c}{b}_{j}\\ 2\end{array}\right)\right]-\left[{\sum }_{i}\left(\begin{array}{c}{a}_{i}\\ 2\end{array}\right){\sum }_{j}\left(\begin{array}{c}{b}_{j}\\ 2\end{array}\right)\right]/\left(\begin{array}{c}n\\ 2\end{array}\right)}.$$
(47)
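A minimal sketch of Eq. (47), computed directly from a made-up contingency table \(n_{ij}\) (rows: true types, columns: clusters); the counts are illustrative assumptions:

```python
import numpy as np
from math import comb

# Toy contingency table: n_ij = spots of true type i in cluster j
n_ij = np.array([[40, 2, 1],
                 [3, 30, 4],
                 [0, 5, 25]])
a = n_ij.sum(axis=1)   # a_i: total spots per true type
b = n_ij.sum(axis=0)   # b_j: total spots per cluster
n = n_ij.sum()         # n: total number of spots

# Pairwise-agreement terms of Eq. (47)
sum_ij = sum(comb(int(x), 2) for x in n_ij.ravel())
sum_a = sum(comb(int(x), 2) for x in a)
sum_b = sum(comb(int(x), 2) for x in b)
expected = sum_a * sum_b / comb(int(n), 2)  # chance-agreement correction

ari = (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
```

A perfectly diagonal table yields ARI = 1, while a random partition yields an ARI near 0, which is the chance correction that distinguishes ARI from the raw Rand index.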
Anomaly subtyping
Multi-type spatial grouping discrepancy (multi-SGD)
This metric assesses the consistency between the anomaly subtyping outcomes and the ground-truth subdomain labels, taking into account the spatial relationships among spots. For \(\kappa > 2\) subtypes, we adopt the One-vs-Rest methodology to calculate an SGD score for each subtype, as detailed in the “Anomalous tissue domain detection” section. This involves mapping annotations generated by the subtyping method to the ground-truth annotations using the COIN-OR Branch and Cut solver49 to solve the following mixed-integer programming problem:$${J}_{t}\left({y}_{i,j}\right) \,=\,{\min }_{{y}_{i,j}}{\sum }_{i=1}^{\kappa }{\sum }_{j=1}^{\kappa }{y}_{i,j}\frac{{N}_{i}}{{N}_{{total}}}{{\rm{SGD}}}\left({p}_{i,{true} \,,\,t}{{\rm{||}}}{p}_{j,{sub},\, t}\right),$$
(48)
$${{\rm{s}}}.{{\rm{t}}}.\,{\sum }_{i=1}^{\kappa }{y}_{i,j}=1,\,{\sum }_{j=1}^{\kappa }{y}_{i,j}=1,\,{y}_{i,j}\in \left\{0,1\right\},$$
(49)
where \(t\in \left\{{{\rm{degree}}},{{\rm{cluster\; coefficient}}}\right\}.\) Here, \({y}_{i,j}=1\) indicates that the \(i\)-th anomalous subtype is mapped to the \(j\)-th subtyping annotation. \({N}_{i}\) denotes the number of spots belonging to subtype \(i\), and \({N}_{{total}}\) the total number of anomalous spots. \({{\rm{SGD}}}\big({p}_{i,{true},t}{||}{p}_{j,{sub},t}\big)\) represents the subtype-specific SGD score for the \(i\)-th true subtype when mapped to the \(j\)-th generated annotation (Eq. (43)). Finally, the multi-SGD for metric \(t\) is determined as:$${{{\rm{SGD}}}}_{{multi}}\left({p}_{{true},t}{{\rm{||}}}{p}_{{sub},t}\right)={J}_{t}\left({y}_{i,j}\right).$$
(50)
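Because Eqs. (48)-(49) define a one-to-one assignment, a small instance can be solved exactly by enumerating permutations rather than invoking a MIP solver; the SGD cost matrix and subtype sizes below are synthetic assumptions, not values from the study:

```python
import numpy as np
from itertools import permutations

# Synthetic SGD scores SGD(p_{i,true,t} || p_{j,sub,t}) and subtype sizes
rng = np.random.default_rng(3)
kappa = 4
sgd = rng.random((kappa, kappa))
n_i = np.array([50, 30, 15, 5])          # spots per true subtype (toy)
w = (n_i / n_i.sum())[:, None] * sgd     # weighted objective terms of Eq. (48)

# Enumerate all one-to-one mappings y_{i,j} (Eq. 49) and keep the cheapest
best_perm = min(permutations(range(kappa)),
                key=lambda p: sum(w[i, p[i]] for i in range(kappa)))
multi_sgd = sum(w[i, best_perm[i]] for i in range(kappa))  # Eq. (50)
```

For larger \(\kappa\), the same problem is a linear assignment instance, so the Hungarian algorithm (or a MIP solver, as in the study) replaces the brute-force enumeration.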
Normalized Mutual Information (NMI). This nonnegative metric evaluates the consistency between clustering results and the ground truth, with a higher value indicating more accurate clustering. NMI is defined as:$${{\rm{NMI}}}=\frac{2{\sum }_{{ij}}\frac{{n}_{{ij}}}{n}\log \left(\frac{n\times {n}_{{ij}}}{{a}_{i}\times {b}_{j}}\right)}{{\sum }_{i}\frac{{a}_{i}}{n}\log \left(\frac{n}{{a}_{i}}\right)+{\sum }_{j}\frac{{b}_{j}}{n}\log \left(\frac{n}{{b}_{j}}\right)},$$
(51)
where \({n}_{{ij}}\) represents the number of true positives of anomaly type \(i\) within cluster \(j\), \({a}_{i}\) the total number of true positives of anomalous type \(i\), and \({b}_{j}\) the total number of true positives within cluster \(j\).
Benchmark methods
Benchmark overview
Anomalous tissue domain detection
For benchmarks of the ATD detection subtask, we select five supervised methods, Spatial-ID, scPred, CHETAH, scmap, and CAMLU, as well as two unsupervised methods, SCEVAN and CopyCAT. All methods except Spatial-ID are originally designed for detecting anomalous single cells in scRNA-seq data. Each supervised method trains a classifier on an annotated reference scRNA-seq dataset, which is then applied to classify target spots into known types. Specifically, Spatial-ID employs a deep neural network (DNN) pretrained on the reference scRNA-seq dataset. During inference, it utilizes a variational graph autoencoder (VGAE) to yield spot embeddings, which are then fed into the DNN classifier to generate spot pseudo-labels. Meanwhile, it trains another self-supervised DNN classifier to predict spot types against their pseudo-labels, identifying spots with a maximum type-assignment probability below a threshold as anomalies. scPred trains a support vector machine (SVM) classifier on the annotated reference using the most informative principal features selected via a Wilcoxon signed-rank test. This classifier is used to classify target cells into known types and identify those with a maximum assignment probability below a threshold as anomalies. CHETAH builds a hierarchical classification tree from reference data, computing cell-type-specific gene expression profiles at each tree node. During inference, target cells are classified by traversing the tree, with the traversal path determined by their correlations (i.e., confidence scores) with gene expression profiles at intermediate tree nodes. Target cells with a confidence score below a threshold at the root node are deemed anomalous. scmap calculates gene expression profile similarities between target cells and cell-type centroids in the reference dataset, assigning each cell to the type with the highest similarity. Target cells whose highest similarity score falls below a threshold are identified as anomalous.
The aforementioned supervised methods identify target cells with low assignment confidence as anomalies, which, however, increases the false-positive risk by confusing normal cells with uncertain assignments for genuine anomalies. Conversely, CAMLU is a reconstruction-based method that sidesteps the requirement for an annotated reference. It trains an autoencoder to reconstruct gene expression in the reference dataset and then applies it to the target dataset, selecting the genes that exhibit the largest discrepancies between the reconstruction errors of the reference and target datasets as discriminative features. These genes are used in a hierarchical clustering to categorize target cells as normal or anomalous.
Unlike the supervised benchmarks, the two unsupervised benchmarks, SCEVAN and CopyCAT, are directly applied to the target ST datasets. Both initially identify a set of highly confident normal cells to serve as a gene copy number baseline. Next, the gene copy number profiles of target cells are estimated against this baseline, using a joint segmentation algorithm in SCEVAN and a Poisson-Gamma model in CopyCAT. Using these profiles, both methods perform hierarchical clustering to group cells into clusters, identifying those significantly enriched in predefined normal cells as normal and the others as anomalous.

Multi-sample ST data alignment
Benchmarks for the ATD alignment subtask include two well-established batch correction methods for scRNA-seq, ComBat and Harmony, alongside two recent methods for ST, GraphST and STAligner. ComBat utilizes a Bayesian framework with empirical priors to estimate and correct both additive and multiplicative batch effects across samples. Harmony clusters cells in a low-dimensional embedded space, maximizing intra-cluster batch diversity, and then applies linear batch correction using the cluster centroids. GraphST is designed for aligning spatially adjacent ST datasets, as it relies on PASTE50 to align histological images of adjacent datasets and acquire consensus spatial coordinates. It corrects batch effects by constructing a shared neighborhood graph that connects spatially adjacent spots across samples, reducing cross-sample batch variations through node-feature smoothing in a self-supervised contrastive graph learning framework. STAligner integrates multiple ST datasets, whether spatially adjacent or not, into a single graph to yield spot embeddings using a GAT, based on which positive and negative spot pairs are identified. Batch variations are then corrected by iteratively optimizing spot embeddings against a contrastive triplet loss computed on the positive and negative pairs.

Anomaly subtyping
Three well-established spatial clustering methods, GraphST, STAGATE, and iStar, are chosen to benchmark ATD subtyping. GraphST models spatial gene expression using a graph, which is further augmented with a locally corrupted counterpart. It then employs a GCN to conduct self-supervised contrastive learning between the two graphs, yielding spatial spot embeddings for spatial clustering. iStar utilizes a hierarchical vision transformer (HViT) pretrained on public histology images to yield both local and global image features for predicting super-resolution gene expression using a weakly supervised feed-forward neural network (FFN). A k-means clustering is then performed on gene embeddings encoded by the penultimate layer of the FFN to cluster spots into different subtypes. STAGATE also models spatial gene expression using a graph and generates low-dimensional spot embeddings using a reconstruction-based GAT, which serve as input to an off-the-shelf clustering algorithm for subtyping.
Benchmark implementations
Anomalous tissue domain detection
This series of experiments spans three scenarios: detecting ATDs from a single target dataset, from multiple target datasets, and by cross-referencing scRNA-seq data. In the first (Exp ID = 1 and 4), second (Exp ID = 3, 5, 6, 7, and 8), and third (Exp ID = 2) scenarios, five supervised methods, including Spatial-ID, scPred, CHETAH, scmap, and CAMLU, are trained on the reference ST dataset, treating spatial spots as single cells and domain types as cell types. These methods are then applied to the target datasets: scPred, CHETAH, scmap, and Spatial-ID compute an assignment confidence score for each target spot, labeling those below an internally specified threshold as anomalous. CAMLU, on the other hand, reconstructs the target datasets, selecting the top 500 feature genes exhibiting significant discrepancies in reconstruction errors compared to the reference dataset. Using these feature genes, target spots are clustered into normal and anomalous groups via hierarchical clustering. During inference, all methods except Spatial-ID treat target spots as single cells; Spatial-ID accounts for spatial relationships among target spots by incorporating their adjacency matrix during spot-embedding generation. Additionally, in the third scenario, the two unsupervised methods, SCEVAN and CopyCAT, are directly applied to the target ST datasets, treating target spots as single cells. Both methods perform hierarchical clustering to group spots into clusters based on their estimated gene copy number profiles. Clusters highly enriched in predefined, highly confident normal spots in the enrichment analysis (P-value \(\le\) 0.05) are identified as normal and the others as anomalous.
In all experiments, benchmark methods adopt their default hyperparameter values, such as anomaly-score thresholds, clustering-algorithm parameters, and significance levels for statistical tests.
Multi-sample ST data alignment
Experiments for this task involve three ST datasets, either vertical or nonadjacent. The benchmarks include two methods for aligning scRNA-seq data, ComBat and Harmony, and two for aligning ST datasets, GraphST and STAligner. ComBat and Harmony are directly applied to the ST datasets, treating spatial spots as single cells and disregarding their spatial relationships. In experiments involving vertical datasets, GraphST first utilizes PASTE to obtain consensus spatial coordinates, with which spatial spots across datasets are positioned in a common tissue space for alignment. In experiments involving nonadjacent datasets (Exp ID = 9, 10, 12, and 14), whose spatial coordinates are unalignable, GraphST uses the original spatial coordinates as a consensus in the alignment process as a compromise. Conversely, STAligner can handle both vertical and nonadjacent ST datasets through contrastive learning with positive and negative pairs of anchor spots. All benchmarks are evaluated with the parameter settings recommended by the original studies.
Anomaly subtyping
Given that the performance of ATD subtyping heavily depends on the quality of detected and aligned anomalous spots, we use composite benchmarks comprising methods specifically designed for each individual task. Experiments for this subtask involve either single or multiple target datasets. In the first scenario (Exp ID = 18-21), six composite benchmarks are constructed: CAMLU-GraphST, scPred-GraphST, CHETAH-GraphST, scmap-GraphST, SpatialID-iStar, and SpatialID-STAGATE. The first method in each composite name identifies ATDs, while the second clusters them into subtypes. Since this scenario involves only a single target dataset, no data alignment method is needed.
The subtyping methods vary in implementation: GraphST and STAGATE convert target spots into embeddings and employ their built-in clustering algorithms to group the embedded anomalous spots identified by the first method into subtypes. iStar is excluded from experiments (Exp ID = 17 and 20) involving Slide-seqV2 and Stereo-seq datasets, both of which lack histology images. In the other experiments, image patches covering the identified anomalous spots are segmented and converted into visual features using a pretrained visual feature extractor. iStar is trained to predict the gene expression profile at each target spot using a DNN, with the penultimate layer's outputs serving as the input spots' embeddings. iStar's k-means clustering algorithm then uses the embeddings of anomalous spots to group them into subtypes. The second scenario (Exp ID = 15-17 and 22-24) involves multiple target datasets, which necessitates data alignment methods. To meet this requirement, GraphST is added to CAMLU-GraphST, scPred-GraphST, CHETAH-GraphST, and scmap-GraphST; ComBat to SpatialID-iStar; and STAligner to SpatialID-STAGATE. GraphST and STAligner align the target datasets, generating post-alignment spot embeddings. The built-in clustering algorithms of GraphST and STAGATE use the post-alignment embeddings of identified anomalous spots to cluster them into subtype groups. Since iStar trains its gene expression predictor on the original gene expression data, ComBat, which preserves the original data scale post-alignment, is combined with SpatialID-iStar. This composite method is used only in experiments wherein the datasets are associated with histology data (Exp ID = 15, 16, 22, and 24). Specifically, Spatial-ID identifies anomalous target spots, iStar extracts visual features from the histology image associated with each target dataset, and ComBat aligns the target datasets.
Subsequently, iStar is trained on the target spots to predict their post-alignment gene expression using the extracted visual features, meanwhile generating spot embeddings from the penultimate layer of the predictor. Finally, iStar's built-in clustering algorithm clusters the identified anomalous spots into subtypes using these spot embeddings. All component methods in the benchmarks adopt their default hyperparameter settings, and the true number of clusters is assumed to be known.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
