Graph attention automatic encoder based on contrastive learning for domain recognition of spatial transcriptomics

With the aim of fully utilizing the available information for spatial domain recognition, we propose a deep learning-based method named GAAEST, which aims to uncover the hidden relationship between gene expression and spatial location. An overview of our spatial domain recognition method is shown in Fig. 1. GAAEST consists of five main components: data pre-processing, neighbor graph construction, data augmentation, auto-encoder, self-supervised contrastive learning for embedding refinement, and spatial clustering. We provide a detailed description of each part below.Data pre-processingData pre-processing operation can provide high-quality data for subsequent GAAEST models. For GAAEST, the inputs are the gene expression matrix and spatial location information of spots. For the purpose of decreasing the difference between high and low expression genes and reducing the impact of noise data, we first perform a logarithmic transformation on the raw gene expression matrix and standardize it to library size. Then, the standardized gene expression matrix is scaled to unit variance and zero mean. At the same time, in order to focus more on genes that contribute significantly to cell changes, we select the first 3000 highly variable genes (HVGs) as input samples, which can effectively reduce dimensions and improve the efficiency of subsequent analysis.After pre-processing, the gene expression matrix is set as X = {x1,x2,…xN} ⊆ RN × F, where N represents the total number of spots, F represents the feature dimension of a spot.Neighbor graph construction and data augmentationThe spatial location information within ST data is valuable for assessing the similarity between spots, which helps to recognize spatial domains. In order to effectively utilize this information, we first construct a neighbor graph.To convert the spatial location information of spots into a neighbor graph G = (V, E), the k-nearest neighbor algorithm is adopted. Here, V represents the set of spots, E represents the set of edges connecting the spots and k is the number of neighbors. Concretely, for a specific spot i ∈ V, we calculate the euclidean distance between this spot and all other spots, and choose the k-closest spots as its neighbors. By analogy, the adjacency relationships between all spots are obtained, and an adjacency matrix A ∈ RN×N is used to denote the adjacency relationships of all spots in the neighbor graph G. If spot j ∈ V is the k-closest neighbor of spot i, then Aij = 1, otherwise 0. To address disparities in spot degrees and ensure that all spots have a more equal influence in calculations, we need to regularize the adjacency matrix A. Let \(\tilde{A}\in {R}^{N\times N}\) be the adjacency matrix after regularization, then$$\tilde{A}=\bar{A}\times A\times \bar{A}$$
(1)
$$\bar{A}={{\rm{diag}}}\left({{\rm{power}}}\left(\mathop{\sum }\limits_{j}^{N}{A}_{j},\,-0.5\right)\right)$$
(2)
where (×) is the meaning of cross product, diag means creating or extracting the diagonal matrix, and power represents the exponentiation.After constructing the neighborhood graph, we use a regularized adjacency matrix \(\tilde{A}\) and a preprocessed gene expression matrix X to represent the adjacency relationship and gene expression information of spots in the neighbor graph, respectively.Meanwhile, we conduct data augmentation operations to generate positive and negative samples for the subsequent self-supervised contrastive learning task. Specifically, given the neighbor graph G = (V, E), we perform a random permutation of the gene expression vectors between each spot while maintaining the original graph structure to obtain a permuted neighbor graph G′ = (V′, E′). At this time, for graph G′, the adjacency relationship of the spots remains unchanged, that is, the adjacency matrix still is \(\tilde{A}\), but the gene expression matrix changes from X to \(X^{\prime} =\{{x}_{1}^{{\prime} },\,{x}_{2}^{{\prime} },\ldots {x}_{N}^{{\prime} }\}\subseteq {R}^{N\times F}\).AutoencoderIn order to effectively integrate gene expression information and spatial location information in ST data, and to remove unnecessary noise from gene expression information, we use an auto-encoder framework to learn the feature embedding and reconstruct gene expression matrix. The designed auto-encoder includes two parts: an encoder Ew and a decoder Dw′.We employ a two-layer graph attention network (GAT)32 as the encoder Ew′. GAT network can utilize an attention mechanism to assign weights to the input data, which enables the model to emphasize key features and reduce attention to irrelevant ones. Encoder Ew takes the regularized adjacency matrix \(\tilde{A}\) and the preprocessed gene expression matrix X as inputs, and outputs the low-dimensional embedding \(Z={E}_{w}(X,\tilde{A})\), \(Z=\{{z}_{1},{z}_{2},{\mathrm{.}}..{z}_{N}\}\subseteq {R}^{N\times F{\prime} }\), where F′ represents the potential embedding dimension of spots. The specific formula of Ew is as follows:$${X}_{i}=\left\{\begin{array}{c}\begin{array}{ccc}{{\rm{ELU}}}({{\rm{BatchNorm}}}({{\rm{GAT}}}_{i}^{(K)}\,({X}_{i-1},\tilde{A}))) & {{\rm{if}}} & i=1\end{array}\\ \begin{array}{ccc}{{\rm{ELU}}}({{\rm{BatchNorm}}}({{\rm{GAT}}}_{i}^{(K)}({{\rm{ELU}}}({X}_{i-1}),\tilde{A}))) & {{\rm{if}}} & 1 \, < \, i \, < \, L\end{array}\\ \begin{array}{ccc}{{\rm{GAT}}}_{i}^{(K)}({{\rm{ELU}}}({X}_{i-1}),\tilde{A}) & {{\rm{if}}} & i=L\end{array}\end{array}\right.$$
(3)
where Xi represents the ith layer’s output, \(GA{T}_{i}^{(K)}\) denotes the ith GAT layer with K heads. Then Z = XL, L signifies the total number of GAT layers in encoder.The decoder \({D}_{w{\prime} }\) takes Z as input and outputs the reconstructed gene expression matrix \(H={D}_{w{\prime} }(Z)\),\(H=\{{h}_{1},h{}_{2},{\mathrm{..}}.{h}_{N}\}\subseteq {R}^{N\times F}\). Since the primary objective of decoder \({D}_{w{\prime} }\) is to reconstruct the original data, it does not focus on feature selection and extraction in the data. Therefore, we directly use two fully connected layers in the decoder, deviating from the GAT structure used in the encoder structure. In \({D}_{w{\prime} }\), the ELU activation function and batch normalization operation are applied to all the hidden layers except the bottleneck layer.Finally, in the auto-encoder module, we adopt the mean square error loss function to calculate the reconstruction loss \({L}_{{\rm{RECON}}}\) of the gene expression matrix, which can be expressed as follows:$${L}_{{\rm{RECON}}}=\mathop{\sum }\limits_{i=1}^{N}\Vert {x}_{i}-{h}_{i}\Vert$$
(4)
where xi and hi are the original and reconstructed gene expression vectors of spot i, respectively.Self-supervised contrastive learningSelf-supervised contrastive learning is a training method that eliminates the need for annotated labels and is employed to learn meaningful feature representations from data. By comparing various views or transformations of the same sample, self-supervised contrastive learning enables the model to learn to group similar samples together and differentiate dissimilar samples. To enhance the informativeness of embedding Z obtained from the encoder Ew, we further refine it through self-supervised contrastive learning across three dimensions: local location based, global feature based, and context feature based contrastive learning.Local location-based contrastive learning (LLCL)Spatial location information in spatial transcriptome data can help to identify differences in gene expression between different tissue structures, thereby facilitating the recognition of spatial domain within spatial transcriptomics. Local location based contrastive learning is specifically designed to enhance the network’ s attention towards properties related to spatial location in spot attributes, so we use it to more accurately distinguish a specific spot from other spots in different locations.For spots in the neighbor graph G, we designate the feature embedding derived from the encoder Ew to be \(Z={E}_{w}(X,A)\), \(Z=\{{z}_{1},{z}_{2},{\mathrm{.}}..{z}_{N}\}\subseteq {R}^{N\times F{\prime} }\). Similarly, for spots in the permuted neighbor graph G′, we set it to be \(Z{\prime} ={E}_{w}(X{\prime} ,A)\), \(Z{\prime} =\{{z{\prime} }_{1},{z{\prime} }_{2},{\mathrm{..}}.{z{\prime} }_{N}\}\subseteq {R}^{N\times F{\prime} }\). Then, for a specific spot i, we set its feature embedding zi as an anchor, its positive pair is \(({z}_{i},{z}_{i}^{{\prime} })\), its negative pairs are \({({z}_{i},{z}_{j}^{{\prime} })}_{i\ne j}\). We utilize cosine similarity function to measure the similarity of feature expression between two spots as follows:$$sim(a,b)=\frac{p(a)\cdot p(b)}{\Vert p(a)\Vert * \Vert p(b)\Vert }$$
(5)
where the function \(p(\cdot )\) represents a projection operation that is constructed using a two-layer multiple-layer perceptron (MLP). Thus, the loss function can be given as follows:$${L}_{1}({z}_{i},{z}_{i}^{{\prime} })=\,\log \left(\mathop{\sum }\limits_{j=1}^{N}{{\rm A}}_{[j\ne i]}\exp (sim({z}_{i},{z}_{j}^{{\prime} }))\right)-\,\log [\exp (sim({z}_{i},{z}_{i}^{{\prime} })/\tau )]$$
(6)
where τ is the temperature hyperparameter, \({{\rm A}}_{[j\ne i]}\subseteq \{0,1\}\) is the indicator function.Considering the neighbor graph G and the permuted neighbor graph G′ are equally important, we also need to set \({z{\prime} }_{i}\) as anchor to calculate the loss \({L}_{1}({z}_{i}^{{\prime} },{z}_{i})\), so the total local location-based contrastive learning loss \({L}_{LLCL}\) can be expressed as:$${L}_{LLCL}=\frac{1}{2N}\mathop{\sum }\limits_{i=1}^{N}[{L}_{1}({z}_{i},{z}_{i}^{{\prime} }),{L}_{1}({z}_{i}^{{\prime} },{z}_{i})]$$
(7)
Global feature-based contrastive learning (GFCL)In addition to considering local location information, we also need to establish a certain relationship between local and global features. This is because the local properties of individual spots are inherently consistent with the global properties when they are obtained from the identical part of the same organism. As a result, we improve the mutual information between the feature representations of every single spot and the global summary of the whole graph, thus allowing the network to cover both local and global-level features.We employ a method resembling local location-based contrastive learning. Firstly, an average aggregation function \(R:Z(Z{\prime} )\subseteq {R}^{N\times F{\prime} }\to s(s{\prime} )\subseteq {R}^{F{\prime} }\) is utilized to get the global summary s or s′ of the graph:$$s=R({E}_{w}(X,\tilde{A}))$$
(8)
$$s^{\prime} =R({E}_{w}(X^{\prime} ,\tilde{A}))$$
(9)
Also, for spot i, we set its feature embedding zi as an anchor. Then its positive sample pair is (zi, s), and its negative sample pair is (zi, s′). A bilinear transformation layer-based discriminator \({D}_{c}:{R}^{F{\prime} }\times {R}^{F{\prime} }\to R\) is designed to give a higher probability score to positive pair than the negative pair. We use the Jensen-Shannon-based binary cross entropy loss to construct the loss \({L}_{2}({z}_{i},s,s{\prime} )\):$${L}_{2}({z}_{i},s,s^{\prime} )=\mathop{\sum }\limits_{i=1}^{N}\, [\log {D}_{C}({z}_{i},s)+\log (1-{D}_{C}({z}_{i},s{\prime} ))]$$
(10)
Since the importance of neighbor graph G and the permuted neighbor graph G′ are equal, we also need to set \({z{\prime} }_{i}\) as anchor to calculate the loss \({L}_{2}({z{\prime} }_{i},s{\prime} ,s)\):$${L}_{2}({z}_{i}^{{\prime} },s^{\prime} ,s)=\mathop{\sum }\limits_{i=1}^{N} \, [\log {D}_{G}({z}_{i}^{{\prime} },s^{\prime} )+\log (1-{D}_{G}({z}_{i}^{{\prime} },s))]$$
(11)
Therefore, the total global feature-based contrastive learning loss \({L}_{GFCL}\) is calculated as:$${L}_{{\rm{GFCL}}}=\frac{1}{2N}({L}_{2}({z}_{i},s,s{\prime} )+{L}_{2}({z}_{i}^{{\prime} },s{\prime} ,s))$$
(12)
Context feature-based contrastive learning (CFCL)Inspired by literature33, we know that pots within the same tissue type tend to exhibit similarities in terms of marker genes and morphological structures. Building upon this knowledge, in addition to global and local level attributes, we further try to learn the embedding with context feature by maximizing the mutual information between the individual spot representation and the cluster-level summary.We apply the K-means clustering algorithm on Z and acquire K initial clusters. Then, the centroid of each cluster \({\phi }_{k}\subseteq {R}^{1\times F{\mbox{‘}}}\) is iteratively updated as follows:$${\phi }_{k}=\frac{{\sum }_{i}{r}_{ik}{z}_{i}}{{\sum }_{i}{r}_{ik}},\,{{\rm{with}}}\,{r}_{ik}=\frac{\exp (-\gamma sim({\phi }_{k},{z}_{i}))}{{\sum }_{k}\exp (-\gamma sim({\phi }_{k},{z}_{i}))}$$
(13)
where k = 1, 2, …, K, and γ represent an inverse-temperature hyperparameter. For every individual spot in neighbor graph G, we aim to maximize the mutual information between its feature embedding zi and corresponding cluster-level summary ci, the value of ci can be calculated as:$${c}_{i}=\sigma \left(\mathop{\sum }\limits_{k=1}^{K}{r}_{ik}{\phi }_{k}\right)$$
(14)
where rik represents the likelihood of assigning spot i to cluster k and \({\sum }_{k=1}^{K}{r}_{ik}=1\).In line with the global feature-based contrastive learning, we design another bilinear transformation layer-based discriminator \({D}_{G}:{R}^{F{\prime} }\times {R}^{F{\prime} }\to R\) to assign a higher score to the positive pair (zi, ci) than the negative pair \(({z}_{i}^{{\prime} },{c}_{i})\). Then, we construct the context feature-based contrastive learning loss LCFCL using the Jensen–Shannon-based binary cross-entropy loss, which is defined as follows:$${L}_{{\rm{CFCL}}}=\mathop{\sum }\limits_{i=1}^{N}{{\rm E}}_{(X,\tilde{A})}[\log {D}_{G}({c}_{i},{z}_{i})]+\mathop{\sum }\limits_{i=1}^{N}{{\rm E}}_{(X{\prime} ,\tilde{A})}[\log (1-{D}_{G}(c{}_{i},{z}_{i}^{{\prime} }))]$$
(15)
weighted by λi, the total contrastive learning loss LCL can be given as:$${L}_{{\rm{CL}}}={\lambda }_{1}{L}_{{\rm{LLCL}}}+{\lambda }_{2}{L}_{{\rm{GFCL}}}+{\lambda }_{3}{L}_{{\rm{CFCL}}}$$
(16)
Total model lossWith the LRECON obtained in the Autoencoder section and LCL obtained in the self-supervised contrastive learning section, we calculate the total loss function LTOTAL of the GAAEST model as follows:$${L}_{{{\rm{TOTAL}}}}=\alpha {L}_{{\rm{CL}}}+\beta {L}_{{\rm{RECON}}}$$
(17)
where α and β are both weight parameters. GAAEST obtains better-reconstructed gene expression H by minimizing the total loss.Throughout the learning process in GAAEST, we construct the reconstruction loss for gene expression and the contrastive learning loss for embedding optimization. The former aims to capture informative features from both spatial location and gene expression, while the latter helps to refine the potential embeddings from three different levels and enable a more comprehensive and discriminative representation.Spatial clusteringAfter the training process, we obtain the optimized reconstructed gene expression matrix H. In order to distinguish domains that exhibit similar gene expression and spatial proximity, it is essential to perform spatial clustering operation. In this case, we use the non-spatial allocation algorithm Mclust to assign spots into different spatial domains.Mclust is a clustering method based on the maximum expectation algorithm. It assumes that the data come from the mixed Gaussian distribution, and determines the best model and clustering result by the maximum likelihood function. Each cluster in the clustering results is considered as a spatial domain, encompassing spots that demonstrate similar gene expression profiles and spatial proximity.To adjust the clustering results further, we design an optional optimization step. For a given spot i, we take the neighboring spots within a radius r as its neighbors, and then the labels assigned to these neighbor spots are queried. If more than 50% of the neighbors are assigned to label n, GAAEST will reassign the label n to spot i. The settings for r vary depending on the data type.GAAEST spatial domain identification algorithmThe GAAEST spatial domain identification algorithm is described as follows:
Algorithm
GAAEST
Input: Gene expression matrix, Spatial location matrix
Number of iteration T, Encoder Ew, Decoder Dw′
Output: Spatial domain recognition results
1: Get preprocessed gene expression X
2: Construct neighbor graph G and permuted graph G’
3: For t = 0 to T-1 do
4: Get feature embeddings Z = Ew (G), Z’ = Ew (G’)
5: Get reconstructed feature H = Dw′ (Z)
6: Calculate reconstruction loss LRECON by Eq. 4
7: Calculate contrastive learning loss LCL by Eq. 16
8: Get total loss LTOTAL = LRECON + LCL
9: Update the parameters of Ew and Dw′ by minimizing the total loss LTOTAL
10: End for
11: H = Dw′ (Ew (G))
12: Apply Mclust to H
13: Return the clustering results, in which each cluster is considered a spatial domain
Experimental detailsIn experiments, we set k = 3 for constructing the neighbor graph. The input, hidden, and output dimensions of the encoder are set to 3000, 256, and 64, respectively. Also, these parameters of the decoder are set to 64, 256, and 3000, respectively. In the contrastive learning module, the temperature hyperparameter is set to 0.5. In the spatial clustering module, the settings for optimization radius r vary depending on the datasets. For the DLPFC and Human breast cancer dataset, the value is 35. For the remaining datasets, the optional clustering optimization step is not needed as they all have fine-grained domains, so the value is set to 0; During the training process, the total loss function is optimized by the Adam optimizer34 with an initial learning rate of 0.001, weight decay of 0, and a total number of training rounds of 600.GAAEST is trained on GeForce RTX A4000 Ti GPU with 16 GB of video memory, 6 x E5-2680 v4 CPU, 28 GB of RAM, and implemented by Python and PyTorch_pyG35. The training time varies from 30 s to 3 min on different datasets.Evaluated metrics and criteriaFor the spatial domain identification task of spatial transcriptome data, the adjusted rand index (ARI), normalized mutual information (NMI), adjusted mutual information (AMI), and Fowlkes–Mallows index (FMI) are employed as evaluation metrics if ground truth exists in the dataset. Silhouette coefficient (SC) score and Davies–Bouldin (DB) score are introduced when the ground truth doesn’t exist. These metrics are widely utilized to assess the effectiveness of spatial domain identification models.Adjusted rand index (ARI): the ARI measures the degree of agreement between the clustering results and the ground truth by comparing the consistency between all pairs of samples in their clusters. Assuming that \(U=\{{U}_{1},{U}_{2},{\mathrm{..}}.{U}_{C}\}\) and \(V=\{{V}_{1},{V}_{2},{\mathrm{..}}.{V}_{C}\}\) are two clustered label sets, the ARI value is calculated by the formula:$${{\rm{ARI}}}=\frac{{\sum }_{i,j}\left(\begin{array}{c}{n}_{ij}\\ 2\end{array}\right)-\left[{\sum }_{i}\left(\begin{array}{c}{n}_{i.}\\ 2\end{array}\right){\sum }_{j}\left(\begin{array}{c}{n}_{.j}\\ 2\end{array}\right)\right]/\left(\begin{array}{c}n\\ 2\end{array}\right)}{\frac{1}{2}\left[{\sum }_{i}\left(\begin{array}{c}{n}_{i.}\\ 2\end{array}\right)+{\sum }_{j}\left(\begin{array}{c}{n}_{.j}\\ 2\end{array}\right)\right]-\left[{\sum }_{i}\left(\begin{array}{c}{n}_{i.}\\ 2\end{array}\right){\sum }_{j}\left(\begin{array}{c}{n}_{.j}\\ 2\end{array}\right)\right]/\left(\begin{array}{c}n\\ 2\end{array}\right)}$$
(18)
where ni., n.j is the number of spots belonging to Ui and Vj, respectively, nij denotes the number of spots corresponding to Ui and Vj.Normalized Mutual Information (NMI): NMI measures the mutual information between two sets. The NMI between U and V is calculated as:$${{\rm{NMI}}}(U,V)=\frac{{{\rm{MI}}}(U,V)}{{{\rm{mean}}}(H(U),H(V))}$$
(19)
where MI is the mutual information between U and V, and H(U) and H(V) are the entropy of the clustering labels, the specific formulas are as follows:$${{\rm{MI}}}(U,V)=\mathop{\sum }\limits_{i=1}^{|U|}\mathop{\sum }\limits_{j=1}^{|V|}P(i,j)\log (\frac{p(i,j)}{p(i)p{\prime} (j)})$$
(20)
$$H(U)=-\mathop{\sum }\limits_{i=1}^{|U|}P(i)\log (P(i))$$
(21)
$$H(V)=-\mathop{\sum }\limits_{j=1}^{|U|}P{\prime} (j)\log (P^{\prime} (j))$$
(22)
where \(P(i)=|{U}_{i}|/N\) and \(P{\prime} (j)=|{V}_{j}|/N\) are the probabilities of the selected spot falling into Ui or Vj.Adjusted mutual information (AMI): AMI is a measure that incorporates the baseline of stochastic consistency and is an improvement over the ARI. The formula for AMI is as follows:$${{\rm{AMI}}}(U,V)=\frac{{{\rm{MI}}}(U,V)-E\{{{\rm{MI}}}(U,V)\}}{\max \{H(U),H(V)\}-E\{MI(U,V)\}}$$
(23)
Fowlkes–Mallows Index (FMI): For the paired samples of the same category, the Fowlkes–Mallows index measures the degree of fit between the clustering results and the ground truth by calculating their similarity. The formula for calculating the FMI is as follows:$${{\rm{FMI}}}=\frac{{\rm{TP}}}{\sqrt{({{\rm{TP}}}+{{\rm{FP}}})({{\rm{TP}}}+{{\rm{FN}}})}}$$
(24)
where TP and FP denote the number of samples that are correctly or incorrectly clustered in the ground truth, respectively. FN denotes the number of positively clustered samples that are not correctly identified.Silhouette coefficient (SC) score: SC is computed by evaluating the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The SC score for a sample is calculated as (b − a)/max(a, b), and the optimal SC score is 1, indicating well-separated and compact clusters, while the worst score is −1, indicating poor clustering.Davies–Bouldin (DB) score: DB score provides an assessment of clustering quality by considering the average similarity between each cluster and its most similar cluster. Similarity is measured based on the ratio of within-cluster distances to between-cluster distances. A lower DB score indicates better clustering, with a minimum score of zero.Methods used for comparisonTo demonstrate the effectiveness of GAAEST in spatial domain identification, we compare GAAEST with six state-of-the-art methods, including SEDR13, stLearn12, SpaGCN14, SpaceFlow20, STAGATE16, and GraphST21. To offer a comparative analysis of these methods, we present Supplementary Table 1, providing a summarized overview of some approaches, which are the state-of-the-art methods using feature extraction or contrastive learning. For each dataset, the number of target clusters set by the six comparison methods is the same. For SEDR, STAGATE, and GraphST, all parameters are kept at their default settings. For stLearn, in order to get better clustering results, we assign distinct weight values for each dataset within its SME_normalize function. Specifically, In the case of DLPFC, mouse embryo and human breast cancer data, weights = “physical_distance”. In the case of mouse brain anterior data, weights = “weights_matrix_pd_md “. For SpaGCN, we set the parameter about “Histology” to “False” if no histology image exists in the dataset. For SpaceFlow, we adjust the resolution in clustering to get the target number of clusters, and the other parameters are default. The results of all comparison methods in the experiments are reproduced by us under the same experimental environment.Statistics and reproducibilityNo statistical method was used to predetermine the sample size. All collected data were included in the analyses without exclusion. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles