Integrating large-scale single-cell RNA sequencing in central nervous system disease using self-supervised contrastive learning

Overview of scCM architecture

scCM is built on a momentum contrastive learning framework (MoCo v3 [35]) to learn informative representations of CNS scRNA-seq data (Fig. 1a). It comprises three modules: an encoder, a momentum encoder, and a predictor head, all constructed using fully connected neural networks. The encoder and momentum encoder receive a pair of gene expression vectors as input. The gene expression vector fed into the encoder is transformed into an embedding and then projected by the predictor head into a vector q (the query). Simultaneously, the gene expression vector fed into the momentum encoder is transformed into a vector k (the key). If the two inputs are the same, they are labeled as a positive pair (q, k+); otherwise, they are labeled as a negative pair (q, k-). The InfoNCE loss function is used to minimize the distance between positive pairs and maximize the distance between negative pairs. During training, the encoder and predictor head are updated by the gradients back-propagated from the InfoNCE loss. The momentum encoder is updated using the momentum strategy (also known as the exponential moving average) based on the learned weights of the encoder. Specifically, its weights are updated as a combination of the current encoder weights and the momentum encoder’s previous weights, moderated by a momentum coefficient. scCM thereby promotes spatial proximity among similar cells while gradually distinguishing dissimilar cells. As a result, cells of the same type cluster together as closely as possible, while cells of different types separate as far as possible. After training, the embedding vectors produced by the encoder can be regarded as representations of CNS cells that can be used for downstream tasks.

Fig. 1: Illustration of scCM architecture and CNS datasets. a scCM is constructed using a momentum contrastive learning framework with symmetric encoders.
Its goal is to minimize the embedding distance between similar CNS cells/clusters and maximize the embedding distance between dissimilar CNS cells/clusters. The embeddings learned by the encoder can serve as informative representations for various downstream tasks, such as clustering, batch effect correction, and cell annotation. b Geographic distribution of the collected CNS datasets. c Species and diseases included in the CNS datasets. d Data distribution of each category of CNS data.

CNS datasets

We gathered CNS scRNA-seq datasets related to neurological diseases that were deposited in the GEO database over the past five years. A total of 18 datasets with cell-type labels were selected from CNS studies worldwide [2,36,37,38,39,40,41,42,43,44,45,46,47,48,49] (Fig. 1b). Additionally, two unlabeled datasets were obtained to assess the annotation capabilities of scCM: an Alzheimer’s disease (AD) dataset [50] from the GEO database and a cerebral tumor dataset obtained from our institution. Together, the datasets comprise 924,425 cells from 4 continents (America, Asia, Europe, and Oceania), encompassing 4 species (human, primate, rodent, and fish) (Fig. 1c) and covering ten subcategories of CNS diseases (AD, Alzheimer’s disease; BM, brain metastasis; GBM, glioblastoma; HD, Huntington’s disease; MB, medulloblastoma; MCD, malformations of cortical development; MS, multiple sclerosis; NHD, Nasu-Hakola disease; TBI, traumatic brain injury) (Fig. 1d). Based on the number of cells, the datasets are categorized into 2 small-scale datasets (<10,000 cells), 7 medium-scale datasets (10,000-50,000 cells), and 9 large-scale datasets (>50,000 cells). Two of the large-scale datasets have cell counts exceeding 100,000.
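
The query/key training scheme described in the architecture overview can be illustrated with a small sketch. The code below is a minimal pure-Python illustration of an InfoNCE loss over a batch and of an exponential-moving-average (momentum) weight update; it is not the scCM implementation. The function names, the cosine-similarity logits, the temperature of 0.2, and the momentum coefficient of 0.99 are all assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.2):
    """Mean InfoNCE loss over a batch.

    keys[i] is treated as the positive key for queries[i]; every other
    key in the batch acts as a negative. The temperature is an assumed value.
    """
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the positive key as the target class.
        total += log_sum - logits[i]
    return total / len(q)


def momentum_update(key_weights, query_weights, m=0.99):
    """EMA update: key <- m * key + (1 - m) * query (m is an assumed value)."""
    return [m * kw + (1.0 - m) * qw for kw, qw in zip(key_weights, query_weights)]


# Toy batch: the loss is small when each query matches its own key
# and large when the keys are swapped.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch only the query side would receive gradients; the key side changes solely through `momentum_update`, which is what keeps the keys slowly varying during training.
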
Detailed characteristics of all datasets are provided in Supplementary Table 1.

scCM efficiently improves the performance of CNS scRNA-seq analysis

To demonstrate the effectiveness of scCM in CNS scRNA-seq analysis, we evaluated it on 4 CNS datasets (Anderson, Fournier, Ryan, Zhou), which contain rich information about cell annotations, groups, and batches. We compared scCM with several popular methods, including Seurat [12,13], Harmony, LIGER, scVI [14], MARS [15], CLEAR [32], and Concerto [4]. We also compared it with scGNN, DESC, SCLSC, and SMILE (Supplementary Methods). Performance was evaluated using 7 metrics (Supplementary Methods): Accuracy (Acc), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), k-nearest neighbor Batch Effect Test (kBET), Silhouette Score (SS), Homogeneity Score (HS), and V-Measure Score (VMS).

scCM achieves the best performance across all 4 datasets in terms of Acc, ARI, and VMS, while MARS achieves the best batch-effect correction as measured by kBET (Fig. 2a and Supplementary Table 2). We visualized the clustering, batch-effect correction, and groups specifically on the Fournier dataset (Fig. 2b and Supplementary Fig. 1). These results demonstrate that scCM achieves the highest similarity and consistency between clustering results and true categories. The MARS model, with its strong learning abilities, employs a nonlinear embedding function to cluster similar cell types and is characterized by strong batch effect elimination. However, this approach may blur boundaries between closely related cells or subtypes, as seen with the tight grouping of DC1, DC2, and pDC, and the mixing of Neuroblast and Neuron (Fig. 2c). These findings suggest that MARS over-corrects batch effects, which makes it harder to differentiate similar cell subgroups. Similar conclusions can be drawn from the other three datasets (Supplementary Fig. 2).
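
ARI, one of the seven metrics listed above, can be computed directly from the contingency counts between true and predicted labels. The sketch below is a minimal pure-Python illustration of the standard pair-counting formula; it is not the evaluation code used in this study, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the partitions trivially agree

    # Pairs of cells that share both a true label and a predicted label.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs of cells that share a true label / a predicted label.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


# Cluster IDs are arbitrary, so a relabeled but identical partition scores 1.0.
truth = ["astrocyte", "astrocyte", "neuron", "neuron"]
perfect = adjusted_rand_index(truth, [1, 1, 0, 0])
# A clustering that splits every cell type evenly scores below zero.
mixed = adjusted_rand_index(truth, [0, 1, 0, 1])
```

Because ARI is corrected for chance, a random assignment scores near 0 rather than near the raw fraction of agreeing pairs, which is why it is a stricter summary than accuracy alone.
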
The clustering efficiency of scCM is comparable to that of Concerto; both are based on contrastive learning frameworks. However, scCM exhibits stronger batch effect correction capabilities. Unlike CLEAR and Concerto, scCM does not use a data augmentation approach.

Fig. 2: Performance comparison and visualization between scCM and 7 popular methods. a Performance comparison of different methods on 4 CNS datasets. Based on the weighted average of all performance metrics, the scCM and MARS models demonstrate superior performance. b UMAP visualization of the compared methods on the Fournier dataset. c UMAP visualization of MARS, Concerto, and scCM for subclusters of the Fournier dataset. d Performance of the compared methods under different dropout rates.

We also evaluated the impact of a high dropout rate and the efficiency on large-scale CNS datasets. A high dropout rate makes CNS scRNA-seq analysis more challenging because it results in the loss of gene information. scCM achieves the best ARI under all tested dropout rates (Fig. 2d), which indicates strong robustness. Interestingly, both MARS and scCM exhibit better batch correction in terms of kBET as the dropout rate increases. This result suggests that batch correction can be improved by appropriately reducing the number of highly variable genes (Supplementary Fig. 3). To evaluate efficiency on large-scale datasets, we analyzed the runtime of various deep learning-based methods (Table 1). To assess the running time on a million-scale dataset, we integrated all collected datasets (n = 924,425) for testing.
The findings revealed that scCM can handle clustering tasks on million-scale datasets, with a running time of 1122.4 ± 72.7 sec, demonstrating its capability to handle clustering tasks for larger-scale datasets.

Table 1 The running time of various comparison models in the 4 datasets (sec.)

scCM provides promising clustering analysis across large-scale CNS datasets

We further investigated the generalization of scCM in CNS scRNA-seq analysis, considering various critical factors such as sample size, number of clusters, highly variable genes (HVGs), species, and diseases. We first assessed the impact of sample size and number of clusters across datasets ranging from 789 to 105,332 cells and from 4 to 21 clusters (Fig. 3a). Leiden clustering [51] on scCM embeddings achieves outstanding clustering performance, with Acc and ARI greater than 0.9, except on the Siddique dataset (ARI of 0.84). We then investigated the impact of the number of HVGs on clustering analysis in different CNS datasets (Fig. 3b and Supplementary Tables 3 and 4). Although a limited number of HVGs poses significant challenges for CNS cell representation learning, scCM clearly distinguishes clusters even with an extremely small number of HVGs (Supplementary Figs. 4 and 5). Analysis of variance results indicate that scCM’s performance is more stable across datasets when using the top 3000 HVGs. Therefore, to facilitate training and comparison across multiple datasets, we set the top 3000 HVGs as the default parameter for our model. Furthermore, we examined the robustness of scCM across continents, species, and diseases (Fig. 3c). scCM maintains high performance in terms of Acc and ARI on CNS datasets from different continents, species, and diseases.
These results highlight the strong generalization and robustness of scCM, which effectively captures the variations across diverse CNS cells, demonstrating its suitability for the species and CNS diseases examined in this study. scCM also demonstrated favorable clustering outcomes on non-neural tissue datasets [52].

Fig. 3: scCM presents promising clustering analysis for large-scale CNS datasets. a Performance of scCM on 18 labeled CNS datasets. b The effect of the number of highly variable genes on the clustering performance of scCM in terms of ARI (the dots represent different datasets). c Clustering performance comparison on different continents, species, and CNS diseases (the value of n in parentheses represents the number of datasets included). d scCM effectively groups similar brain metastatic subtypes. e The relationship between cell clusters in tumors and oligodendrocytes. f scCM is harnessed to annotate unknown cells in Fournier data.

Moreover, we examined whether scCM captures reliable cluster relationships, in which clusters with similar functions are projected close to one another. In the analysis of the Gonzalez dataset with five brain metastatic types (Fig. 3d), subtypes of the same cancer are adjacent. Specifically, the three breast subtypes are located at the top, while the two ovarian subtypes and three melanoma subtypes are on the left and bottom, respectively. For the lung metastases, however, Lung1 is clearly separated from the two non-small cell lung cancer clusters (Lung2 and Lung3) because Lung1 is a small cell carcinoma cluster. Furthermore, in Datta’s dataset, IDH-mutant oligodendrocytes are positioned close to glioblastoma by scCM (Fig. 3e). Therefore, scCM can effectively represent functional similarity in the visualization results.

To leverage the ability of scCM to capture cluster relationships, we employed it to annotate the unknown cells in Fournier data. When visualizing cell clusters in Fournier data (Fig.
3f), we observed that neuroimmune-related cells tend to congregate in the upper-right region, while cells that make up substance circulation pathways are predominantly located in the lower-right region. The unknown cluster and neuron cells are located in the lower-left region. Thus, according to the spatial distribution, we can deduce that the unknown cluster is situated close to astrocytes, neurons, and neuroblasts. We therefore hypothesized that the unknown cluster is likely a kind of glia with a repairing function similar to that of astrocytes. To verify this assumption, we conducted GO enrichment analysis and found that the genes expressed in the unknown cluster are related to neural injury repair, myelination, and neurotransmitter function. By comparing marker genes (Fig. 3f), we identified the unknown cluster as olfactory ensheathing cells, a specialized glial cell type that promotes axon growth, myelination, and neural nourishment [53,54].

CNS cells integration for unveiling the relationship between cell types/subtypes and neurodegenerative diseases

Here, we employed scCM to integrate data from 4 neurodegenerative diseases (AD, HD, MS, and NHD) in 5 controlled human trials to comprehensively analyze cell types and states, subsequently revealing relationships between CNS cell types/subtypes and neurodegenerative diseases. To ensure data consistency, we first normalized and cleaned all datasets by removing cells with inaccurate annotation information and those with ambiguous or unspecified types. Cell types with fewer than 2000 cells were also excluded. After preprocessing, about 280,000 cells remained, including oligodendrocytes (114,447 cells), neurons (113,086 cells), astrocytes (31,016 cells), oligodendrocyte precursor cells (11,784 cells), microglia (9747 cells), and endothelial cells (2688 cells). Finally, all cells were integrated by scCM, which corrected batch effects and divided them into clusters.
The visualizations in Fig. 4a and Supplementary Fig. 6 demonstrate that scCM effectively removes batch effects and accurately aligns cells across datasets, despite differences in experimental protocols, sequencing technologies, and gene expression measurements.

Fig. 4: Visualization of integrated 4 neurodegenerative disease datasets. a UMAP visualization of the combined datasets before and after batch correction with scCM. b Astrocytes are divided into five subtypes with distinct percentages in 4 neurodegenerative diseases. c Bar plot showing the relative percentage of different astrocyte subtypes in the control group and the 4 disease subgroups. d Heatmaps of gene expression in neurodegenerative disease groups and astrocyte subpopulations.

Because neurodegenerative diseases are typically characterized by progressive heterogeneity of glial cells, we focused on a well-known type of glial cell, astrocytes, which have been implicated in various neurodegenerative diseases [55,56]. After eliminating non-astrocytic cells from the reference, astrocytes were divided into five subtypes with distinct percentages in 4 neurodegenerative diseases (Fig. 4b and Supplementary Fig. 7). Specifically, the Astro 0 subtype is the principal component in HD and is rarely found in AD and NHD (Fig. 4c). Moreover, based on the comparison of highly variable genes (Fig. 4d), Astro 0 exhibits high expression of IGF2R, a gene known to play a crucial role in cognitive processes such as learning and memory. Therefore, Astro 0 can be used to annotate the HD samples. In addition, both the Astro 1 and Astro 4 subtypes are found in all 4 neurodegenerative diseases, indicating that they play critical roles in nervous system development. However, Astro 1 is the principal component in AD and NHD, comprising more than 80% of astrocytes, which demonstrates that the two diseases share gene expression profiles during CNS cell functional changes. This is consistent with the finding that AD and NHD have a close relationship [40].
In MS, by contrast, Astro 4 accounts for 60% of astrocytes, while the remaining 40% consists of the other three astrocyte subtypes. Thus, cell heterogeneity is higher in MS than in the other neurodegenerative diseases, resulting in more complex gene expression profiles in MS. Notably, the Astro 2 and Astro 3 subtypes are found exclusively in MS samples. According to the analysis of highly expressed genes, Astro 2 is identified as a potential inflammatory astrocyte type expressing up-regulated immune-related and apoptosis-related genes (HSPs, NAMPT, and TPST1). The extensive neuronal loss in neurodegenerative diseases is attributed to apoptosis, and emerging evidence suggests that dysregulated apoptosis may be involved in the pathogenesis of MS [57,58]. Meanwhile, Astro 3 is enriched with genes related to neuron development, especially cilium function; cilia are tiny microtubule-based signaling devices that regulate diverse physiological functions. Therefore, the development of MS is potentially associated with the heterogeneous transformation of astrocyte subtypes, with the emergence of inflammatory astrocytes being a cause for concern. Furthermore, we also observed heterogeneity among oligodendrocytes and microglia (Supplementary Fig. 8).

By using scCM, the integration of large-scale CNS datasets can facilitate comparative analysis across different CNS diseases, enabling the exploration of cellular heterogeneity under various physiological and pathological conditions and contributing to a comprehensive exploration of CNS disease causality.

CNS cell annotation by a metadata reference

According to the above results, scCM demonstrates outstanding robustness in CNS cell clustering by bringing similar CNS cells closer and separating dissimilar ones. Therefore, we believe that scCM has the potential to serve as an annotation method to identify unknown cells.
Unlabeled CNS cells are first mapped into the spatial distribution of a metadata reference and are then annotated based on the nearest cell clusters in the reference.

We first used scCM to annotate Soreq’s AD data, with the integrated dataset above as the metadata reference. All cells in the Soreq data were manually identified according to specific marker genes (Fig. 5a), and all cell types are present in the metadata reference. We compared the annotation performance of scCM with that of MARS. scCM successfully aligns cell types in the Soreq data with the metadata reference, including the small cluster of endothelial cells (n = 381) (Fig. 5b). However, MARS struggles to align Soreq’s data with the metadata reference, making it challenging to classify and annotate subpopulations within the clusters. In particular, MARS erroneously annotates endothelial cells as oligodendrocytes, disregarding the distinct functional characteristics of these two cell clusters. These results indicate that scCM provides a more similar spatial distribution between the metadata reference and Soreq’s data. Furthermore, the two confusion matrices demonstrate that scCM exhibits greater sensitivity in annotating small cell clusters.

Fig. 5: Annotation of unknown cells. a Manual annotation of Soreq’s dataset from patients diagnosed with Alzheimer’s disease (AD). b The annotation by scCM and MARS, and the confusion matrices of annotation measured in terms of accuracy. c Pituitary tumor cells mapped on the reference and divided into 4 clusters by scCM. d The visualization of clusters showing the relationship between the reference and the 4 clusters of pituitary tumor cells by scCM and MARS.

To further assess the annotation performance of scCM in recognizing unobserved cell types, we then used the integrated neurodegenerative disease dataset above as a metadata reference to annotate the pituitary tumor data.
The pituitary tumor samples are divided into 4 clusters in the metadata reference (Fig. 5c), and each cluster has distinct gene expression patterns (Supplementary Fig. 9). Specifically, cluster 4 appears to be associated with endothelial cells in the metadata reference, suggesting that cluster 4 can be annotated as endothelial cells. This annotation is further corroborated by the expression of the marker genes CLDN5 and ITM2A. The other three cell clusters are distinct from the reference, since their cell types are not observed in the metadata reference. Cluster 1 is annotated as corticotrophs based on the marker genes TBX19 and NEUROD1. Clusters 2 and 3 are close to microglia, which are intracranial immune cells responsible for regulating the maintenance of neuronal networks. Based on marker genes, clusters 2 and 3 are identified as macrophages and T cells, respectively. This result demonstrates that scCM is not only robust in rejecting annotation for unobserved cell types but also offers relative spatial information about unobserved cell types in the reference space. In contrast, although MARS also has the potential to detect novel clusters, it lacks a reliable spatial distribution in which similar CNS cells are in close proximity, which leads to incorrect spatial information for cell annotation (Fig. 5d). These results demonstrate that scCM efficiently annotates CNS cell types.
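
The reference-based annotation described in this section, in which unlabeled cells are assigned to the nearest cluster in the metadata reference and unobserved cell types are rejected, can be illustrated with a small sketch. The code below is a hypothetical pure-Python illustration, not the scCM annotation procedure: the use of cluster centroids, Euclidean distance, the `max_distance` rejection threshold, and all function names and toy embeddings are assumptions.

```python
import math


def cluster_centroids(embeddings, labels):
    """Mean embedding of each labeled reference cluster."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + x for s, x in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def annotate(query, centroids, max_distance):
    """Label of the nearest reference centroid, or 'unassigned' if too far.

    max_distance is an assumed rejection threshold: query cells farther than
    this from every centroid are treated as an unobserved cell type.
    """
    best_label, best_dist = None, math.inf
    for lab, centre in centroids.items():
        d = math.dist(query, centre)  # Euclidean distance in embedding space
        if d < best_dist:
            best_label, best_dist = lab, d
    return best_label if best_dist <= max_distance else "unassigned"


# Toy 2-D reference embeddings with two labeled cell types.
reference = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
ref_labels = ["astrocyte", "astrocyte", "neuron", "neuron"]
centroids = cluster_centroids(reference, ref_labels)

near = annotate([1.0, 1.0], centroids, max_distance=3.0)   # close to astrocytes
far = annotate([50.0, 50.0], centroids, max_distance=3.0)  # far from every cluster
```

Even for a rejected cell, the distances to each centroid remain informative, which corresponds to the relative spatial information about unobserved cell types discussed above.
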

