scCAD: Cluster decomposition-based anomaly detection for rare cell identification in single-cell expression data

Overview of scCAD

Single-cell RNA sequencing data often consist of a diverse range of cell types, each characterized by specific functions and significant variations in cell counts. This can complicate the identification of rare cell types during initial clustering, as they may be indistinguishable from major cell types based on partial or global gene expression.

To tackle this challenge, scCAD employs an ensemble feature selection method to effectively preserve differentially expressed (DE) genes in rare cell types. Similar to GiniClust and CIARA, scCAD emphasizes the importance of the feature selection procedure, which plays a crucial role in clustering. In contrast to traditional approaches that rely solely on the most variable genes for analysis, scCAD also incorporates the most important genes, which are identified using the initial clustering labels of cells based on global gene expression and a random forest model26,27. Then, scCAD decomposes the major clusters in the initial clustering through iterative clustering based on the most differential signals within each cluster. After cluster decomposition, clusters serve as the fundamental units rather than individual cells. We define the dominant cell type of a cluster as the type to which the majority of cells in the cluster belong. The rarity of specific cell types is reflected in the number of clusters they dominate: the number of clusters dominated by rare cell types is significantly lower than the number dominated by major cell types. For improved computational efficiency, scCAD reduces the number of clusters by merging some of the nearest clusters, that is, the clusters with the smallest Euclidean distance between their centers. The sets of clusters obtained from the initial clustering, cluster decomposition, and cluster merging are defined as I-clusters (initial clusters), D-clusters (decomposed clusters), and M-clusters (merged clusters), respectively.
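As an illustration of the merging step described above, the following is a minimal sketch that repeatedly merges the two clusters whose centers are closest in Euclidean distance. The function names, the dictionary layout, and the stopping rule (`target_count`) are assumptions made for this example; the exact merging schedule used by scCAD is described in the "Methods" section and may differ.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def merge_nearest_clusters(clusters, target_count):
    """Greedily merge the closest pair of clusters until target_count remain.

    clusters: dict mapping a cluster label to a list of points (tuples).
    target_count: hypothetical stopping rule, assumed for this sketch.
    """
    merged = {label: list(points) for label, points in clusters.items()}
    while len(merged) > max(target_count, 1):
        centers = {label: centroid(points) for label, points in merged.items()}
        # Pick the pair of clusters with the smallest distance between centers.
        keep, drop = min(
            combinations(sorted(merged), 2),
            key=lambda pair: dist(centers[pair[0]], centers[pair[1]]),
        )
        merged[keep].extend(merged.pop(drop))
    return merged


example = {
    "A": [(0.0, 0.0), (0.0, 1.0)],
    "B": [(0.5, 0.5)],
    "C": [(10.0, 10.0)],
}
result = merge_nearest_clusters(example, target_count=2)
```

In this toy input, clusters "A" and "B" have the closest centers, so they are merged first and the distant cluster "C" is left untouched.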
For each cluster in M-clusters, scCAD utilizes differential expression analysis to identify a specific list of candidate DE genes. Due to their limited quantity, rare cell types exhibit a higher degree of independence in the corresponding DE gene list of their respective cluster. scCAD employs an isolation forest model28 using the candidate DE gene list to calculate the anomaly score of all cells. An independence score is computed by assessing the overlap between highly abnormal cells and those within the cluster, serving as a measure of each cluster's rarity. Figure 1 shows a schematic pipeline of scCAD, and the "Methods" section provides a comprehensive explanation of the step-by-step process in scCAD.

Fig. 1: Overview of scCAD. scCAD employs an ensemble feature selection approach, combining the benefits of highly variable genes (HVG) and highly important genes (HIG). It then decomposes the major clusters in I-clusters through iterative clustering. To enhance computational efficiency, certain nearest clusters are merged. For each cluster in M-clusters, scCAD conducts anomaly detection by analyzing the corresponding differentially expressed (DE) genes, assigning an independence score to each cluster. Finally, scCAD provides the user with several potential rare cell clusters according to the independence score.

Benchmarking scCAD in real datasets

To comprehensively evaluate scCAD, we compare it with ten state-of-the-art methods designed for identifying rare cell types across twenty-five real scRNA-seq datasets representing diverse biological scenarios. The specifics of these datasets can be found in Supplementary Table 1. The evaluation of different methods is conducted using the F1 score for rare cell types, which effectively captures the trade-off between precision and sensitivity (Supplementary Table 2 and Fig. 2a). As shown in Fig. 
2a and Supplementary Table 2, scCAD achieves the overall highest performance (F1 score = 0.4172) and exhibits performance improvements of 24% and 48% compared to the second and third-ranked methods (SCA: 0.3359, CellSIUS: 0.2812), respectively.

Fig. 2: Evaluating scCAD against ten state-of-the-art methods for identifying single-cell rare types on twenty-five real datasets. a Comparing the distribution of F1 scores across all datasets (n = 25 datasets) in identifying rare cell types. Boxes extend from the first to the third quartile (Q1–Q3) with a line in the middle that represents the median. Lines extending from both ends of the box indicate variability outside Q1 and Q3. The minimum and maximum whisker values are calculated as \(\mathrm{Q1}-1.5\times\mathrm{IQR}\) and \(\mathrm{Q3}+1.5\times\mathrm{IQR}\), respectively. b Comparison of the total number of datasets in which each method successfully identifies at least one rare cell type. Source data are provided as a Source Data file.

In addition to the F1 score, we employ four other measurements: the accuracy of identifying rare cell types, G-mean (geometric mean of precision and recall), Cohen's Kappa, and Matthews correlation coefficient (MCC). The accuracy of identifying rare cell types is defined as \(\mathrm{ACC}_{\text{rare cell type}}=\frac{\mathrm{TRC}}{\mathrm{IC}}\), where TRC represents the number of correctly identified rare cells and IC represents the total number of cells predicted as rare cell types. Since rare cell identification methods do not provide prediction probabilities, we do not use AUC as an evaluation metric. Supplementary Fig. 1 shows the distribution of the performance measured by these four metrics across all datasets. The detailed data are provided in Supplementary Tables 3, 4, 5, and 6, respectively. As shown in Supplementary Fig. 
1, scCAD demonstrates the overall highest performance (Accuracy = 0.4156, G-mean = 0.4412, Kappa = 0.3933, and MCC = 0.4162) and exhibits performance improvements of 28%, 19%, 26%, and 21% compared to the second-ranked method (SCA: Accuracy = 0.3239, G-mean = 0.3704, Kappa = 0.3128, MCC = 0.3449), respectively.

Furthermore, we show the distribution of rankings for different methods across five measurements, including the F1 score, on each dataset (Supplementary Fig. 2). As shown in Supplementary Fig. 2, scCAD is one of the top three algorithms on 16 (MCC and Kappa) and 17 (Accuracy, F1 score, and G-mean) of the 25 datasets.

During the testing process, we observed that several methods are not sufficiently adaptable to all datasets representing diverse scenarios. For example, the GiniClust series of methods may introduce errors by failing to identify high Gini genes29. Furthermore, certain methods (such as RaceID) may encounter challenges in generating results for datasets containing more cells due to lower computational efficiency12. In contrast, scCAD, EDGE, FiRE, SCA, and SCISSORS can run effectively on all datasets, showcasing their greater suitability for data analysis across a wide range of biological scenarios.

To further evaluate the performance of these algorithms, we count the total number of datasets in which each method successfully identifies at least one rare cell type (Fig. 2b). Let \(S_{\mathrm{pre}}\) be the set of rare cells identified by one method and \(S_t\) be the set of cells of rare type t, and let |S| denote the size of a set S. If \(\frac{|S_{\mathrm{pre}}\cap S_t|}{|S_t|}\) is larger than 30%, we consider that this method successfully identifies cell type t. This threshold is inspired by certain cell annotation methods30, which suggest that accurately annotating a cell type can be achieved based on the information from 30% of cells belonging to that type. 
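The success criterion above can be written as a short set computation. The sketch below is only an illustration: the function names and the use of cell identifiers stored in Python sets are assumptions, and the F1 helper simply follows the standard definition (harmonic mean of precision and recall over the predicted and true rare cells).

```python
def recovered_fraction(predicted_rare, cells_of_type):
    """Fraction of the cells of rare type t that the method flags as rare."""
    predicted_rare, cells_of_type = set(predicted_rare), set(cells_of_type)
    if not cells_of_type:
        raise ValueError("cells_of_type must not be empty")
    return len(predicted_rare & cells_of_type) / len(cells_of_type)


def identifies_type(predicted_rare, cells_of_type, threshold=0.30):
    """Apply the 30% rule: success if the recovered fraction exceeds it."""
    return recovered_fraction(predicted_rare, cells_of_type) > threshold


def rare_f1(predicted_rare, true_rare):
    """Standard F1 score over predicted and annotated rare cells."""
    predicted_rare, true_rare = set(predicted_rare), set(true_rare)
    true_positives = len(predicted_rare & true_rare)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_rare)
    recall = true_positives / len(true_rare)
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical cell identifiers.
type_t = {f"cell{i}" for i in range(10)}
prediction = {"cell0", "cell1", "cell2", "cell3", "other1", "other2"}
print(recovered_fraction(prediction, type_t))  # 4 of 10 cells recovered
print(identifies_type(prediction, type_t))
print(rare_f1(prediction, type_t))
```

Because the comparison is strict, a method that recovers exactly 30% of a rare type is not counted as successful under this sketch, which matches the wording "larger than 30%".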
The total number of rare cell types successfully identified by different methods across all datasets is shown in Supplementary Table 7. As illustrated in Fig. 2b and Supplementary Table 7, scCAD demonstrates advantages by successfully identifying rare types in 20 datasets. Meanwhile, we further compare scCAD with four other methods (CellSIUS: 16, EDGE: 16, GapClust: 12, and SCA: 18). By combining Supplementary Tables 2 and 7, we calculate the average F1 score of these five methods on the corresponding datasets where they successfully identified rare cell types. scCAD also demonstrates an advantage (F1 score = 0.5208) compared to the other methods (CellSIUS: 0.3339, EDGE: 0.3954, GapClust: 0.2940, and SCA: 0.4661).

We observe significant variation in the number of rare cell types across different datasets in Supplementary Table 7: there are 11 datasets with two or more rare cell types and 14 datasets with only one rare cell type. As shown in Supplementary Table 7, scCAD can identify the rare cell type in 10 of the 14 datasets and can also identify two or more rare cell types in 8 of the 11 datasets. In summary, scCAD excels at identifying rare cell types in diverse biological scenarios.

Feature selection effectively preserves the rare cell type-specific genes

Feature selection is crucial for identifying rare cell types, as it aids in extracting and preserving key features specific to these types, thereby reducing noise and redundant information and improving the ability to identify and distinguish these types. Most current methods for identifying rare cell types rely on a specific set of highly variable genes (HVG)12,13,15, which exhibit significant expression changes across cells, thus potentially providing more information. Our previous study demonstrated that highly important genes (HIG) selected by random forests enhance clustering performance27. 
The gene selection strategy of scCAD involves merging and removing duplicates from the top 2000 HVG and the top 2000 HIG.

To demonstrate the effectiveness of this strategy, we assess whether the genes selected by scCAD encompass genes specific to rare cell types. Specifically, we first apply Wilcoxon's rank sum test to identify the top 50 differentially expressed (DE) genes for each rare cell type in the dataset, which are commonly utilized to indicate the type's differential signals31,32. Then we collect these genes to form a reference gene set \(S_{\mathrm{ref}}\), which is regarded as having rare cell type-specific signals. Let \(S_{\mathrm{select}}\) be the selected gene set. We define three overlap rates, OR1, OR2, and OR3, using the following formulas: \(\mathrm{OR1}=\frac{|S_{\mathrm{ref}}\cap S_{\mathrm{select}}|}{|S_{\mathrm{ref}}|}\times 100\%\), \(\mathrm{OR2}=\frac{|S_{\mathrm{ref}}\cap S_{\mathrm{select}}|}{|S_{\mathrm{ref}}\cup S_{\mathrm{select}}|}\times 100\%\), \(\mathrm{OR3}=\frac{|S_{\mathrm{ref}}\cap S_{\mathrm{select}}|}{|S_{\mathrm{select}}|}\times 100\%\). A higher overlap rate indicates a stronger presence of rare cell differences in the selected features. We compare scCAD with the two individual strategies across all datasets (Supplementary Table 8). To maintain fairness, we keep the number of features for highly variable genes (HVG) and highly important genes (HIG) the same as the number of features ultimately selected by scCAD. Supplementary Table 8 shows the overlap rates OR1, OR2, and OR3 between the reference gene set and the results of the three gene selection strategies across all datasets. Supplementary Fig. 3a shows the distribution of the overlap rate OR1 between genes selected by the three strategies and the reference genes of rare cell types across all datasets.

As shown in Supplementary Table 8 and Supplementary Fig. 
3a, the overlap rate OR1 reveals that, on average, 86.75% of genes in the reference gene set are present in the genes selected by scCAD, while the corresponding average rates for HVG and HIG are 67.80% and 80.20%, respectively. This indicates that when selecting the same number of genes, scCAD can effectively preserve the majority of rare cell type-specific genes. Due to the significant difference in the number of genes between \(S_{\mathrm{ref}}\) and \(S_{\mathrm{select}}\), the values of OR2 and OR3 are low relative to those of OR1. For the metric OR2, the average values for HVG, HIG, and scCAD are 2.34%, 2.74%, and 2.99%, respectively. For the metric OR3, the average values for HVG, HIG, and scCAD are 2.39%, 2.77%, and 3.02%, respectively. These results further illustrate that the gene set selected by scCAD contains a sufficient presence of reference genes.

To further analyze the potential impact of clustering accuracy on the reliability of genes selected by the random forest model, we first investigated the Adjusted Rand Index (ARI) of the clustering results used for the model (Supplementary Fig. 4). Supplementary Fig. 4 shows that the clustering results by Louvain indeed exhibit lower accuracy in some datasets, with ARIs around 0.2.

By combining Supplementary Fig. 4 and Supplementary Table 8, we find that the accuracy of the clustering results has a minor impact on the genes selected by scCAD. Using the Chung dataset as an example, even though the ARI is only 0.16, the genes selected by both the RF model and scCAD still encompass over 75% of the rare cell type-specific DE genes. This observation can be attributed to the inherent tendency of most rare cells of the same type to cluster together. Additionally, compared to the sole utilization of the RF model, scCAD demonstrates better robustness due to its combination of two feature selection strategies. 
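The gene-merging step and the three overlap rates discussed in this section reduce to a few set operations. The following sketch assumes that ranked gene lists are already available (for example, from a variance ranking and a random forest importance ranking); the function names and the toy gene identifiers are hypothetical and serve only to illustrate the formulas for OR1, OR2, and OR3.

```python
def merge_gene_lists(hvg_ranked, hig_ranked, top_k=2000):
    """Union of the top_k genes from each ranking, duplicates removed.

    dict.fromkeys keeps the first occurrence of each gene, so the
    original ranking order is preserved in the merged list.
    """
    combined = list(hvg_ranked[:top_k]) + list(hig_ranked[:top_k])
    return list(dict.fromkeys(combined))


def overlap_rates(s_ref, s_select):
    """Return OR1, OR2, and OR3 (in percent) for two gene sets."""
    s_ref, s_select = set(s_ref), set(s_select)
    if not s_ref or not s_select:
        raise ValueError("both gene sets must be non-empty")
    shared = len(s_ref & s_select)
    return {
        "OR1": 100.0 * shared / len(s_ref),             # share of reference genes retained
        "OR2": 100.0 * shared / len(s_ref | s_select),  # Jaccard-style overlap
        "OR3": 100.0 * shared / len(s_select),          # share of selected genes in reference
    }


# Toy example with hypothetical gene names and a small top_k.
selected = merge_gene_lists(["g1", "g2", "g3"], ["g3", "g4", "g5"], top_k=2)
rates = overlap_rates({"g2", "g4", "g9"}, selected)
```

The example also makes visible why OR2 and OR3 are much smaller than OR1 in practice: their denominators grow with the size of the selected gene set, which is far larger than the reference set.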
Using the Goolam dataset as an example, while the genes selected by the RF model cover only 28% of the rare cell type-specific DE genes, scCAD's selection encompasses 61% of these DE genes.

Decomposition effectively isolates clusters dominated by rare cell types

Clusters are commonly annotated based on the primary gene expression patterns of the cells they contain, which represent the characteristics of the most dominant cell type. For each cluster, we first count the number of cells of each cell type contained in the cluster based on the annotation information. Then, we identify the cell type with the highest cell count as the dominant type within the cluster. The occupancy rate of the dominant cell type in cluster i is defined as follows: \(P_i=\frac{\max(N_{i,1},N_{i,2},\ldots,N_{i,t})}{N_i}\), where \(N_{i,j}\) is the number of cells of type j in cluster i, t is the total number of cell types contained in cluster i, and \(N_i\) is the total number of cells in cluster i. A higher rate indicates increased cluster purity, implying that more cells within the cluster belong to the same cell type. For one cell type j in one dataset, the proportion of the cell type can be calculated as \(\mathrm{mean}(P_{j,1},P_{j,2},\ldots,P_{j,l_j})\), where \(P_{j,x}\) is the occupancy rate of cell type j in the dominated cluster x and \(l_j\) is the number of clusters dominated by cell type j. Subsequently, for each dataset, we separately average the proportions of all cell types and of rare cell types. To demonstrate the improvement, we compare the average proportions of the clusters from M-clusters with those from I-clusters across all datasets (Supplementary Table 9). For a more intuitive representation, we visually present the comparison results of rare types and all types across all datasets (Supplementary Fig. 3b and Supplementary Fig. 
5), respectively.

After cluster decomposition and merging, the average proportion of cell types within their dominant clusters has significantly increased, especially for rare types, with an average increase from 0.283 to 0.704. Notably, in almost half of the datasets, the initial clustering process fails to identify any rare cell types. In addition, in Supplementary Fig. 3b and Supplementary Table 9, we observe that the average proportions of rare cell types and all cell types in M-clusters are almost always higher than those in I-clusters. However, Supplementary Fig. 3b also shows that, in the Pollen dataset, neither I-clusters nor M-clusters contain a cluster dominated by the only rare cell type present in the data. This poor result may be due to the poor separability of this type: almost all methods fail to identify the rare cell type in this dataset, as can be seen in Supplementary Table 2.

To further explore the reliability of cluster decomposition, we investigate the distribution of cells from rare cell types across multiple clusters identified by scCAD. Specifically, using annotation information from the original studies of the datasets, we assess the distribution of cells from all involved rare cell types across clusters at both the initial stage (I-Clusters) and the final stage (M-Clusters).

For I-Clusters containing m clusters, we first calculate the proportion \(p_{t,i}\) of cells of the rare cell type t in cluster i relative to all cells of type t as follows: \(p_{t,i}=\frac{n_t^i}{n_t}\times 100\%\), where \(n_t^i\) is the number of cells of type t in cluster i, and \(n_t\) is the total number of cells of type t in the dataset. It is clear that \(\sum_{i=1}^{m}p_{t,i}=100\%\), where m is the number of clusters in I-Clusters. 
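The quantities defined so far, the rate \(P_i\) of the dominant cell type in a cluster and the distribution \(p_{t,i}\) of one cell type across clusters, can both be derived from a cluster-by-type count table. The sketch below is a minimal illustration using only per-cell cluster labels and annotated types; the function names and the toy labels are assumptions of this example, and ties for the dominant type are resolved arbitrarily by `Counter.most_common`.

```python
from collections import Counter, defaultdict


def cluster_type_counts(cluster_labels, type_labels):
    """Count cells of each annotated type within each cluster."""
    counts = defaultdict(Counter)
    for cluster, cell_type in zip(cluster_labels, type_labels):
        counts[cluster][cell_type] += 1
    return counts


def dominant_types(counts):
    """Map each cluster i to (dominant type, P_i = max_j N_ij / N_i)."""
    result = {}
    for cluster, counter in counts.items():
        cell_type, n_max = counter.most_common(1)[0]
        result[cluster] = (cell_type, n_max / sum(counter.values()))
    return result


def type_distribution(counts, cell_type):
    """Return p_{t,i} in percent: share of type t cells found in cluster i."""
    total = sum(counter[cell_type] for counter in counts.values())
    if total == 0:
        raise ValueError(f"no cells of type {cell_type!r}")
    return {
        cluster: 100.0 * counter[cell_type] / total
        for cluster, counter in counts.items()
    }


# Toy labels: six cells, three clusters, one rare type "R".
clusters = [0, 0, 0, 1, 1, 2]
types = ["A", "A", "R", "R", "R", "A"]
counts = cluster_type_counts(clusters, types)
dominant = dominant_types(counts)
p_rare = type_distribution(counts, "R")
```

By construction, the values returned by `type_distribution` sum to 100% over all clusters, mirroring the identity stated above.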
Then, we calculate the proportion \(q_{t,i}\) of cells of the rare cell type t in cluster i relative to all cells in cluster i as follows: \(q_{t,i}=\frac{n_t^i}{n_i}\times 100\%\), where \(n_t^i\) is the number of cells of type t in cluster i, and \(n_i\) is the total number of cells in cluster i. We also calculate these two proportions, \(p_{t,i}\) and \(q_{t,i}\), in each cluster from M-Clusters.

We sort the proportions \(\{p_{t,1},p_{t,2},\ldots,p_{t,m}\}\) in descending order and select the top ten clusters. Then, we conduct a joint analysis of the \(p_{t,i}\) and \(q_{t,i}\) of these selected clusters. Supplementary Figs. 6–10 show the comparison of \(p_{t,i}\) and \(q_{t,i}\) across the selected clusters from I-Clusters and M-Clusters. As shown in Supplementary Figs. 6–10, the majority of cells from rare cell types (the average of \(p_{t,i}\) is 88%) are found in the same cluster obtained by Louvain at the first stage (I-Clusters) across almost all datasets. Moreover, distinguishing rare cell types from other types during the initial clustering proves to be relatively challenging, with a low median proportion relative to clusters (I-Clusters, the median of \(q_{t,i}\) is 18%). After decomposition and merging, the majority of cells from rare cell types (M-Clusters, the average of \(p_{t,i}\) is 79%) remain in the same cluster. Simultaneously, the proportion of cells of the same rare type relative to clusters significantly increases (M-Clusters, the median of \(q_{t,i}\) is 81%). Using the analysis results of the Cao dataset as an example (Supplementary Fig. 6), we find that the cells of the rare cell type are distributed across six clusters. Approximately 69% of these rare cells are concentrated in the first cluster. The remaining 31% of rare cells are distributed across the other five clusters. 
After decomposition and merging, the proportion of rare cells in the first cluster is slightly reduced to about 56%, but this cluster exclusively comprises cells of this type (\(q_{t,i}\) = 100%).

Overall, although rare cell types may be distributed across multiple clusters, scCAD can effectively isolate the majority of cells for almost all rare cell types in one cluster, which lays the foundation for the subsequent identification of rare cell clusters.

Evaluation of robustness and sensitivity of scCAD

To analyze the robustness and sensitivity of scCAD with respect to the number of differentially expressed (DE) genes, we conduct tests using an artificial scRNA-seq dataset and a Jurkat scRNA-seq dataset. The artificial scRNA-seq dataset comprises 2500 cells and two cell types, with the minor cell type representing approximately 1% of the total population. Further details regarding the generation of this dataset can be found in the "Methods" section. The Jurkat dataset consists of an equal-proportion in vitro mixture of 293T and Jurkat cells33. This dataset has been utilized in several previous studies12,13,14,15,34 to simulate the rare cell phenomenon by adjusting the proportion of Jurkat cells. We generate a subsampled dataset by adjusting the proportion of Jurkat cells to 1%. For both datasets, we set aside the pre-identified differentially expressed (DE) genes, which are selected through a stringent criterion, and retain all the non-DE genes in the dataset. Additional details about the identification of DE genes and non-DE genes in both datasets can be found in the "Methods" section.

Considering the computational efficiency of the algorithms, we compare scCAD with three rare cell detection algorithms: FiRE, GapClust, and GiniClust3. During each iteration of the experiment, an equivalent number of non-DE genes are substituted with randomly selected pre-identified DE genes. This process is repeated 10 times for each number of DE genes. 
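The substitution scheme just described can be sketched as follows. This is an assumption-laden illustration rather than the authors' exact procedure: the function names are hypothetical, gene names are assumed to be unique, and the non-DE genes that are replaced are drawn at random, which the text does not specify.

```python
import random


def substitute_de_genes(non_de_genes, de_genes, n_de, rng):
    """Replace n_de non-DE genes with n_de randomly chosen DE genes.

    The returned panel has the same size as non_de_genes. Which non-DE
    genes are dropped is chosen at random (an assumption of this sketch).
    """
    if n_de > len(de_genes) or n_de > len(non_de_genes):
        raise ValueError("n_de exceeds the number of available genes")
    chosen_de = rng.sample(list(de_genes), n_de)
    dropped = set(rng.sample(list(non_de_genes), n_de))
    kept = [gene for gene in non_de_genes if gene not in dropped]
    return kept + chosen_de


def repeated_panels(non_de_genes, de_genes, n_de, repeats=10, seed=0):
    """Build one gene panel per repetition for a given number of DE genes."""
    rng = random.Random(seed)
    return [
        substitute_de_genes(non_de_genes, de_genes, n_de, rng)
        for _ in range(repeats)
    ]


# Toy example with hypothetical gene identifiers.
non_de = [f"bg{i}" for i in range(100)]
de = [f"de{i}" for i in range(20)]
panels = repeated_panels(non_de, de, n_de=5)
```

Each panel would then be used to subset the expression matrix before running a detection method, and the resulting F1 scores would be averaged over the repetitions for that number of DE genes.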
The average F1 score across iterations of different methods is compared for each count of DE genes (Supplementary Fig. 11).

As shown in Supplementary Fig. 11, all methods struggled to detect the rare cell type with only a few DE genes, consistent with previous studies12,15. However, scCAD progressively segregates cells of rare types from clusters through iterative clustering, thereby reducing its reliance on differential genes and potentially capturing rare cell types with low signals more effectively. As a result, with the introduction of more DE genes, scCAD's performance improved significantly, enabling more precise identification of rare cell types compared to its competitors. FiRE and GapClust require more DE genes to achieve a similarly stable prediction result. Among them, only GapClust can achieve the identification accuracy of scCAD when an adequate number of DE genes are utilized. GiniClust3 achieves stability with a similar number of DE genes as scCAD in the Jurkat dataset. However, compared to scCAD, its predictive performance is lower, with an F1 score of approximately 0.6. Additionally, its performance on simulated data is poor. In summary, scCAD excels even in scenarios with weak differential expression signals among cell types, enabling precise identification of rare cell types and highlighting its robustness.

scCAD enables the identification of rare airway epithelial cell types

The airways of the lungs are a prominent site for diseases such as asthma, where rare cells play pivotal roles in maintaining airway function35. Montoro et al.36 utilized scRNA-seq to examine the cellular composition and hierarchy of mouse tracheal epithelium, providing the expression profile of 7193 cells. They discovered seven cell types, including two rare ones: the Foxi1+ lung Ionocyte and Goblet cells. They detected the rare Ionocyte in human bronchi using RNA fluorescent in situ hybridization.

We apply scCAD to identify rare airway epithelial cell types. 
t-distributed Stochastic Neighbor Embedding (t-SNE) serves as a visualization tool for observing the distribution of the cells from various annotated cell types and the distribution of rare cells predicted by scCAD (Fig. 3a, b). The visualization result of Uniform Manifold Approximation and Projection (UMAP)37 is shown in Supplementary Fig. 24.

Fig. 3: Visualization analysis of scCAD's results in airway epithelial. a The t-SNE-based 2D embedding of the cells with color-coded identities. Ionocytes and Goblet cells are specifically marked with circles. b The three rare cell clusters detected by scCAD are visually distinguished using different colors. c Violin plots showing the expression distribution of the most differentially up-regulated genes in each identified cell cluster. Additionally, seven annotated cell types reported by Montoro et al. are used for comparison. Genes within the same cell cluster are indicated with the same color. d The expression of all genes differentially up-regulated in cluster R1 is examined across all cell types, including cluster R1 itself. Source data are provided as a Source Data file.

scCAD identifies a total of three rare cell clusters, denoted as R1 (0.42%), R2 (0.26%), and R3 (0.57%) (Fig. 3b). To verify the true identity of these identified clusters, we first obtain the rare cell type annotation information from Montoro et al.'s original study. Then, we compare the expression of differentially up-regulated genes in the annotated rare cell types with those in the identified cell clusters. Specifically, we use Wilcoxon's rank sum test to identify differentially up-regulated genes with an FDR cutoff of 0.05 and an inter-group fold-change cutoff of 1.5 for each cluster and each annotated cell type, separately. Assume that \(S_i\) is the set consisting of differentially up-regulated genes in identified cluster i, and \(S_j\) is the set consisting of differentially up-regulated genes in annotated cell type j. 
The Jaccard similarity coefficient between these two gene sets can be calculated as \(J_{i,j}=\frac{|S_i\cap S_j|}{|S_i\cup S_j|}\). We calculated the similarity between all identified clusters and annotated cell types (Supplementary Table 10). For better visualization, we use the top 10 differentially up-regulated genes for each cluster and compare the identified rare cell clusters with annotated cell types based on the expression distribution of these genes (Fig. 3c). As shown in Supplementary Table 10 and Fig. 3c, clusters R2 and R3 correspond to Ionocytes and Goblet cells, respectively. These two cell types, as indicated in Montoro et al.'s annotation, encompass only 0.90% and 0.36% of cells in the dataset, respectively. The top 50 differentially up-regulated genes in each cluster are detailed in Supplementary Data 1, and we discover that cells within the R2 cluster exhibit classic Ionocyte markers, such as the transgenic Foxi1-EGFP, the V-ATPase-subunit gene Atp6v0d2, the cystic fibrosis transmembrane conductance regulator (Cftr) gene, the transcription factor Ascl3, and Smbd1 (formerly known as Gm933)36,38. Cells within the R3 cluster exhibit classic markers associated with Goblet-1, a subset of Goblet cells as given in Ref. 34. This cluster is enriched for the expression of genes encoding the key mucosal protein (Tff1) and secretory regulator (e.g., Lman1l). The visualization results and analysis of other methods are given in Supplementary Fig. 12 and Supplementary Note 1, demonstrating that only scCAD can accurately and simultaneously identify Ionocyte and Goblet cells.

In contrast to the other two clusters, cluster R1, which consists of 30 cells annotated as Club cells, does not have a corresponding annotated cell type. We visualize the expression of all genes that are specifically up-regulated in cluster R1 across both cluster R1 and all other cell types (Fig. 3d). As shown in Fig. 
3d, these genes do not show significant expression in other cell types. Interestingly, we note that R1 shares striking similarities with the "hillock" cells identified by Montoro et al. in their analysis of cell differentiation trajectories. These rare transitional cells connect Basal to Club cells through the unique expression of Krt13 and Krt4 39. Deprez et al. described a population of Krt13+ cells in the turbinates, indicating that hillock cells may also exist in other regions of the human respiratory tract40,41.

scCAD identifies Foxi1+ pulmonary ionocytes, hillock cells, and goblet-1 cells, all of which are confirmed by Montoro et al.36 through immunostaining. Specifically, they confirm that ionocytes are a newly identified cell population in vivo using transgenic Foxi1-EGFP reporter mice and Foxi1 immunoreactivity. They observe immunofluorescence on epithelial tissues, infer trajectories of cell differentiation, and validate the existence of hillock cells. They identify unique Tff2+ goblet-1 cells by immunostaining.

scCAD identifies various rare cell subpopulations within the mouse brain

In general, the identification of rare cell types becomes more challenging as the dataset encompasses a larger number of cell types, particularly in datasets with multiple cell subtypes7. To demonstrate the effectiveness of scCAD in identifying rare cell subtypes in such datasets, we utilize an existing scRNA-seq dataset including 20,921 cells located in and around the hypothalamic arcuate-median eminence complex (Arc-ME)42. This dataset, as indicated in the original annotation, encompasses 36 cell subtypes, with 20 of them being considered rare cell subtypes, accounting for proportions ranging from 0.038% to 0.884%. t-SNE is applied to visualize the distribution of the cells (Fig. 4a). The visualization result of UMAP is shown in Supplementary Fig. 24. 
For a more intuitive comparison, the cells belonging to rare cell subtypes are color-coded to represent their respective identities in the t-SNE-based 2D embedding (Fig. 4b). scCAD identifies a total of seven rare cell clusters, denoted as R1 (0.87%), R2 (0.63%), R3 (0.40%), R4 (0.50%), R5 (0.12%), R6 (0.11%), and R7 (0.17%) (Fig. 4c). Due to the small number of significantly differentially expressed genes identified in this dataset, we utilize all differentially expressed genes rather than just the up-regulated ones. The Jaccard similarity coefficients between the sets of differentially expressed genes for each cell cluster and each cell type are shown in Supplementary Table 11. For better visualization, we select the top 3 differentially expressed genes for each cluster and compare the identified rare cell clusters with annotated cell subtypes based on the expression distribution of these genes (Fig. 4d). As shown in Supplementary Table 11 and Fig. 4d, clusters R1 to R7 identified by scCAD are each highly similar to one of seven minor cell subtypes reported by the original study. Among them, cells within the R1 cluster exhibit gene expression patterns similar to the rare cell subtype annotated as s27.oligodendrocyte6 42. The differentially expressed genes in each cluster are detailed in Supplementary Data 2, and we discover that the expression of several characteristic markers in R1 is associated with a subtype of oligodendrocyte known as NFO (newly formed oligodendrocytes)43. NFO represents a distinct stage of oligodendrocyte differentiation. Cluster R1 shows characteristic markers including Fyn. Additionally, it shows high expression of Gpr17 44, which is involved in oligodendrocyte differentiation, and epigenetic factors such as Sirt2, which are also highly transcribed in NFO. Clusters R2 and R7, which include markers such as Dcn, Sparc, and Igfbp7 45,46,47, show a high degree of similarity to two distinct fibroblast subtypes. 
Cluster R3 shows a high degree of similarity to Astrocytes, including markers Sparcl1, Slc1a3, Slc1a2, Slc6a11, Glul, and Apoe48,49. Cluster R4 shows a high degree of similarity to a subtype of pars tuberalis type 1C, including marker Cyp2f2. Cluster R5 shows a high degree of similarity to mural cells, including markers myosin light polypeptide 9 regulatory (Myl9), and myosin light polypeptide kinase (Mylk)50. Cluster R6 closely matches a subtype of neurons from the retrochiasmatic area that highly expresses the Oxt gene. Additionally, the visualization results and analysis of other methods are given in Supplementary Fig. 13 and Supplementary Note 2, showing that only scCAD can accurately identify the greatest number of rare cell subtypes without any misidentifications.

Fig. 4: Visualization analysis of scCAD's results in mouse brain. a The t-SNE-based 2D embedding of the cells with color-coded identities. b The t-SNE-based 2D embedding of the cells. The cells in rare cell subtypes are color-coded to indicate their identities. c The seven rare cell clusters identified by scCAD are visually distinguished using different colors. d Violin plots showing the expression distribution of the most differentially expressed genes in each identified cell cluster. Additionally, 36 annotated cell types reported by Campbell et al. are used for comparison. Genes within the same cell cluster are indicated with the same color. Cell clusters that have been identified, along with their corresponding cell subtypes, are marked with an asterisk of the same color. Source data are provided as a Source Data file.

scCAD identifies various rare cell types in the crypts of the irradiated mouse intestine

The intestinal epithelium contains various rare cell types, including tuft cells and enteroendocrine cells51. Ayyaz et al. 
conducted scRNA-seq to profile the regenerating mouse intestine and discovered a distinct quiescent cell type called revival stem cell (revSC)52, which is induced by tissue damage. They validate the rarity of this cell type by using single-molecule fluorescence in situ hybridization (smFISH) for Clu expression in non-irradiated small intestines. Whether it is possible to concurrently detect rare cell types, radiation-induced cell types, and revSCs in enriched crypts after irradiation (IR) is an interesting problem. To solve this problem, we utilize scCAD to analyze an existing scRNA-seq dataset containing 6644 single-cell transcriptomes of isolated crypts52. Ayyaz et al. reported a total of 19 cell clusters. Among them, the 9th and 10th clusters correspond to Enteroendocrine cells, the 18th cluster corresponds to newly discovered revSC, and the 19th cluster corresponds to Tuft cells.scCAD identifies a total of six rare cell clusters, denoted as R1 (0.90%), R2 (0.50%), R3 (0.63%), R4 (0.56%), R5 (0.21%), and R6 (0.69%) (Fig. 5a). Ayyaz et al. did not annotate the real cell types in this dataset and only provided the most differentially expressed genes for each cluster they reported. Therefore, we calculate the Jaccard similarity coefficient between the set of differentially expressed genes for each identified cell cluster (R1~R6) and each reported cluster (Cluster1~Cluster19) as provided by Ayyaz et al. (Supplementary Table 12). For better visualization, we use the top 10 differentially up-regulated genes for each reported cluster and compare the identified rare cell clusters with reported clusters on the expression distribution of these genes (Fig. 5b).Fig. 5: Visualization analysis of scCAD’s results in mouse intestine.a The t-SNE-based 2D embedding of the cells. The rare cell clusters identified by scCAD are visually distinguished using six distinct colors. 
b Expression of the top 10 differentially up-regulated genes from four reported clusters in the cell clusters identified by scCAD. c The expression of differentially up-regulated genes in cluster R2 across all cells. Source data are provided as a Source Data file.

As shown in Supplementary Table 12 and Fig. 5b, clusters R3, R5, and R6 identified by scCAD are similar to the 9th and 10th clusters, which correspond to Enteroendocrine cells. Clusters R1 and R4 are similar to the 19th and 18th clusters, which correspond to Tuft cells and revSC, respectively. The top 50 differentially up-regulated genes in each cluster we identified are detailed in Supplementary Data 3. The corresponding cell type markers, such as Dclk1, Trpm5, Rgs13 [53], and Chga [54], are differentially up-regulated in cells within the R1, R3, R5, and R6 clusters. Cluster R4 exhibits gene expression characteristics similar to revSCs, indicating its potential classification as a rare subtype. Notably, none of the clusters reported by Ayyaz et al. is similar to cluster R2. Consequently, we conduct a more in-depth analysis of the expression of differentially up-regulated genes in cluster R2 across all cells (Fig. 5c).

As shown in Fig. 5c, these genes do not show significant expression in other cells. Querying the PanglaoDB database [55] for cell type markers shows that a substantial portion (21%) of the differentially up-regulated genes in cluster R2 correspond to macrophage markers, including CD14 and CD68 [56]. Given the potential association of these rare macrophages with radiation exposure, we conduct additional analysis on other differentially expressed genes and identify NCF2, NCF4, CYBB, and CYBA among them. These genes have been observed to exhibit differential expression in the lungs of mice following exposure to IR [57]. They play a crucial role in macrophage activation and polarization towards the M2 subtype. 
Furthermore, the presence of these macrophages indicates alterations in the inflammatory profile of the irradiated lung tissue58.scCAD identifies various rare cell types in the human pancreasThe human pancreas comprises various rare cell types such as Epsilon cells59. To evaluate the performance of scCAD, we conduct tests on a dataset of 8569 cells from the human pancreas60. This dataset encompasses 14 cell types annotated in the original study, with 5 of them being considered rare cell types, accounting for proportions ranging from 0.082% to 0.642%. t-SNE is applied to visualize the distribution of the cells (Fig. 6a). The visualization result of UMAP is shown in Supplementary Fig. 24. The cells belonging to rare cell types are color-coded to represent their respective identities (Fig. 6b). scCAD identifies a total of four rare cell clusters, denoted as R1 (0.56%), R2 (0.16%), R3 (0.18%), and R4 (0.33%) (Fig. 6c). The Jaccard similarity coefficients between the sets of differentially up-regulated genes for each cell cluster and each cell type are shown in Supplementary Table 13. For better visualization, we select the top 10 significantly differentially up-regulated genes for each cluster and compare the identified rare cell clusters with annotated cell types based on the expression distribution of these genes (Fig. 6d).Fig. 6: Visualization analysis of scCAD’s results in human pancreas.a The t-SNE-based 2D embedding of the cells is presented, with color-coded identities indicating cell types. b Cells belonging to rare cell types are also color-coded. c The rare cell clusters identified by scCAD are visually distinguished using four distinct colors. d Violin plots showing the expression distribution of the most differentially expressed genes for the four identified cell clusters. Genes within the same cell cluster are indicated with the same color. e The expression of differentially up-regulated genes in beta cells and cells belonging to cluster R1. 
Source data are provided as a Source Data file.

As shown in Supplementary Table 13 and Fig. 6d, clusters R2, R3, and R4 correspond to Epsilon cells, Schwann cells, and Mast cells, respectively. The top 50 differentially up-regulated genes in each cluster are detailed in Supplementary Data 4. The differentially up-regulated genes include distinctive markers associated with these rare cell types, including GHRL [61], NGFR, SOX10 [62], KIT, and HDC [63]. In contrast to these clusters, cluster R1, which consists of 48 cells annotated as Beta cells, does not have a corresponding annotated cell type. On further examination, we compare the expression of the differentially up-regulated genes in R1 with that in the other cells annotated as Beta cells (Fig. 6e).

As shown in Fig. 6e, these genes do not show significant expression in other Beta cells. We find that the cells in R1 represent a variant of the Beta cells described by Baron et al. [60]. This variant is characterized by variable expression of genes associated with Beta cell function, such as HERPUD1, HSPA5, and DDIT3 [64], which are involved in the endoplasmic reticulum stress response. Baron et al. pointed out that further work is required to characterize this Beta cell variant. The visualization results and analysis of other methods are given in Supplementary Fig. 14 and Supplementary Note 3. The visualization results clearly show that only scCAD can identify the rare Epsilon cells.

scCAD can identify known rare cell types in large-scale immunological single-cell datasets

To assess scCAD’s ability to detect rare cell types and subtypes in larger single-cell datasets, we collect two immunological datasets. One dataset contains 73,259 T cells from 8 human donors [65], and the other contains 39,563 gastrointestinal immune cells from 10 Crohn’s disease patients [66]. Both are comprehensive and well annotated by the original studies, which makes the identification results of scCAD more interpretable. 
We use t-SNE to visualize the cell distribution for both datasets (Fig. 7a, b), with cell subtypes color-coded to show their identities. The visualization result of UMAP is shown in Supplementary Fig. 24. To visualize the rare cell types in both datasets, we highlight cell types containing less than 1% of the cells in the T cell dataset (Fig. 7c) and immune cell dataset (Fig. 7d).

Fig. 7: Visualization analysis of scCAD’s results in two large-scale immunological single-cell datasets. a The t-SNE-based 2D cell embedding with color-coded identities for cell types in the T cell dataset. CTL cytotoxic T cells, TCM T Central Memory, TEM T Effector Memory, dnT double-negative T, gdT gamma-delta T, Treg regulatory T. b The t-SNE-based 2D cell embedding with color-coded identities for cell types in the immune cell dataset. DC Dendritic Cell, ILC innate lymphoid cells, T(gd) gamma-delta T, TFH T follicular helper, Tregs regulatory T cells, TRM tissue-resident memory T cells. c Cell types comprising less than 1% of cells in the T cell dataset are color-coded. d Cell types comprising less than 1% of cells in the immune cell dataset are color-coded. e The rare cell clusters identified by scCAD are visually distinguished using two distinct colors on the T cell dataset. f The rare cell clusters identified by scCAD are visually distinguished using four distinct colors on the immune cell dataset. Source data are provided as a Source Data file.

scCAD identifies two rare cell clusters in the T cell dataset (Fig. 7e): R1 (0.21%) and R2 (0.22%). R1 primarily consists of two types of proliferating cells, CD4 and CD8, both of which are scarcely annotated in the dataset (0.15% and 0.12%, respectively). R2 is mainly composed of double-negative T cells (dnT), which are relatively rare in humans and mice (1~5% of all T cells) [67]. In the immune cell dataset, scCAD identifies four rare cell clusters (Fig. 
7f): R1 (0.29%), R2 (0.26%), R3 (0.27%), and R4 (0.07%).Cluster R1 predominantly consists of mast cells, R2 predominantly consists of pericytes and smooth muscle cells, R3 predominantly consists of lymphocytes, and R4 predominantly consists of glial cells. It’s worth noting that these cell types are the top five rarest annotated in this data.scCAD identifies various unannotated rare cell subtypes in the clear cell renal cell carcinoma datasetRenal cell carcinomas (RCCs) are a diverse group of malignancies believed to originate from kidney tubular epithelial cells. Various RCC subtypes exhibit a broad range of histomorphology, proteogenomic alterations, immune cell infiltration patterns, and clinical behaviors. The most prevalent subtype is clear cell renal cell carcinoma (ccRCC). We collected a total of 6046 cells annotated into 26 cell clusters from benign adjacent kidney tissues (6 samples from 5 patients) and a total of 20,748 cells annotated into 13 cell types from 7 ccRCC samples68. Both of them are utilized to assess the effectiveness of scCAD in the complex tumor microenvironment. The cell type annotation information originates from their original studies. As the visualization results of t-SNE are less discriminative for cell types in these two datasets, we visualize the datasets and their respective annotated rare cell types using the 2D UMAP embedding results (Fig. 8a, b, d, e). The visualization results of t-SNE are shown in Supplementary 20.Fig. 8: Visualization analysis of scCAD’s results in clear cell renal cell carcinoma dataset.a UMAP-based 2D visualization depicts cells from the benign kidney, with distinct cell types represented by different color codes. b Cell types comprising less than 1% of cells are color-coded. c The rare cell clusters identified by scCAD are visually distinguished using twelve distinct colors. d UMAP-based 2D visualization depicts cells from the ccRCC, with distinct cell types represented by different color codes. 
e Cell types comprising less than 1% of cells are color-coded. f The rare cell clusters identified by scCAD are visually distinguished using seven distinct colors. g–i Comparing the expression of differentially expressed genes in the identified rare cell cluster and other cells annotated as the same type, from left to right: R4 (g), R5 (h), R7 (i). AEA-DVR afferent/efferent arterioles/descending vasa recta, AVR ascending vasa recta, CNT connecting duct, DCT distal convoluted tubule, DL descending limb, GC glomerular capillaries, IC intercalated cells, PC principal cells, Macro macrophages, Mono monocytes, NK natural killer cells, Peri pericytes, Podo podocytes, PT Proximal tubule, tAL thin ascending limb, TAL thick ascending limb, ua unanalyzed, UC uncharacterized, vSMC vascular smooth muscle cells, Endo endothelial. Source data are provided as a Source Data file.In the benign kidney data, scCAD identifies a total of 12 rare cell clusters (0.26%~0.86%) (Fig. 8c). Upon comparing the detailed annotation information, we discover that the dominant cell types of these clusters encompass multiple rare cell types. For instance, cluster R5 primarily consists of B cells, while R9 is mainly composed of mesangial cells. Notably, scCAD identifies two rare proximal tubule (PT) cell subtypes reported by previous studies68,69, namely PT-B (R12) and PT-C (R1). Zhang et al. confirmed the presence of these two subtypes of cells using RNA in situ hybridization (RNA-ISH) on independent benign kidney tissue samples with select markers.In the ccRCC data, scCAD identifies a total of 7 cell clusters (0.10%~0.56%) (Fig. 8f). In addition to CD8+ T cells (R1, R6), mast cells (R2), and plasma cells (R3) annotated as rare cell types, scCAD also identifies three rare cell clusters (R4, R5, and R7). By further examination, we observed differential up-regulated genes within these clusters and compared these cells to other cells of the same annotated type (Fig. 
8g–i).Cluster R4 is annotated as T cells. The top 50 differentially up-regulated genes in these three cell clusters are detailed in Supplementary Data 5. We find that cells in R4 should belong to a rare subtype of effector CD4+ T, named CD4+ effector-GNLY, characterized by high expression of genes associated with cytotoxicity, including NKG7, GZMB, GZMH, and GNLY, as given in a previous study70.Cluster R5 cells are initially annotated as macrophages. However, we identify multiple markers for dendritic cells, such as CD1C, CD207, and FCER1A71. Interestingly, Kaplan-Meier analyses of the top 10 differentially expressed genes in R5 reveal an association between high expression levels and increased overall survival in ccRCC (TCGA-KIRC). As shown in Supplementary Fig. 15, high expression of these genes is a positive survival indicator, suggesting that the rare cluster identified by scCAD may provide valuable prognostic information for ccRCC patients.Cluster R7 comprises 81 cells from the 239 cells annotated as “ua” (Unanalyzed). However, we observe that its differentially expressed genes are all related to hemoglobin, including AHSP, HBD, and HEMGN, indicating that this rare cell cluster may be related to hemoglobin synthesis or related biological processes. From the list of highly expressed genes (in reads per kilobase per million transcripts) for each stage of erythroid differentiation72, we conclude that cells in cluster R7 are likely in the polychromatic erythroblast stage. Supplementary Fig. 16 illustrates individual cells color-coded on a 2D embedding plot derived from UMAP, reflecting the RNA expression levels of different marker genes. Overall, scCAD not only accurately identifies rare cell subtypes but also proves useful in correcting rare cell type annotation mistakes. 
Furthermore, it has great potential to identify disease-related immune cell subtypes, providing insights into disease progression.

Comparative performance of scCAD against multi-omics approach

The advancement in sequencing technology facilitates the integrative analysis of different types of single-cell omics data, providing insights that are more comprehensive than those from a single type of single-cell omics data [73]. This has the potential to enhance downstream analysis performance. However, this progress also presents challenges, including the introduction of noise due to batch effects among different omics data [74]. We compare scCAD, which is based solely on scRNA-seq data, with MarsGT [24], which integrates both scRNA-seq data and single-cell ATAC sequencing (scATAC-seq) data.

Specifically, we first compare scCAD and MarsGT on four real datasets (PBMC-bench-1, 2, 3, and PBMC-test) obtained from human peripheral blood mononuclear cells, which are the same datasets used by MarsGT. scCAD utilizes only the scRNA-seq data in each dataset, and the specific details of these datasets can be found in Supplementary Table 1. We report the performance of scCAD and MarsGT in identifying rare cell types on these datasets, as measured by F1 score, precision, and recall (Supplementary Table 14). As shown in Supplementary Table 14, scCAD performs slightly better than MarsGT in terms of F1 score and recall, most noticeably on the independent test dataset (PBMC-test), which is the dataset primarily used by MarsGT to illustrate its performance. Upon re-examination of these four datasets, we ascertain that they originate from a common dataset totaling 69,249 cells, with each dataset representing a distinct batch and displaying remarkably similar cell type distributions. In Supplementary Table 14, scCAD exhibits greater stability (F1 score standard deviation of 0.1101) compared to MarsGT (0.1747). 
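The evaluation metrics and the stability measure reported above can be sketched as follows. This is a minimal illustration, not the evaluation code used for Supplementary Table 14: the function name `precision_recall_f1`, the cell identifiers, and the per-dataset F1 values are all hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not state whether the population or sample form was used.

```python
from statistics import pstdev


def precision_recall_f1(predicted, truth):
    """Precision, recall and F1 for a set of cells predicted to be rare."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy example: four truly rare cells, four predicted, three shared.
truth = {"cell_1", "cell_2", "cell_3", "cell_4"}
predicted = {"cell_1", "cell_2", "cell_3", "cell_9"}
precision, recall, f1 = precision_recall_f1(predicted, truth)

# Stability across datasets: spread of the per-dataset F1 scores.
# These four values are invented and are NOT the scores in the paper.
f1_per_dataset = [0.5, 0.7, 0.9, 0.7]
stability = pstdev(f1_per_dataset)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
print(f"F1 standard deviation across datasets: {stability:.4f}")
```

A smaller standard deviation across batches drawn from the same source indicates that a method's performance depends less on which batch it is given.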
This difference may result from the effects of technical variations and noise often encountered in the integrative analyses of diverse single-cell omics data types. Further analysis of the identification results of scCAD on these four PBMC datasets can be found in Supplementary Note 4. The visualization analyses of scCAD’s results in these PBMC datasets are shown in Supplementary Figs. 17–20. Jaccard similarity coefficients between the sets of differentially up-regulated genes for each identified cell cluster and each real cell type in these datasets are detailed in Supplementary Tables 17–20. The genes that are differentially up-regulated in the identified cell clusters across these datasets are detailed in Supplementary Data 7–10. According to the cell type annotation information from their original studies, we find that scCAD not only identifies diverse minor cell types but also uncovers unannotated subtypes. Furthermore, scCAD consistently identifies the same minor cell types across datasets, showcasing its potential for analyzing multiple batches of datasets.Then, we test whether scCAD, using solely scRNA-seq data, could identify rare cell types in the two single-cell Multi-omics datasets employed in MarsGT’s case studies. The two datasets consist of 9383 cells from the mouse retina75 (Retina dataset) and 14,148 cells obtained from a flash-frozen intra-abdominal lymph node tumor (B_lymphoma dataset) (Supplementary Table 1).In the Retina dataset, MarsGT reported 12 rare cell clusters, comprising one amacrine cell (AC) cluster, seven bipolar cells (BC) clusters, one horizontal cell (HC) cluster, two Müller glia cell (MG) clusters, and one Rod cell cluster. In contrast, scCAD identifies more rare cell clusters (R1~R19, totaling 19 clusters). 
According to the annotations provided in [75], these clusters correspond to two AC clusters (R10, R13), one HC cluster (R8), four Rod cell clusters (R7, R9, R17, R18), six BC clusters (R1, R3, R11, R12, R14, R16), and six retinal ganglion cell (RGC) clusters (R2, R4, R5, R6, R15, R19). Given that BC populations are known to encompass numerous rare populations, we further investigated six clusters associated with BC. We visualize the expression of marker genes specific to the BC subpopulation across the six BC clusters (R1, R3, R11, R12, R14, R16) and the 10 BC subtypes annotated in [75] (Fig. 9a).Fig. 9: Visualization analysis of scCAD’s results in mouse retina dataset and human lymphoma dataset.a Violin plots showing the expression distribution of the known marker genes related to BC subtypes across the six identified BC clusters and annotated 10 BC subtypes. b The Pearson correlation heatmap between the 6 identified BC clusters and the 10 BC subtypes, is calculated based on the average expression values of BC marker genes. c The expression of enriched marker genes from 40 RGC subtypes is examined across all RGC-related clusters (R2, R4, R5, R6, R15, R19). d The t-SNE-based 2D embedding of the cells with color-coded identities in the lymphoma dataset. e The five rare cell clusters detected by scCAD are visually distinguished using different colors. BC bipolar cells, GEX gene expression, Mono monocytes, NK natural killer Cells, NKT natural killer T Cells, pDC plasmacytoid dendritic cells, Treg regulatory T. Source data are provided as a Source Data file.For better visualization, we compute the Pearson correlation coefficients between the six BC clusters and the 10 BC subtypes based on the average expression values of these marker genes and present them in a heatmap (Fig. 9b). As shown in Fig. 
9a, b, we observe that these clusters correspond to distinct BC subtypes, particularly R3, which represents the rarest BC subtype (BC10), accounting for only 3% of all BCs75, and was not identified by MarsGT. Additionally, RGCs also exhibit multiple subtypes76, prompting us to further analyze the six RGC clusters identified by scCAD.Rheaume et al.77 classify RGCs into 40 subtypes and validate the markers of these subtypes in purified RGCs by fluorescent in situ hybridization (FISH) and immunostaining. We compile a total of 115 uniquely enriched marker genes from 40 RGC subtypes reported by Rheaume et al. (Supplementary Table 15). We visualize the expression of these enriched marker genes across all RGC-related clusters (R2, R4, R5, R6, R15, R19) in Fig. 9c, and we find that these clusters represent various RGC subtypes. Notably, MarsGT only identified a major RGC cluster.In the B_lymphoma dataset, MarsGT reported a rare state named B lymphoma-state-1. t-SNE is utilized to visualize the cell distribution in the lymphoma dataset (Fig. 9d). scCAD identifies a total of five rare cell clusters (R1~R5) (Fig. 9e). The Jaccard similarity coefficients between the sets of differentially up-regulated genes for each cell cluster identified by scCAD and each cell type annotated by 10X Genomics are presented in Supplementary Table 16. According to Supplementary Table 16, these clusters include one Mono/T mix cluster (R2, 0.27%), one plasmacytoid dendritic cells (pDC) cluster (R3, 0.29%), and one Stromal cell cluster (R5, 0.74%). The top 50 differentially up-regulated genes in each cluster are detailed in Supplementary Data 6. We identified distinctive markers associated with these rare cell types within the differentially up-regulated genes, including CD16378, IL3RA79, and CALD180. Unlike the other clusters, neither cluster R1 (0.35%) nor cluster R4 (0.13%) has a corresponding annotated cell type. 
Through the analysis of their differentially expressed genes in Supplementary Data 6, we conclude that they likely correspond to gamma-delta T cells and mucosal-associated invariant T (MAIT) cells, as indicated by the up-regulated expression of marker genes CENPF81 and KLRB182, respectively. In contrast to scCAD, MarsGT did not identify these rare cell types.Moreover, scRNA-seq data is more readily accessible, thereby streamlining the data acquisition and processing workflow and reducing experimental costs. In summary, scCAD holds advantages in performance, stability, and cost-effectiveness.scCAD effectively identifies well-validated dendritic cell subtypesDendritic cells (DCs) play a central role in pathogen sensing, phagocytosis, and antigen presentation. DCs are one of the rarest types of immune cells, constituting only 1–2% of peripheral blood mononuclear cells (PBMCs)83. Villani et al.79 identified six distinct subtypes of dendritic cells (DCs) by analyzing their expression profiles using fluorescence-activated cell sorting (FACS). They validated the existence of these subtypes by flow cytometry.To further test the reliability of the rare clusters identified by scCAD, we apply scCAD to the widely used ~68k PBMC dataset and investigate whether any dendritic cell subtypes not captured in the original annotation could be identified. This dataset encompasses 11 cell types annotated in the original study, accounting for proportions ranging from 0.14% to 30.29%. t-SNE is applied to visualize the distribution of the cells (Fig. 10a).Fig. 10: Visualization analysis of scCAD’s results in ~68k PBMC dataset.a The t-SNE-based 2D embedding of the cells is presented, with color-coded identities indicating cell types. Reg regulatory, NK natural killer Cells. b The rare cell clusters identified by scCAD are visually distinguished using four distinct colors. 
c The Pearson correlation heatmap compares the two identified dendritic cell (DC) clusters, all annotated DCs, and six validated DC subtypes, based on the average expression values of common DC marker genes across the two datasets. d The expression distribution of the top marker genes related to six DC subtypes across the two DC clusters and all annotated DCs. Source data are provided as a Source Data file.scCAD identifies a total of four rare cell clusters, denoted as R1 (0.50%), R2 (0.24%), R3 (0.13%), and R4 (0.12%) (Fig. 10b). Upon comparing the original annotation, we find that R2 consists of megakaryocytes, a type that makes up only 0.4% of the entire dataset. Additionally, R1 and R4 mainly consist of DCs annotated in the original study, while R3 mainly consists of CD19+ B cells. The top 50 differentially up-regulated genes in these three cell clusters are detailed in Supplementary Data 11.To explore the true identities of these two DC clusters (R1 and R4), we calculate the correlation between their average expression and that of all well-validated DC subtypes on the same marker gene set. First, we construct a gene set using the top 50 markers for each DC subtype reported by Villani et al.79 Due to differences between datasets, the filtered gene set consists of 245 markers. Next, we calculate the average expression of this gene set for each DC subtype in the Villani et al. dataset. Simultaneously, we calculate the average expression of this gene set for R1, R4, and all DCs in the ~68k PBMC dataset. Finally, we compute the Pearson correlation coefficient between them (Fig. 10c).As shown in Fig. 10c, we observe that between the two datasets, clusters R1 and R4 show the highest similarities to DC subtypes DC1 and DC6 (pDC), with similarities of 0.8 and 0.74, respectively, significantly higher than those of other subtypes. Violin plots illustrate the expression distribution of the top markers for each DC subtype across R1, R4, and all DCs (Fig. 10d). 
Clusters R1 and R4 exhibit significant expression of top markers belonging to DC subtypes DC1 and DC6, respectively. Combining Fig. 10c and d, we map cluster R1 to CLEC9A+ DCs and cluster R4 to pDCs. On further examination, we compare the expression of the differentially up-regulated genes in R1 and R4 with that in other DCs (Supplementary Fig. 21a), which highlights their rarity in the dataset.

Cluster R3 cells are initially annotated as CD19+ B cells. Supplementary Fig. 21b compares the expression of differentially up-regulated genes in R3 with other B cells. However, we identify multiple markers for plasma cells, such as CD27 (TNFRSF17), MZB1, DERL3, ITM2C, and IGLL5 [84]. Furthermore, other studies [85,86] have also reported the rarity of plasma cells in this dataset, which supports our findings.
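The cross-dataset comparison used to assign R1 and R4 to DC subtypes (restrict to markers present in both datasets, average expression per group, then compute the Pearson correlation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `mean_profile` and `pearson` are hypothetical, `GENE_X` is a placeholder marker, and all expression values are invented.

```python
from math import sqrt


def mean_profile(cells, genes):
    """Average expression of each gene over a group of cells (dicts gene -> value)."""
    return [sum(cell.get(g, 0.0) for cell in cells) / len(cells) for g in genes]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(vx * vy)


# Keep only the markers measured in both datasets (the "filtered gene set").
markers = ["CLEC9A", "IL3RA", "CD1C", "GENE_X"]       # GENE_X is a placeholder
query_genes = {"CLEC9A", "IL3RA", "CD1C"}              # genes in the query dataset
reference_genes = {"CLEC9A", "IL3RA", "CD1C", "GENE_X"}
shared = [g for g in markers if g in query_genes and g in reference_genes]

# Invented expression values for one rare cluster and two reference subtypes.
cluster_cells = [
    {"CLEC9A": 4.0, "IL3RA": 0.0, "CD1C": 1.0},
    {"CLEC9A": 6.0, "IL3RA": 0.0, "CD1C": 1.0},
]
reference = {
    "DC1": [{"CLEC9A": 10.0, "IL3RA": 0.0, "CD1C": 2.0}],
    "DC6": [{"CLEC9A": 0.0, "IL3RA": 8.0, "CD1C": 1.0}],
}

cluster_profile = mean_profile(cluster_cells, shared)
correlations = {
    name: pearson(cluster_profile, mean_profile(cells, shared))
    for name, cells in reference.items()
}
best = max(correlations, key=correlations.get)
print(best, {k: round(v, 3) for k, v in correlations.items()})
```

Because Pearson correlation is invariant to scale and offset, it compares the shape of the marker profile across datasets, which makes it less sensitive to differences in sequencing depth or normalization between the two studies.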
