Accurate identification of locally aneuploid cells by incorporating cytogenetic information in single cell data analysis

Single-cell RNA sequencing (scRNA-seq) has greatly improved our ability to understand the cellular composition of the tissues and organs of interest, identify phenotype-associated cell groups, and elucidate the mechanisms behind many biological processes1,2,3. These advantages make scRNA-seq a powerful tool for studying a wide range of human diseases, including Alzheimer’s disease4, cardiovascular disease5, and cancer6. In cancer research, a crucial step of scRNA-seq data analysis is to delineate tumor cells or neoplastic cells from other cell types7,8. The tumors of each patient have their unique tumoral and microenvironmental evolution, and thus the scRNA-seq data from cancer patients tend to be more heterogeneous. Such heterogeneity is an exciting opportunity for improving our understanding of cancer with scRNA-seq, but it also imposes computational challenges to dissect composing cell types9.Neoplastic cells are abnormal cells that are undergoing excessive and uncontrolled proliferation10. These cells, which may or may not be malignant, can be extracted experimentally through cell sorting, although this is not always possible due to a lack of suitable markers or the high cost and labor requirements associated with these experiments11. In fact, it may be more useful to study all cells from a sequencing experiment simultaneously in order to understand the characteristics of neoplastic cells within their surrounding microenvironment. Neoplastic cells in certain types of cancer often have distinct features compared to non-neoplastic cells, such as high expression of certain cell type markers or genes belonging to some oncogenic pathways12. However, identifying neoplastic cells based on these markers or pathways can be difficult due to inter-individual heterogeneity, technical artifacts, and noise from the tumor microenvironment13.Recently, computational methods have been developed to identify large-scale copy number variations (CNVs) by comparing the smoothed scRNA-seq data against an internal or external normal reference, such as inferCNV, HoneyBADGER, and copyKAT3,14,15. For example, inferCNV is a popular visualization method for identifying large-scale CNVs. It uses smoothed averages over gene windows and compares the expression magnitude to the average over a set of reference ‘normal’ cells. CopyKat is a recently developed tool serving a similar purpose15. CopyKAT uses an integrative Bayesian segmentation approach combining CNV inference and hierarchical clustering, which has been shown to achieve high accuracy in distinguishing cancer cells from normal cells in multiple cancer types. Both inferCNV and CopyKAT generally work well with tumor cells that demonstrate extensive chromosomal alterations, but they do not work well for cancer types that have fewer and shorter CNVs. This is often the case in hematologic cancers such as myelodysplastic syndromes and acute myeloid leukemia16. Moreover, they couldn’t incorporate additional clinical information for detecting specific CNVs. There are also methods that sought to integrate scRNA-seq data with bulk DNA sequencing (DNA-seq) or single-cell DNA-seq (scDNA-seq) data, such as CONGAS, clonealign, and CCNMF17,18,19. These methods serve a different purpose: to cluster cells based on CNV or mutation information.In this paper, we propose to exploit cytogenetic information to improve the sensitivity and specificity for CNV identification. Cytogenetic data are routinely measured and recorded for patients with hematologic cancers20 and they provide useful information to identify CNVs21. In the case of myelodysplastic syndromes, cytogenetic features, along with other factors such as morphology, immunophenotype, and clinical features, are included in the World Health Organization (WHO)-classification-based Prognostic Score System (WPSS) for myelodysplastic syndromes22 and its revised version23. Similar risk scoring systems also exist for other types of hematologic malignancies24,25. For example, Leukemia patients with certain cytogenetic features, such as deletion of chromosome 7 or 7q, deletion of 3q, or amplification of chromosome 8, have been shown to have a poor prognosis26. Cytogenetic data provide location information of each CNV and the proportion of cells with specific CNVs based on the analysis of 20 metaphases. For example, if a patient has cytogenetic data as “46,XY,del(20)(q11.1q13.1)[5]/46,XY[15],” this means that approximately \(25\%\) percent of cells (5 out of 20) have a deletion of chromosome 20 in the region q11.1 to q13.1, while the rest of the cells have normal chromosomal features. Cytogenetic data are typically cheaper and more readily available in clinical settings compared to DNA-seq or scDNA-seq experiments. While the proportion in cytogenetic data is a crude estimate of the aberrant cells, they can still be useful in classifying cell status and identifying cells with chromosomal abnormalities, which may be markers for neoplastic cells27. None of the existing computational methods is able to incorporate such cytogenetic information in the analysis of scRNA-seq data.Here, we introduce two methods, partCNV and partCNVH, for identifying cells with regional chromosomal abnormalities from scRNA-seq data by integrating cytogenetic information. Both methods are based on a statistical framework that models the count expression matrix of scRNA-seq data using a mixture of Poisson distributions while incorporating the cytogenetic information through prior specification. PartCNVH is built on partCNV and it further includes a hidden Markov model (HMM) to improve feature selection and clustering accuracy. It should be noted that our proposal is complementary to the existing methods such as copyKAT and inferCNV, as they focus on identifying large-scale CNVs while we detect smaller variations with the incorporation of external information. We implement our proposed methods in a computationally efficient expectation-maximization (EM) algorithm28 and evaluate their performance through extensive simulation studies. We then apply them to three scRNA-seq data sets from patients with hematologic malignancies and show that they can identify cells with chromosomal deletions or amplifications in specific regions suggested by the cytogenetic data. We also perform additional analysis to understand the changes in the pathways and biologic processes in the identified aneuploid cells. Compared to existing methods, partCNV and partCNVH provide more interpretable results and additional findings. With the widespread use of single-cell technology in hematologic cancer research and clinical care of cancer patients, our methods offer a useful solution for fully leveraging cytogenetic data to identify cells with specific chromosomal abnormalities.Fig. 1Schematic of PartCNV and PartCNVH. With the input of normalized expression counts from scRNA-seq experiments and the cytogenetic information from the patient, we develop an EM algorithm with mixtures of two Poisson distributions to infer the aneuploid/diploid status for the regions of interest. We further include a hidden Markov model to improve feature selection and the classification accuracy.Method overviewPartCNV is a statistical framework that uses a hierarchical Poisson mixture model to differentiate two mixture components corresponding to normal and aberrant cells. PartCNVH is an extension of partCNV with the addition of HMM when there is a sufficient number of genes that allow feature selection. Figure 1 provides a schematic overview of the proposed methods. Our methods start with the normalized expression counts from the region with a known chromosomal deletion or amplification and explicitly incorporate the prior knowledge from cytogenetic data through imposing a Bernoulli prior on the cell status (i.e., normal or aberrant). We develop an EM algorithm that treats cell status as the missing variable and efficiently solves the mixture model. The inferred cell status from this step is the output of partCNV.Taking the ouptut from partCNV, partCNVH further refines it by a HMM. Specifically, a group average is taken for the inferred two groups of cells and the rolling average of the ratios between the two groups is used to infer the hidden status of the regions by a HMM. There are two reasons that we adopt this combination of rolling average and HMM in partCNVH. First, as shown in our later results, the group mean and rolling average can effectively magnify the signal of the regional deletion or amplification on the expression level. This is especially important when the signal of copy number alternations is weak related to noise in gene expression measurement. Second, it is possible that only a subset of the regions of interest has copy number changes. HMM can identify regions that are more likely to contain the chromosomal changes, which in turn improves the performance of aneuploid/diploid cell classification. After this HMM-based feature selection step, partCNVH performs a second round of the EM algorithm using the Poisson mixture model and reports the inferred cell status.SimulationsWe design comprehensive simulation settings to evaluate the performance of partCNV and partCNVH. As comparisons, we also consider existing methods using dimension reduction by principal component analysis (PCA) followed by Louvain or Leiden clustering. Previous literature has reported that the Leiden algorithm can generate better connected communities through including an extra refinement step and run faster than Louvain29. Additionally, we include two widely used machine learning clustering algorithms, K-means clustering and hierarchical clustering. All the previous mentioned methods can be applied to detect locally aneuploid cells. Although our proposed method is not directly comparable to existing methods that classify cells based on whole-genome CNV inference, we still design a separate simulation study to compare the proposed methods versus the two large-scale CNV detection-based methods, inferCNV and copyKAT14,15.We consider two settings where the first one studies aneuploid cells with deletions and the second one studies amplifications: redSimulation data 1 and 2. The mean expression for these genes is generated by taking a ratio of the normal expression. This ratio (or log fold change) is randomly drawn from a Uniform distribution with different base levels (0.5/0.6 for Setting 1 and 1.5/1.4 in Setting 2) and different noise levels. A larger noise level makes the expression from the aneuploid cells similar to the normal cells, and thus creates harder scenarios for the methods. The evaluation criteria is the accuracy of the classification results of the proposed method evaluated by the true cell status (i.e., being aneuploid or normal). More details of the simulation settings are provided in the Methods section.Fig. 2Results of simulation Setting 1 with deletions. The methods that are compared include K-means clustering (Kmeans), hierarchical clustering (HClust), dimension reduction using PCA plus Louvain clustering (PCA+Louv), and dimension reduction using PCA and Leiden clustering (PCA+Leid). Each simulation dataset contains a total of 3000 cells: 2500 normal cells and 500 with deletions. (A) The accuracy of these methods when the ratio of gene expression in a deletion region versus normal expression is 0.5 at different noise levels (0.1: low, 0.2: medium, 0.3: high). (B) The results of the setting with ratio = 0.6 at different noise levels. All results are summarized over 100 Monte Carlo iterations.First consider simulation data 1, where 500 out of 3000 cells have deletions. Our proposed methods partCNV and partCNVH have the highest accuracy among all the methods in all scenarios (Figure 2A-B). Using the same normalized gene expression counts input as our methods, K-means and hierarchical clustering have the lowest accuracy ranging from 0.5 to 0.7. PCA plus Louvain and Leiden have higher accuracy than K-means and hierarchical clustering. When the signal is strong (ratio = 0.5 in panel A) and noise is small, PCA plus Louvain/Leiden also have similar high accuracy as the proposed methods. But with the increase of the noise level, the accuracy of PCA plus Louvain or Leiden decreases. The advantage of the proposed methods becomes more obvious when the ratio is 0.6 and the noise level is high. For example, the mean accuracy of PCA plus Leiden is around 0.75 while partCNVH can achieve a high accuracy of 0.9. This is understandable since the proposed methods specifically model the data through two components for normal and aneuploid cells, and they allow mixtures of regions with and without deletions. Second, partCNV and partCNVH have similarly good performance with accuracy higher than 0.9, and partCNVH generally has higher accuracy than partCNV. To better understand the role of the HMM step of partCNVH and the result of feature selection, we use one simulation data set as an example and visualize the mean gene expression across the region (Figure 3A), ratios of the mean expressions of the two groups inferred by partCNV (Figure 3B), the rolling average of the mean expression ratios (Figure 3C), and the inferred status from HMM (Figure 3D). It can be seen that the rolling average of mean expression ratios between the two groups can effectively magnify the signal, and a majority of the HMM selected genes are located in the region with deletion. Figure 3E shows that our proposed method partCNVH has greater accuracy in classifying cells than the other methods.Fig. 3Illustrating the procedure of the feature selection step using HMM in a simulation data set. (A) The log-transformed mean gene expression levels for the genes located in regions without and with deletion. Each dot is a gene. (B) The log-transformed ratio of the mean expression levels for cells without versus with deletion inferred by partCNV. (C) The expression mean ratio for the two groups of cells by partCNV after applying the rolling average with a bandwidth of 50. (D) The latent states inferred by HMM based on the rolling average from panel (C). (E) The results of classification accuracy of partCNVH, PCA plus Leiden, hierarchical clustering, and K-means clustering. The red dots are the cells incorrectly classified and the blue dots are the correct ones.Evaluation of PartCNV & PartCNVH for different prior information: amplification regionsNext we evaluate different methods in simulation data 2 with amplifications (Figure 4). Note that for cells with deletion, the expression change odds is 2 (from 1 to 0.5) while in cells with amplification the odds is 0.67 (from 1 to 1.5) or its inverse 1.5. Thus the signals from the amplified regions can be harder to detect than the first setting with deleted regions. In Figure 4(A), we observe the performance of all methods decrease, especially for PCA plus Louvain and Leiden. When the noise level is high (0.3), PCA plus Leiden only reaches an accuracy of 0.74. In comparison, our proposed method still has a high accuracy of around 0.95. With the ratio level set as 1.4 in panel B, the signal level becomes lower and the existing methods have even lower accuracy ranging from 0.5 to 0.6, while the proposed methods still stay at a reasonable accuracy level around 0.9. These demonstrate the robustness of the proposed methods and highlight the importance of applying partCNV or partCNVH instead of existing methods when the region of interest has amplifications.Evaluation of PartCNV & PartCNVH for different prior information: cell numbersOur current simulation design considers a total of 3000 cells. When we study a region of interest suggested by cytogenetic data, more cells generally provide more information, and thus identifying signals from fewer cells can be more challenging. To evaluate the proposed methods under this scenario, we generate simulation data 3 by fixing the cell number as 1300, where 1000 cells are normal and 300 are aneuploid cells. Figure S1 shows the simulation results with 1300 total cells and the region of interest (200 out of 600 genes) has a deletion in the aneuploid cells. Compared with the results in Figure 2, all the existing methods have worse performance for the same ratio and noise combinations. For example, both PCA plus Louvain and PCA plus Leiden have a high accuracy of around 0.90 when the ratio is 0.6 with a medium noise of 0.2 using 3000 total cells, while the accuracy decreases to around 0.8 using 1300 cells. The variation of the classification results also increases. Although our proposed methods have slightly decreased performance when the ratio is 0.6 and the noise is 0.3, their overall performance remains similar in other scenarios (ratio = 0.5 at all noise levels, ratio = 0.6 with low and medium noise levels).Fig. 4Results of simulation Setting 2 with amplifications. Each simulation dataset contains a total of 3000 cells: 2500 normal cells and 500 with amplifications. (A) The accuracy of the methods compared when the ratio = 1.5 at different noise levels (0.1: low, 0.2: medium, 0.3: high). (B) The results when the ratio = 1.4 at different noise levels. All results are summarized over 100 Monte Carlo iterations.Similar patterns can be observed in amplification settings by comparing the results in Figure S2 versus the results in Figure 4. Surprisingly, PCA plus Louvain or Leiden has even worse median accuracy than the hierarchical clustering, even though they all have quite low classification accuracy. These results suggest that our proposed methods tend to have more robust performance even with fewer cell numbers in the analyzed dataset, while the existing methods have decreased accuracy and more varied results, especially when the noise level is high.Fig. 5Simulation results of evaluating the impact of different prior information on the classification accuracy of the proposed methods. The dark and light blue bars correspond to the results with and without correct specification of the prior information (true prior: 0.17). Panel (A) and (B) are the simulation results based on the first simulation setting with ratios = 0.5, 0.6, respectively. Panel (C) and (D) are based on the second simulation setting with ratios = 1.5, 1.4, respectively. All results are summarized over 100 Monte Carlo iterations.Evaluation of PartCNV & PartCNVH for different prior information: proportions of aneuploid cellsAs a methodology advantage, partCNV and partCNVH are able to incorporate the prior knowledge of an estimated proportion of aneuploid cells. If the prior is misspecified, we seek to understand the impact on the results. We generate simulation data 4 with the same data generation procedure but the prior information is specified as correct (0.17 for total cell number 3000 and 0.23 for total cell number 1300) and incorrect (0.5), and we examine the results of our proposed methods.Figures 5 and S3 illustrate the accuracy for the total cell numbers 3000 and 1300. First, it is clear that the correct prior knowledge improves the classification accuracy than a non-informative prior of 0.5. This improvement is small when the ratio is 0.5 or 1.5, but it can be substantial when the signal is harder to detect (ratio = 0.6 or 1.4) and the noise level is high. For example, when the ratio is 1.4 and the noise is 0.3, the improvement of the accuracy for both partCNV and partCNVH using a correct prior can be about \(10\%\) compared to using the incorrect prior. Second, both Figures 5 and S3 demonstrate the robustness of the proposed methods against incorrect prior information, especially in panels A and C where the ratios are 0.5 and 1.5, respectively. In these experiments, we choose 0.5 as the incorrect prior knowledge, which is far from the true proportion 0.17 and illustrate a worst scenario that the prior is completely non-informative. In reality, when a closer prior such as 0.20 or 0.15 is used, the impact would be much smaller. Even in the worst scenario, with an amplification ratio 1.4 and a high noise level 0.3, our method with an incorrect prior still reaches a median accuracy above 0.75, which is better than the existing methods under the same scenario. These results highlight the advantage of the proposed methods in accurately identifying aneuploid cells.Comparison with genome-wide CNV detection methods Lastly, we compare the proposed methods with two widely used genome-wide CNV detection methods, inferCNV, copyKAT as well as the regional versions of inferCNV and copyKAT. For the regional versions, we applied hierarchical clustering on the normalized expression matrix by inferCNV/copyKAT for the regions of interest only. One example of simulation data 4 is visualized in Figure 6A. As inferCNV requires the input of normal cells, we use an additional 100 normal cells as the reference for inferCNV. Both inferCNV and copyKAT generally are much more computationally intensive. We summarize the accuracy of 20 Monte Carlo simulations. It can be observed that neither inferCNV/copyKat nor the regional versions of these methods is able to accurately infer aneuploid/diploid cell status (Figure 6B-D), since the regional aneuploid signal is too weak compared to genome-wide copy number alterations that inferCNV and copyKAT are designed to detect. It is also interesting that when we re-cluster the cells based on the normalized expression from the sub-regions of chromosome 20, the performance of copyKAT increased but not inferCNV. These findings highlight the need for region-specific detection tools with the considerations of cell type mixtures to distinguishing between aneuploid and diploid cells.Fig. 6Simulation results to compare the proposed method versus existing whole-genome based methods, InferCNV and CopyKAT. (A) Heatmap of the simulated expression values for genes in region Chr20(q11.1-q13.1), where the rows are the cells and columns are the genes. Logarithmic of the expression values plus one are used for visualization. Rows are labeled by the true aneuploid/diploid status and the inferred status by partCNV and partCNVH. (B) CopyKAT output of the aneuploid/diploid prediction versus the true cell status. The heatmap includes all the chromosomes. (C) Copy number results using InferCNV. (D) Boxplot of the aneuploid/diploid inference of the methods averaged over 20 Monte Carlo simulations. Regional InferCNV and Regional CopyKat are the hierarchical clustering results based on the normalized expression of the region of interest from InferCNV and CopyKAT.Real data applicationWe demonstrate the usage of the proposed methods on three real data applications. For each application, the scRNA-seq data from one patient were collected by the 10X genomics platform and the cytogenetic data were collected from patients’ medical records. All three patients have a subset of the cells with regional copy number variations, and the rest are normal cells. The three applications have different complexity levels. The first subject (patient 1) is the most straightforward one, as a very long region (the whole chromosome Y) was reported as lost in a subset of the cells according to the patient’s cytogenetic data. The second subject (patient 2) has a subset of cells with partial chromosome 20 lost. The third subject (patient 3) has a complicated situation as this patient has cells with partial deletions, as well as cells with partial amplifications.Patient 1: MDS with loss of chromosome YWe obtain the scRNA-seq data of a bone marrow sample from an MDS patient treated at MD Anderson Cancer Center. The data, after alignment and quality control, contains a total of 33,538 genes and 655 cells. For this patient, the bone marrow sample has been specifically sorted for CD34+ cells to enrich hematopoietic stem and progenitor cells (HSPCs). As a result, the cell number is smaller than regular scRNA-seq experiments. Based on the clinically obtained cytogenetic data, this patient has around \(35\%\) of cells with the loss of chromosome Y and our goal is to identify these aneuploid cells. We first apply copyKAT to the whole transcriptome data from this sample to infer copy number variations. As shown in Figure S4, copyKAT clusters the cells based on the inferred CNV statuses across the whole genome, but it could not take regional data or the cytogenetic information into consideration when identifying the neoplastic cells.Fig. 7Results of applying different methods to the scRNA-seq data using the bone marrow CD34 positive cells from patient 1. Cytogenetic information shows that \(\sim 35\%\) of the cells from the sample have chromosome Y loss. (A-E) The cell classification based on the inference of chromosome Y loss using different methods. Panel F shows the heatmap of the ARI values from comparing the classification results using different methods. (G) The expression of chromosome Y genes for cells that are labeled as normal in partCNV (partCNV:0) and aneuploid in CopyKAT (CopyKat:1), and three similar groups. (H,I) The expression of chromosome Y genes comparing partCNV versus k-means clustering and comparing partCNV and Hierarchical clustering.To specifically identify cells with the chromosome Y loss, we apply the proposed methods and the five existing methods (PCA plus Louvain, PCA plus Leiden, K-means clustering, hierarchical clustering, and CopyKAT) on this dataset. The input data are the normalized counts from the genes located on chromosome Y. In this application, the HMM step from partCNVH selects the whole set of genes so we only present the results from partCNV. Since the CNV in this dataset encompasses the entire chromosome Y, we expect some of the existing methods also work well for this analysis. We find that partCNV, PCA plus Louvain, PCA plus Leiden, and K-means clustering all have proportions of aneuploid cells close to 35% (40.9%, 39.5%, and 39.5%, respectively) (Figure 7 A-E). Hierarchical clustering identifies a much smaller number of aneuploid cells (9.31%). From visualizing the pairwise ARI values of these results, we find that partCNV, PCA plus Louvain/Leiden have very similar results, while K-means clustering and hierarchical clustering have very different results (Figure 7 F). As the UMAP coordinates in Figure 7(A-E) are obtained using the whole transcriptome data, the fact that copyKAT-identified aneuploid cells cluster together suggests that copyKAT captures the whole transcriptome pattern instead of chromosome Y specific changes.We further examine the average gene expression of chromosome Y genes among the aneuploid/normal cells identified by partCNV, copyKAT, K-means, and hierarchical clustering (Figure 7 G-I). It is apparent that partCNV-labeled aneuploid cells have much lower expression than the cells labeled as aneuploid by other methods but normal per partCNV, confirming that partCNV correctly identifies the cells with deletion on chromosome Y.In summary, these results suggest that partCNV and the PCA plus Louvain or Leiden clustering have identified the cells with the chromosome Y loss. In contrast, K-means clustering, hierarchical clustering, and copyKAT failed to do so.Patient 2: MDS with partial deletion of chromosome 20We obtain the scRNA-seq data from the bone marrow sample for a different MDS patient. The data were also generated by the 10X genomics scRNA-seq technology. This sample was sequenced directly without the cell sorting step, and thus both HSPC and immune cells can be potentially identified. After alignment, preprocessing, and quality control, a total of 24,519 genes and 3,643 cells are kept for the analysis. Based on the cytogenetic data, about 20% of cells in the sample have deletions in chromosome 20 at regions q11.1 to q13.1, which is about 24.2 Mb long. We also apply CopyKAT to this data and present the heatmap result in Figure S5. Although cytogenetics reported deletions in chromosome 20, the log copy number ratio heatmap does not have an obvious deletion pattern in the suggested regions. Based on the whole genome copy number inference, copyKAT only reports about 500 aneuploid cells (\(\sim\)5.5%).Fig. 8Application results of the proposed method and existing methods using the scRNA-seq data from patient 2. (A) The cell type annotation based on marker genes’ expression and biological knowledge. RBC progenitor: red blood cell progenitors. HSPC: hematopoietic stem and progenitor cells. NK: natural killer. (B-E) The inferred aneuploid cells with deletion chr20(q11.1q13.1) using different methods (B, partCNV; C, PCA+Louvian; D, K Means clustering; E, Hierarchical clustering). The proportions in the title bracket are the proportions of cells that are inferred as cells with this specific deletion. (F) The comparisons of results by partCNVH and PCA plus Louvain versus cell type labels. (G) Gene set enrichment analysis results for the genes that are differential expressed between the cells with and without deletion chr(20)(q11.1q13.1). Gene sets are defined using the Reactome Pathway Database.We analyze the whole-transcriptome data using Seurat2 through identifying highly variable genes, extracting top principal components (PCs) based on these genes, and we perform UMAP and clustering analysis (Figure 8). UMAP is used for dimension reduction and unsupervised clustering is performed with the default Louvain clustering using the top PCs. A total of 10 clusters are identified, and the cluster specific markers are used for annotating the cell type labels based on biological knowledge by our MDS biologist. We also apply the proposed methods and existing methods targeting the region on chromosome 20 with known chromosomal deletions. Figure 8 A-E shows cell type labels and the aneuploid/diploid inference result using partCNVH, PCA plus Louvain, K-means clustering, and hierarchical clustering. The results for partCNV and PCA plus Leiden are presented in Figure S6. We find that the proposed methods have the closest proportion of aneuploid cells to the cytogenetics reported proportion; the other methods all have much lower or higher proportions. The major difference between our proposed method and PCA plus Louvain is in the cycling RBC progenitors (Figure 8 B, C, and F). Our method reports high proportions in all three MDS-related cell groups (i.e., HSPCs, RBC progenitors, and cycling RBC progenitors), while PCA plus Louvain only reports aneuploid cells in the former two clusters. Previous literature found impaired erythroid-proliferating capacities to be a prominent characteristic in patients with MDS30,31. Both RBC progenitors and cycling RBC progenitors are major cell types involved in the erythroid-proliferating function, and thus it makes sense to identify neoplastic cells in both cell types.To understand the differences between the identified locally aneuploid cells and the normal cells, we conduct differential expression analysis for each cluster to compare the aneuploid versus diploid cells identified by our method32. A total of 177 cluster-specific differentially expressed genes were identified using a cutoff of 0.05 for adjusted p values. We also perform over-representation analysis using GO Biological Process database33,34 and identify several functional categories over-represented by differential expression signals, including neutrophil degranulation, a few immune system related terms, and interleukin 4 and interleukin 13 signals (Figure 8 G). Many of these terms have been reported in previous literature to be related with MDS. For example, neutrophil degranulation and migration has been reported to be associated with MDS compared with normal controls35. The important role of the immune system and innate immune signaling in MDS has also been reported in multiple publications36,37,38. The over-representation results using the Reactome pathway and Hallmark database are presented in Figure S739,40. In the Reactome pathway enrichment results, we also find several immune response-related and lymphocyte activation related terms. In Hallmark analysis, allograft rejection, complement, interferon gamma response, and TNFA signaling via NFKB are the top findings, which also have been associated with MDS pathogenesis41,42,43. There are also some terms that have not been reported in previous MDS studies, such as generation of second messenger molecules, hemostasis, defense response, and cell activation, which could be promising targets for future research.Patient 3: AML with partial gain and partial deletion of chromosome 8Lastly, we study the scRNA-seq data from an AML patient with complicated chromosomal variations44. Specifically, this patient has amplifications of the whole chromosome 8 in 25% of the cells, deletion of chromosome 8 at region q21.2 to q24.3 in 40% of the cells, and normal karyotype in the rest of the cells. We are interested in identifying which cells contain the chromosome 8 gain and which have del(8)(q21.2q24.3). The scRNA-seq data were generated using the 10X genomics technology platform. After preprocessing, the data contain a total of 20,521 genes from 4294 cells. Since there are overlaps between the deletion and amplification regions, we apply partCNVH through a two-step approach. We first focus on the region of chromosome 8 before q21.2 where about 25% of the cells have amplification; the rest of the cells are normal in the area. After the cells with chromosome 8 gain are detected, we apply partCNVH again to the rest of the cells, which contain del(8)(q21.2q24.3) in around 53.3% (\(= 40\%/(1 – 25\%)\)) of the cells. In the two steps, \(25\%\) and \(53.3\%\) are used as the prior knowledge of the aneuploid cell proportions in partCNVH.Fig. 9Results of applying the proposed method partCNVH to the scRNA-seq data from an AML patient with \(\sim 25\%\) cells having amplification of chromosome 8 and \(\sim 40\%\) cells having chromosome 8 deletion at region q21.2-q24.3. (A) The true cell type labels annotated by a clinician-scientist. (B) The inferred cells with amplification of 8 (“+8”) and the deletion of chromosome 8 at q21.2-q24.3 (“del(8)(q21.2q24.3)”). (C, D) The Reactome pathway enrichment analysis results using the DEGs by comparing the AML cells with del(8)(q21.2q24.3) versus diploid cells, and by comparing the erythroid cells with +8 versus diploid cells, respectively.We find that the majority of the inferred cells with chromosome 8 amplification are from the erythroid cells, while the cells with del(8)(q21.1q24.3) are mostly AML cells (Figure 9 A-B). The proportions of cells with chromosome 8 amplification and del(8)(q21.1.q24.3) are about 13% and 60%, which are close to the prior knowledge from cytogenetics. We visualize the normalized expression for the region of interest on chromosome 8 in Figure S8A. We also present the results by inferCNV and copyKAT in Figure S8B and Figure S9, respectively. We observe that the patterns for chromosome 8 varies between the three cell groups in Figure S8A. However, whether the pattern shows gain or loss is not easy to identify due to data sparsity. Neither CopyKAT nor InferCNV can be applied to regional data, and thus they couldn’t reproduce the patterns we observed in Figure S8A.To understand the different molecular mechanisms related to chromosome changes, we perform differential analysis for comparing the AML cells with del(8)(q21.2q24.3) versus diploid AML cells and obtain 266 differentially expressed genes (DEGs). Similarly, we compare the erythroid cells with a gain of chromosome 8 versus diploid erythroid cells and obtain 426 DEGs. Gene set enrichment analysis identify a few shared significant terms between AML and erythroid cells (Figure 9C-D), including multiple immune system related terms, neutrophil degranulation, and cellular responses to stimuli. Some unique terms in AML are hemostasis, platelet activation signaling and aggregation, and arachidonic acid metabolism. Hemostatic and thrombotic complications are prevalent symptoms in AML patients and hemostasis has been studied before for AML pathogenesis-related mechanisms45. The term platelet activation signaling and aggregation is also consistent with the previous literature that the platelet defects and other hemorrhagic symptoms are widely observed in AML patients46. The arachidonic acid metabolism is a process highlighted in a few cancer research publications, but the evidence for their involvement in AML is still accumulating47,48. Overall, our results are consistent with literature and provide some novel disease-related biological processes for future research.

Hot Topics

Related Articles