Robust analysis of allele-specific copy number alterations from scRNA-seq data with XClone

Overview of XClone suiteXClone offers a comprehensive end-to-end pipeline for CNA analysis using scRNA-seq data. This pipeline takes bam file(s) as its input, models raw UMI or read counts, and outputs both smoothed BAF and RDR matrices for visualization, as well as the detected cell-by-gene CNA states as final results. It also requires users to designate a diploid cell group as a reference. This process comprises four main stages: pre-processing, pre-analysis, modeling, and post-analysis. These stages are thoroughly explained below and illustrated in (Fig. 1a).Preprocessing: extract count matrices with xcltkTo facilitate the pre-processing of single-cell data for XClone, we have developed a toolkit called xcltk, which can be found at https://pypi.org/project/xcltk/. This toolkit is designed to extract cell-by-gene count matrices from one or more BAM files, serving both RDR and BAF modules (as illustrated in Fig. 1b). The input can either be a single BAM file with embedded cell barcodes (like those produced by 10x Genomics) or multiple BAM files with each file corresponding to a single cell (such as those generated by Smart-seq2). By default, xcltk uses gene annotations from GENCODE (32,696 genes for hg19 and 33,472 genes for hg38), aligning with the approach taken by Cell Ranger. However, users have the flexibility to use any custom gene (or other feature) annotations.To produce generate allele-specific counts (upper panel in Fig. 1b), xcltk necessitates a Variant Call Format (VCF) file that contains phased germline genotypes from population haplotype (see next subsection), along with the standard lists of genes and cells. The tool pinpoints all heterozygous variants and calculates UMIs or reads supporting each allele. Consequently, this process generates two matrices that present the alternative allele (AD) and depth (DP) for each SNP within each cell. Making use of genotype phasing information, xcltk can merge multiple SNPs from a single gene and condense both AD and DP into cell-by-gene matrices. It is important to highlight that xcltk avoids double-counting when a UMI or read spans multiple variants within a single gene. This is accomplished by preserving the UMI or read IDs for each SNP and ensuring they are only counted once during the aggregation of all SNPs.In terms of read depth count generation (lower panel in Fig. 1b), xcltk computes the UMIs or read counts for each gene in every cell, resulting in a sparse matrix. The process is highly comparable to the approach adopted by Cell Ranger, a tool developed by 10x Genomics, which includes filtering of reads (by default, reads with MAPping Quality (MAPQ) < 20, aligned_length < 30nt or Combination of bitwise flags (FLAG) > 255 are filtered out).Pre-analysis: three-level allele phasing for haplotype-specific BAFIn XClone, in order to enhance the signal-to-noise ratio when analyzing the allele information in the BAF module, we utilized a three-level allele phasing approach. First, we implemented reference-based phasing (Fig. 1b) by utilizing the human population haplotype reference to phase individual SNPs. This phasing was accomplished by EAGLE224 with the Haplotype Reference Consortium panel25, which was powered by the Sanger imputation server (https://imputation.sanger.ac.uk) as the default option. Once the individual SNPs were phased, xcltk was able to count the phased alleles for SNPs across a gene (as described in Section “Preprocessing: extract count matrices with xcltk”).Second, to avoid the errors in the population-based phasing when it comes to a longer genome range covering many genes, we further performed a gene-bin phasing (local phasing; Fig. 1c) to link a group of proximate and consecutive genes (we call it a gene bin, by default 100 genes). Intuitively, in each cell, we force the allele frequency AF to be the same for all genes in a gene bin, allowing flipping the allele of each gene. Namely, it assumes the CNA states are the same within a gene bin in a cell, even though the CNA states can be different from cell to cell (e.g., some are normal cells and some are tumor cells). Similar to CHISEL6, we implemented an EM algorithm (Supplementary Algorithm 1) to achieve this by flipping the allele of each gene within a gene bin.Third, for synchronizing the BAF for visualization, we introduced a chromosome-range phasing (global phasing Fig. 1c) to further link all gene bins on each chromosome arm. Similar to the gene-bin phasing, this step can also use the genomic auto-correlation on the CNA states, which could be addressed by a dynamic programming algorithm, a HMM, or a Gaussian process model. As this step is relatively independent of CNA state identification, we adopted the dynamic programming algorithm proposed in CHISEL for simplicity and efficiency (see Supplementary Algorithm 2). Note, both the gene-bin and chromosome-range phasing steps depend on the auto-correlation of BAF deviance from 0.5 (theoretical BAF value for allele balance). Otherwise, if there is no allelic imbalance, it will return arbitrary phasing, which is irrelevant to the CNA state inference. The three-level allele phasing illustration is shown in Supplementary Fig. 1 and the remarkable performance of the phasing of the three datasets was visualized in Supplementary Figs. 3, 7, and 9, for sample BCH869, ATC2 and TNBC1, respectively.Pre-analysis: horizontal and vertical smoothing for visualisationBefore detecting the CNA states, it is beneficial to visualize the raw signal of both RDR and BAF modules. Due to the high technical sparsity in scRNA-seq data, it is necessary to perform smoothing for the visualization. Here, we performed both a horizontal smoothing by a weighted moving average (WMA) along with each chromosome and a vertical smoothing across cells via a KNN graph. Both views of smooths are commonly used in scRNA-seq analysis, for example, WMA is the core smoothing method in InferCNV26 and the KNN smoothing plays an important role in smoothing unspliced RNAs in scVelo for RNA velocity analysis27.Taking the RDR module as an example, the raw count matrix was first normalized to its total library counts, then divided by reference cell expression profile, and finally transformed to the logarithm scale as log ratio. Then this raw RDR matrix X is smoothed horizontally by a WMA connectivity matrix W (via right matrix multiplication) and vertically by a KNN connectivity matrix M (via left matrix multiplication), as follows,$$\tilde{X}=M\times X\times W,$$
(1)
where M is a cell-by-cell connectivity matrix generated from the KNN graph (k = 10 by default) of routine scRNA-seq analysis scanpy.pp.neighbors(), and W is a gene-by-gene connectivity matrix linking each gene to its proximate consecutive genes on each chromosome arm (window size t = 40 by default), as follows$${W}_{j,g}=\frac{{W}_{j,g}^{o}}{{\sum }_{h}{W}_{h,g}^{o}},{W}_{j,g}^{o}=\left\{\begin{array}{ll}t-| d(j;g)|,&0 \, < \, | d(j;g)| \, < \, t\\ t * 2,\hfill &d(j;g)=0\hfill\\ 0,\hfill &| d(j;g)| \,\ge\, t,\hfill\end{array}\right.$$where d(j; g) refers to the ordinal distance from any gene j to a query gene g on genome coordinate. Of note, W is column normalized and M is row normalized, namely with sum to one. Also, both matrices have higher weights for each node itself, hence balancing the denoising and signal preserving.For the B allele frequency (BAF) module, we took the cell-by-gene_bin matrix as input and performed a similar smoothing both horizontally (with WMA, window size t = 101 by default) and vertically (with KNN). As an example, the smoothing steps for BCH869 scRNA-seq were shown in Supplementary Fig. 3. Once the smoothed RDR and BAF matrices are generated, they can serve both visual representation and empirical extraction of CNA parameters (see Sections “Modeling: mixture model for expression data (RDR module)” and “Modeling: mixture model for phased allele-specific data (BAF module” below)).Modeling: mixture model for expression data (RDR module)The RDR module aims to detect the states of absolute copy numbers, i.e., CN-neutral, loss, and gain, for each gene in each cell. To start, it requires specifying reference cells (i.e., diploid cells), which can be picked from pre-annotated cell types from transcriptome by examining markers of non-tumor cells. Given the reference cells, XClone will calculate the reference average gene expression (ref_avg) and filter the genes with low expression (default 0.5 for 10x genomics and 1.8 for smart-seq). Optionally, it will further remove top marker genes for each pre-annotation cell group (15 by default), e.g., as detected by the scanpy.pp.rank_genes_groups().Then, we used a negative binomial distribution to model the observed UMI (or read) counts for each CNA state in a Generalized Linear Model (GLM) framework, as follows$$p({X}_{c,g}| {z}_{c,g,k}=1)= \, {\mathtt{NB}}({X}_{c,g}| {\mu }_{c,g,k},{\phi }_{g})\\ {\mu }_{c,g,k}= \, {l}_{c}\times {X}_{ref,g}\times {C}_{k},$$
(2)
where Xc,g denotes the raw read/UMI count at gene g cell c; lc denotes the cell-specific library size factor; Xref,g denotes the average count value for gene g in given reference diploid cells; Ck denotes the copy number ratio compared to reference diploid state for each CNA state. The latent variable zc,g,k indicates if gene g in cell c belongs to CNA state k (zc,g,k = 1) or not (zc,g,k = 0). In this module, we defined three states with index k = 0 for copy loss, k = 1 for copy neutral, and k = 2 for copy gain.Considering that the reference count values Xref,g can be obtained directly, it leaves only three sets of parameters to be estimated: lc, Ck, and ϕg. Here, we estimated them separately by a maximum likelihood estimation with the statsmodels package28 or using a heuristic method. To estimate the library size factor lc for each cell c, by default, we used the ratio of its total sum count to the average of reference cells. To estimate the gene-specific dispersions ϕg, we fit a negative binomial GLM per gene in the form of Eq. (2) but only on the reference cells (Ck = 1) with the fitted library size lc from the previous step.For the CNA ratio Ck, one can directly use a theoretical value, namely {0.5, 1, 1.5}, respectively, for loss, neutral, and gain (Supplementary Fig. 2). Here, we further employed a heuristic method from the empirical distribution. Based on the smoothed RDR value, we selected 3 guided genomic regions (chromosomes or arms) that are most likely to be copy loss, copy neutral, and copy gain regions, as ranked by the mean. In general, CNA regions only cover a small percentage of the whole cell-by-gene matrix and the extreme ratio value at each CNA state can be used to guide CNA ratio initialization. Thus, the CNA ratio Ck will be extracted from quantiles of each of the corresponding chromosomes (or arms), specifically default quantile 0.96 in copy neutral, quantile 0.0001 in copy loss, and quantile 0.99 in copy gain regions.Once all three sets of parameters are fitted via the above procedures, we can calculate the probability of assigning each gene in each cell to a certain CNA state via Eq. (2). Namely, we can obtain a cell-gene-state probability tensor, as follows$$p({z}_{c,g,k}=1| {X}_{c,g},\, {l}_{c},\, {X}_{ref,g},\, {{{\boldsymbol{C}}}})=\frac{{\mathtt{NB}}({X}_{c,g}| {l}_{c}\cdot {X}_{ref,g}\cdot {C}_{k},\, {\phi }_{g})}{{\sum }_{k}{\mathtt{NB}}({X}_{c,g}| {l}_{c}\cdot {X}_{ref,g}\cdot {C}_{k},\, {\phi }_{g})},$$
(3)
where NB() is the probability density function of negative binomial distribution parameterized by the mean and dispersion. Once the CNA state assignment probability is calculated above, we can have a hard assignment of each state by taking the state with the highest probability. Of note, this probability will be further smoothed cross cells by a kNN smoothing and cross genes by an HMM smoothing (see Section “Post-analysis: CNA states smoothing, combination and denoise” below).Modeling: mixture model for phased allele-specific data (BAF module)For the BAF module, we took both AD and DP sparse matrices from xcltk, removed marker genes provided by the RDR module, and performed the three-level phasing to generate gene_bin based matrices as inputs for XClone modeling. Here, we used a beta-binomial distribution to model the observed UMI (or read) counts for each CNA state, as follows$$p({a}_{c,g}| {d}_{c,g},\, {z}_{c,g,k}=1)={\mathtt{BB}}({a}_{c,g}| {d}_{c,g},\, {\rho }_{g,k},\, {\tau }_{g,k}),$$
(4)
where ac,g and dc,g denote phased B allele count and total count at gene_bin g in cell c, respectively. The Beta-binomial is parameterised by the mean success rate ρ and concentration τ (namely α = ρτ and β = (1−ρ)τ in conventional parameterization). Here, we treat the concentration as a technical factor and shared by all gene_bins in droplet-based data (τg,k = 100 by default) while it can vary in smart-seq data depending on the coverage of each gene_bin.Here, zc,g,k is a latent variable indicating the cell c in gene_bin g is CNA state k (zc,g,k = 1) or not (zc,g,k = 0). The parameter ρg,k denotes the expected allele frequency of gene_bin g in CNA state k. The state index k = {0, 1, 2, 3, 4} is respectively for allele A strong bias, allele A minor bias, allele balance, allele B minor bias, and allele B strong bias, with the corresponding theoretical allele frequencies {0, 1/3, 0.5, 2/3, 1} (Supplementary Fig. 2).The dataset-specific allele frequency parameters ρg,k can also be estimated from the horizontally smoothed BAF matrix to capture more accurate CNA patterns. Specifically, we first estimated the allele frequency for copy neutral state ρg,k=2 from reference cells by directly using its gene_bin-specific mean allele frequency. If there are too few reference cells to accurately estimate the allele frequency, it can be set to 0.5 by default. Next, to estimate the allele frequency of other BAF states k = {0, 1, 3, 4}, we applied a Gaussian mixture model to the smoothed BAF matrix to identify five components with mean values, in order, for allele A bias (strong, minor), allele balance and allele B bias (minor, strong) as an empirical estimation of the CNA ratio for the BAF model. XClone also provides another mode of 3 BAF states (allele A bias, allele balance, and allele B bias) with the corresponding theoretical allele frequencies {0, 0.5, 1} visualization for comparison.When the parameters are optimized above, similar to the RDR module, we can also obtain a cell-by-gene_bin-by-state likelihood tensor by normalizing Eq. (4), as follows,$$p({z}_{c,g,k}=1| {a}_{c,g},\, {d}_{c,g},\, {{{{\rho }}}}_{g,\cdot },\, {{{{\tau }}}}_{g,\cdot })=\frac{{\mathtt{BB}}({a}_{c,g}| {d}_{c,g},{\rho }_{g,k},{\tau }_{g,k})}{{\sum }_{k}{\mathtt{BB}}({a}_{c,g}| {d}_{c,g},{\rho }_{g,k},{\tau }_{g,k})},$$
(5)
where BB() is the probability density function of beta-binomial distribution parameterized by the success rate and concentration. Then, similar to the RDR module above, we can also obtain the hard assignment of the BAF-based CNA state by taking the one with the highest probability, after it gets smoothed.Post-analysis: CNA states smoothing, combination and denoiseInstead of directly using Eqs. (3) and (5) for CNA states assignment at each gene (or gene_bin) in each cell respectively in RDR and BAF modules, we further introduced a smoothing strategy for the state’s assignment. Specifically, for each module separately, we first performed a vertical smoothing via a KNN connectivity matrix, as used in Section “Pre-analysis: horizontal and vertical smoothing for visualisation”, i.e., via the left matrix multiplication with M in Eq. (1). Then we treated it as the input emission probability matrix for horizontal smoothing by a HMM. For HMM, the start probability and transition probability can be customized to capture different resolutions of smoothing. Here, we set start probability {0.1, 0.8, 0.1} (10x Genomics), {0.3, 0.4, 0.3} (smart-seq) for the RDR module and {0.2, 0.15, 0.3, 0.15, 0.2} for the BAF module, and transition probability ({t = 10−6, 1−(K−1)t} respectively for cross-state transition and state keeping, Supplementary Fig. 1g) as default. Then, we employed a forward-backward algorithm in HMM to achieve a maximum likelihood for CNA assignment probabilities along genomic coordinates (see more details about this algorithm in Chapter 13 in ref. 29).To combine the CNA states of RDR and BAF modules, we first need to synchronize feature dimensions, as XClone’s RDR module outputs a cell-by-gene-by-state CNA posterior probability tensor, while BAF has a shape of cell-by-gene_bin-by-state. The default way is to map the BAF gene_bin scale back to the gene scale according to the RDR module. Then, we can combine the same dimensional RDR and BAF CNA posterior probability tensors following the combination strategy in Supplementary Fig. 2 to get the final combined CNA posterior probability tensor.After XClone detects the CNA states for each gene in each cell, a post-denoise step can be optionally applied to clean the regions with a low chance of carrying CNAs. Specifically, for each gene, we calculated the proportion of aneuploid cells and then fit a two-component Gaussian mixture model over the aneuploid cell proportion among all genes. Then the genes assigned to the bottom component (i.e., with the lowest proportion of aneuploid cells) will be taken as neural CNA regions, which hence will be masked for CNA analysis. Alternatively, users can manually set a threshold as cell proportion to denoise the probability matrix if there is other evidence. We apply the post-denoise strategy default on the BAF module and it also can be applied on the RDR and combined module by manually setting it.Data generation, download, preprocessing and analysisThe scRNA-seq data for a fresh gastric cancer tissue sample GX109-T1c was generated by 10x Genomics Chromium 5′-scRNA-seq platform at the Centre for PanorOmic Sciences (CPOS) at HKU. Sample was collected with patient’s informed consent. The study was approved by the Institutional Review Board of the University of Hong Kong and the Hospital Authority Hong Kong West Cluster (IRB reference ID: UW14-257) and complies with all relevant ethical regulations, including written informed consent from the patient for data release. Sufficient details of the experiments are provided in Supplementary Methods.The raw FASTQ files were aligned to the human genome (GRCh38) via Cell Ranger v2.2.0. In total, 5400 cells were called with the same pipeline described in30,31 (Details in Supplementary Methods) and were analyzed with Seurat by selecting the top 2000 highly variable genes and using 50 top principal components, followed by Leiden clustering with resolution 0.1, resulting in 7 clusters (Supplementary Fig. 17). We further manually annotated these clusters according to the cell type marker genes in Supplementary Table S2 from32 and aggregated them into 4 major different cell types: endothelium/fibroblast (endothelium or fibroblast cell), epithelium, immune cells, and stem/cancer cell (gastric stem or cancer cell), respectively with cell counts of 63, 20, 3164 and 2153.The BCH869 data generated using smart-seq2 was downloaded in FASTQ files from GEO (GSE102130) by request from the authors. Within the original BCH869 dataset, there are 489 cells, with four tumor subclones composed of 227, 221, 34, and 7 cells, according to the original study14, 3 normal cells (Oligodendrocytes and immune cells) from the same sample were combined in the dataset as reference cells in our study.ATC2 is an anaplastic thyroid cancer sample with 6224 cells where tumor purity was found extremely low9. TNBC1 is a triple-negative breast cancer sample with 1097 cells (2 subclones in tumor cells)9. The ATC2 and TNBC1 datasets were generated using 3′ 10x Genomics platform by Gao et al.9 and the aligned reads in BAM format and the called cell list were directly downloaded from GEO (GSE148673). The GBM is a second recurrence (2R) IDH1-mutant astrocytoma sample with 4416 cells sequenced by droplet-based snRNA-seq in ref. 18. For more details on the 5 datasets, see Supplementary Data 1.Given the aligned BAM file(s) and cell lists, we performed the genotyping by using cellsnp-lite v1.2.013 and genotype phasing by using the Haplotype Reference Consortium panel25 with EAGLE224, powered by Sanger Imputation Server (https://imputation.sanger.ac.uk). Then, feature counting was achieved by using xcltk v0.1.15 (see above Section “Preprocessing: extract count matrices with xcltk”).Benchmarking CNA detection performanceIt is necessary to benchmark the CNA inference methods using gold standard annotation as ground truth. The CNA profile of the BCH869 sample was well characterized by WGS and scRNA-seq, which is used as ground truth to identify copy number loss, copy number gain, and loss of heterozygosity. The BCH869 dataset has a complex clonal annotation (4 subclones with different CNA events) where the clone-specific CNA annotation was determined from the combination of scRNA-seq data (by InferCNV) and bulk WGS with haplotype analysis (BCH869 CNA ground truth in Supplementary Data 2, Number of consensus cells and genes of five methods for benchmarking on BCH869 in Supplementary Data 3).With ground truth of BCH869 sample, we benchmarked XClone with four other CNAs detection tools specifically designed for scRNA-seq data, InferCNV26, CaSpER11, CopyKAT9 and Numbat12 (XClone v.0.3.4, InferCNV v.1.8.0, CopyKAT v.1.0.4, CaSpER v.0.2.0 and Numbat v.1.2.1) using their default parameters (details in Supplementary Methods and running efficiency in Supplementary Data 4).For CNA detection performance evaluation at the single-cell level, we built a consensus cell-by-gene signal matrix for each tool at each CNA state. For XClone and Numbat, CNA event posterior probability was used to build a cell-by-gene matrix, and for InferCNV, CaSpER, and CopyKAT, the modified expression was used. For Numbat and CaSpER whose output is at genomic region resolution, we mapped them to the gene level. The ground truth matrix is a same-dimensional cell-by-gene binary matrix indicating the existence of CNA events in the consensus cells and genes obtained above. The ground truth matrices are prepared for each state (copy gain, copy loss, and LOH) independently. Ground truth matrix value 1 indicates the CNA exists and 0 means not. By flattening the matrix and varying the cutoff on signal scores, we calculated the true positive rate (TPR) and false positive rate (FPR) between the detected (cell-by-gene signal matrix) and real CNA regions (cell-by-gene binary ground truth matrix) for each tool, further generated a ROC curve and calculated the area under the ROC curve (AUC) values for evaluation. It is worth mentioning that for fair benchmarking with different tools, XClone can output CNAs in two resolutions: (1) gene_scale, where the CNA probabilities are reported for each gene in each cell; (2) chromosome arm_scale, where the CNA probabilities are reported for each arm of chromosomes in each cell. We showed the gene scale benchmarking results in Fig. 2f–h and the chromosome arm scale benchmarking results in Supplementary Fig. 5 for BCH869. Additionally, we also showed the Precision-Recall curves of each method at both the gene scale and arm scale for BCH869 in Supplementary Fig. 6.Further, we take the ATC2 sample with extremely low tumor purity to benchmark the subclone identification performance. We found that inferCNV and CopyKAT both detect 2 subclones obviously (Fig. 3g, h). By Combining expression with BAF analysis, we annotated the non-normal cells clusters into subclone1, and subclone2, which we used as cell annotation for assessing the subclonal detection performance for XClone by calculating the Adjusted Rand Index (ARI) metric (0.998 compared with inferCNV; 0.961 compared with CopyKAT, and 0.963 for inferCNV with CopyKAT, Fig. 3a, bottom panel).For the TNBC1 sample, as a high-quality annotation is lacking and it implies complex CNA profiles, we only used it in a qualitative manner by examining the consistency between the smoothed BAF signal (without inference) and the inferred CNA states. In this sample, the original paper reports two major subclones that comprised 44% (subclone A) and 28% (subclone B) of the tumor mass within the identified tumor cells9. By Combining expression with BAF analysis, we annotated the cell clusters into Clone 1, Clone 2, and Normal cells, which we used as cell annotation for assessing the clonal reconstruction performance for XClone by calculating the ARI metric (0.9653; Fig. 4e).Then, we add a second recurrence astrocytoma sample with a newly emerged minor clone to perform the benchmarking. This GBM sample has undergone snRNA-seq from 10x Genomics, bulk whole exome sequencing (WES), and scOne-seq, which concurrently profiles DNA and RNA within individual cells with exceptional efficiency. Consequently, this sample serves as an excellent dataset with CNA ground truth for benchmarking. The CNA ground truth file of the minor clone (2R clone 1, with 53 cells) is in Supplementary Data 7. The number of consensus cells and genes used in the benchmarking are listed in Supplementary Data 8.These evaluations provide an understanding of the strengths and weaknesses of each CNA detection method, highlighting the areas where improvements might be needed. The insights gained from this comparison can guide the choice of methods in future studies, depending on the specific features of the CNAs that are of interest.Differential gene expression analysisWe conducted a differential gene expression analysis on the BCH869 glioma dataset, by utilizing the “rank_genes_groups” function in “scanpy”. This analysis aimed to identify statistically significant differences (absolute log2fc > 1 and adjusted p < 0.05) in gene expression between clone 1 and clones 2–4 on chromosome 14, where allele-specific loss is observed.The same Scanpy function was used to perform differential expression analysis on the TNBC1 breast cancer dataset. Comparisons were made between Clone 1 and Normal, and Clone 2 and Normal to determine DEGs. Subsequently, the “enrichGO” function in “ClusterProfiler”33 was used to perform gene set enrichment analysis on the DEGs, thereby identifying biological processes that are over-represented in these genes.Both the differential expression analysis and gene set enrichment processes were adjusted for multiple testing through False Discovery Rate (FDR) estimation, using the Benjamini–Hochberg (BH) method for p-value correction.Seed dataset GX109-T1c and the generation of BAM files for different simulations of challenging scenariosOur scCNAsimulator requires a reliable seed dataset as input to generate sequencing reads (supporting UMIs in a BAM format) over allelic CNA profiles (See the details in Supplementary Technical Notes). To support this, we used a clean dataset generated by an in-house collected fresh gastric cancer tissue GX109-T1c. This tissue was processed to yield single-cell RNA sequencing data, encompassing 5400 cells using the 5′ end protocol from 10x Genomics, as well as single-cell DNA sequencing data obtained through the 10x Genomics. Additionally, a bulk WES was conducted on a frozen sample of the same tissue (GX109-TF, sampled from an area contiguous with the blocks used for scRNA-seq and scDNA-seq, Supplementary Fig. 15) to provide a complete genomic picture. The CNA states and regions called from both bulk WES (via CNV-facet tool34) and scDNA-seq (via 10X CNV kit and CHISEL6) are highly concordant, including copy gain at chr8, chr20, copy loss at chr12p (chr12:7288360-16757762) and chr18, and CN-neutral LOH at chr1q, chr6p, chr10q and chr11q (Supplementary Figs. 15, 16; Supplementary Data 1 and 9). Additionally, the scDNA-seq CNA clonal structure is relatively simple in this sample, consisting of two major subgroups, with/without copy number changes. Following the regular analysis pipeline30, the scRNA data clustering of this sample further confirmed this clonal structure by separating the cells into two significant populations: one (3227 in 5400, 59.8%) was classified as tumor microenvironment cells with immune or fibroblast markers expressed, the other one (2153 in 5400, 39.9%) was classified as gastric stem/carcinoma cells with strong expression of stem cell markers and EPCAM (Supplementary Fig. 17). These two scRNA groups are considered concordant with the two scDNA-seq groups, respectively, with CN neutral and harboring CNAs, hence it can serve as an ideal seed dataset for our simulator scCNASimulator to generate synthetic data in diverse scenarios, as listed below.Simulation 1: Sub-clonal division within the tumor populationFor the 2153 cells categorized as “stem, cancer cell” we created two sub-populations of 1076 and 1077 cells. Each sub-group exhibited a unique allele copy number loss on chromosome 3 (See the details in Supp technical notes). We achieved this by 3 main steps: (1) Sub-clone Assignment: Assign each tumor cell to one of the sub-clones. Split the 2153 tumor cells into two groups of approximately equal size (1076 and 1077 cells). (2) Phasing: Use the output from Sanger Phasing, which provides allele-specific UMIs for chromosome 3 in each tumor cell. This phasing information assigns UMIs to specific alleles, allowing for the simulation of allele-specific copy number loss. (3) Modify the BAM File: Allele-Specific UMI Discarding or Retention: For each sub-clone, it discarded allele-specific UMIs from the BAM file to simulate a copy number loss for that particular allele on chromosome 3. This means removing UMIs that correspond to the allele with the copy number loss in one sub-clone but retaining them in the other. Discard Ambiguous UMIs: Remove any UMIs that do not have clear allele information (“ambiguous” UMIs) at the probability of 0.5 (mimic the natural state) to avoid introducing noise or confounding factors in the CNA analysis.Simulation 2: Reference cell downsampling to minor quantitiesThe downsampling process for reference cells effectively reduced the “immune cells” population in the BAM file to smaller, more specific subsets. Here is a step-by-step explanation of how the downsampling was performed: (1) Selection of Cell Barcodes: From the total pool of 3164 “immune cells”, we randomly selected a predetermined number of cell barcodes corresponding to the subset sizes we wanted to analyze 10, or 5 cells. (2) Filtering of UMIs: Using the selected cell barcodes, we then filtered the BAM file. This involved removing all UMIs that were associated with cell barcodes not included in the selected subsets. This ensures that the BAM file only contains data from the 10 or 5 “immune cells” that we aim to analyze. This simulation process was supported by samtools.$ samtools view -h in.bam ∣ grep -F -f selected_barcodes.txt ∣ samtools view -b > out.bamIn this pseudocode, selected_barcodes.txt would be a text file containing the list of sampled cell barcodes that we want to keep.Simulation 3: Tumor cell downsampling to limited quantitiesSimilar to the simulation in reference cells, from the 2153 “stem, cancer cell” group, we sampled smaller cohorts of 22, 10, and 5 cells to generate 3 different simulated BAM files.All of the tasks can be achieved by our customized tool scCNASimulator.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles