Mugen-UMAP: UMAP visualization and clustering of mutated genes in single-cell DNA sequencing data | BMC Bioinformatics

Implementation

Mugen-UMAP is implemented in Python and provides three main functions (Fig. 1). (i) convert allows users to convert their somatic single-nucleotide variant (SNV) annotation files and a metadata file into AnnData format [5], which stores a data matrix of genes by cells. Each entry in the matrix represents the number of mutations in a given gene for a given cell. The input can be either a ZIP file or a directory containing the annotated mutation files of each cell, generated by ANNOVAR [6] from the mutations recorded in Variant Call Format (VCF) files. The metadata file should contain the patient ID or sample ID in the first column, along with other related information, such as the type (histology type), stage (diagnostic stage), and relevant numerical data (e.g., number of cells). The program automatically selects the non-numerical columns for the subsequent plotting steps. (ii) umap allows users to plot UMAP projections (e.g., for clinical subjects, colored by patient ID, histology type, or diagnostic stage) by adapting the common Scanpy workflow [7], which includes (1) removing genes that are mutated in fewer than 3 cells, (2) excluding cells with fewer than 30 mutated genes, (3) excluding outlier cells whose mutated gene counts exceed those of 98% of all samples, (4) normalizing counts in each cell followed by logarithmization, and (5) selecting the top 3000 highly variable genes and regressing out the effects of total counts per cell. It also generates a Venn diagram using Venny4Py (https://github.com/timyerg/venny4py), together with various summary reports. Moreover, visualizations of each filtering step (along with the corresponding cutoff values) are generated (e.g., Fig. S1 for the NSCLC dataset), which allow users to assess the impact of the filtering steps and to optimize the filtering parameters for their own studies.
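Filtering steps (1) to (4) can be illustrated with a minimal, dependency-free sketch. This is not the Mugen-UMAP source code: the function name `filter_and_normalize`, the nearest-rank reading of the 98% cutoff, and the `target_sum` scaling value are assumptions made for illustration. Scanpy offers `sc.pp.filter_genes`, `sc.pp.filter_cells`, `sc.pp.normalize_total`, and `sc.pp.log1p` for the same operations, although the exact calls and arguments used by the tool are not stated here. Step (5) is omitted because it depends on Scanpy itself.

```python
import math


def filter_and_normalize(counts, min_cells=3, min_genes=30,
                         upper_quantile=0.98, target_sum=1e4):
    """counts maps cell ID -> {gene: number of mutations} (non-zero entries).

    Hypothetical helper: thresholds mirror the text, target_sum is assumed.
    """
    # (1) Keep genes mutated in at least `min_cells` cells.
    cells_per_gene = {}
    for genes in counts.values():
        for gene, n in genes.items():
            if n > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    kept_genes = {g for g, c in cells_per_gene.items() if c >= min_cells}
    cells = {cell: {g: n for g, n in genes.items() if g in kept_genes and n > 0}
             for cell, genes in counts.items()}

    # (2) Keep cells with at least `min_genes` mutated genes.
    cells = {cell: genes for cell, genes in cells.items()
             if len(genes) >= min_genes}

    # (3) Drop outlier cells above the chosen quantile (nearest-rank method,
    #     an assumed interpretation of the 98% cutoff).
    if cells:
        sizes = sorted(len(genes) for genes in cells.values())
        rank = max(1, math.ceil(upper_quantile * len(sizes)))
        cutoff = sizes[min(rank, len(sizes)) - 1]
        cells = {cell: genes for cell, genes in cells.items()
                 if len(genes) <= cutoff}

    # (4) Scale each cell to `target_sum` total counts, then apply log(1 + x).
    normalized = {}
    for cell, genes in cells.items():
        total = sum(genes.values())
        normalized[cell] = {g: math.log1p(n * target_sum / total)
                            for g, n in genes.items()}
    return normalized


# Toy input with deliberately small thresholds so each filter has an effect.
toy = {
    "c1": {"A": 1, "B": 2},
    "c2": {"A": 1, "B": 1, "C": 1},
    "c3": {"A": 2, "B": 1, "C": 1, "D": 1},
    "c4": {"D": 1, "E": 5},
}
result = filter_and_normalize(toy, min_cells=2, min_genes=2, upper_quantile=0.5)
```

With these toy thresholds, gene E is removed by step (1), cell c4 by step (2), and cell c3 by step (3), so only c1 and c2 are normalized.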
Furthermore, two clustering algorithms, Leiden [8] and Louvain [9], are provided for detecting cell clusters or patterns. (iii) all executes the full pipeline, running the convert and umap functions in sequence.

Fig. 1 Diagram of the Mugen-UMAP workflow. (A) Single-cell somatic mutations annotated by ANNOVAR, together with the corresponding patient information, were converted into the AnnData format. Subsequently, UMAP projections colored according to (B) patient ID, (C) histology type, (E) diagnostic stage, (F) metastatic status, and (G) the Leiden algorithm, as well as (D) the Venn diagram, were generated from the single-cell DNA sequencing data, along with various statistical analyses. The numbers in the Venn diagram represent the counts of mutated genes shared among the different histological subtypes of NSCLC, including adenocarcinoma, squamous cell carcinoma, large cell carcinoma, and spindle cell carcinoma.

Application of Mugen-UMAP to example datasets

To demonstrate the capabilities of Mugen-UMAP, we applied it to a dataset comprising 365 single-cell samples isolated from the primary tumors of 12 NSCLC patients (median of 23 cells per patient, range 7 to 71), together with one corresponding normal bulk tissue sample for each patient [4] (Table 1). Whole-exome sequencing was performed for all samples on the Illumina platform, achieving an average coverage depth of 198.1X for normal bulk tissues (median depth of 163.8X) and 101.5X for tumor single cells (median depth of 100.1X). Somatic SNVs were detected individually for each tumor single-cell sample against the matched normal bulk sample with VarScan v2.4.3 [10], using the default parameters except that the minimum read coverage was increased to 10 reads in both the tumor and matched normal samples. Then, somatic SNVs located within repeat regions (as annotated by RepeatMasker on the UCSC Table Genome Browser [11]) and those falling outside the exon target regions were excluded.
To avoid potentially low-quality somatic SNV calls, SNVs were retained only if their sites could be genotyped by GATK HaplotypeCaller [12] in at least 70% of all samples for each patient.
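The 70% genotyping criterion amounts to a per-patient fraction check, sketched below. The function name `retain_well_genotyped_sites`, the input layout, and the toy coordinates are hypothetical. Whether the matched normal sample counts toward "all samples" is not specified in the text, so `n_samples` is left to the caller.

```python
def retain_well_genotyped_sites(genotyped_samples, n_samples, min_fraction=0.7):
    """Return the sites genotyped in at least `min_fraction` of a patient's samples.

    genotyped_samples maps a site, e.g. (chrom, pos), to the set of sample IDs
    in which a genotype could be called at that site (hypothetical layout).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return {site for site, samples in genotyped_samples.items()
            if len(samples) / n_samples >= min_fraction}


# Toy patient with 10 samples; coordinates and sample IDs are made up.
samples = [f"cell{i}" for i in range(10)]
toy_sites = {
    ("chr1", 1000): set(samples[:8]),  # genotyped in 8 of 10 samples
    ("chr2", 2000): set(samples[:6]),  # genotyped in 6 of 10 samples
    ("chr3", 3000): set(samples),      # genotyped in all 10 samples
}
retained = retain_well_genotyped_sites(toy_sites, n_samples=len(samples))
```

In this toy case the site genotyped in only 6 of 10 samples falls below the threshold and is discarded.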
Table 1 Information on the 12 non-small cell lung cancer (NSCLC) patients

Furthermore, to showcase the broad applicability of Mugen-UMAP, we obtained 9 single-cell whole-exome sequencing (WES) datasets from various studies [13,14,15,16,17,18] (Table 2), encompassing 332 single-cell samples from six different cancer types (bladder, blood, breast, colon, kidney, and lung). Each dataset represents an individual patient, except for Wu-CRC0827 and WuCRC0827-polyps, which are from the same patient. The SNV calling pipeline for these 9 datasets is described in Borgsmüller et al. [19]. For both example datasets, the mutations in the VCF files of each cell were then annotated using ANNOVAR [6] with the Catalogue of Somatic Mutations in Cancer (COSMIC) database [20], and only non-synonymous SNVs were retained for subsequent analysis. However, for the 9 additional single-cell WES datasets, because the total number of mutated genes remaining after filtering was only 1002, we retained all of these genes for subsequent analysis.
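The step from annotated per-cell mutations to a gene-by-cell count matrix can be sketched as follows. This is a sketch under stated assumptions, not the tool's convert implementation: it assumes tab-delimited ANNOVAR output with `Gene.refGene` and `ExonicFunc.refGene` columns (the names produced when annotating against refGene), and the helper names and toy rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_mutations_per_gene(annovar_text, gene_col="Gene.refGene",
                             func_col="ExonicFunc.refGene"):
    """Count non-synonymous SNVs per gene in one cell's annotation table."""
    reader = csv.DictReader(io.StringIO(annovar_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        if row[func_col] == "nonsynonymous SNV":
            counts[row[gene_col]] += 1
    return counts


def build_gene_by_cell_matrix(per_cell_counts):
    """Arrange per-cell counts as rows = genes, columns = cells."""
    genes = sorted({g for counts in per_cell_counts.values() for g in counts})
    cells = sorted(per_cell_counts)
    matrix = [[per_cell_counts[cell].get(gene, 0) for cell in cells]
              for gene in genes]
    return genes, cells, matrix


# Toy annotation tables (made-up positions, assumed column layout).
header = ["Chr", "Start", "End", "Ref", "Alt", "Gene.refGene", "ExonicFunc.refGene"]


def to_table(rows):
    return "\n".join("\t".join(r) for r in [header] + rows) + "\n"


cell1 = to_table([
    ["1", "100", "100", "A", "G", "TP53", "nonsynonymous SNV"],
    ["1", "200", "200", "C", "T", "TP53", "nonsynonymous SNV"],
    ["2", "300", "300", "G", "A", "KRAS", "synonymous SNV"],
])
cell2 = to_table([
    ["3", "400", "400", "T", "C", "EGFR", "nonsynonymous SNV"],
    ["1", "100", "100", "A", "G", "TP53", "nonsynonymous SNV"],
])
per_cell = {"cell1": count_mutations_per_gene(cell1),
            "cell2": count_mutations_per_gene(cell2)}
genes, cells, matrix = build_gene_by_cell_matrix(per_cell)
```

Each matrix entry is the number of retained mutations in one gene for one cell, so the synonymous KRAS variant contributes nothing and KRAS does not appear as a row.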
Table 2 Nine published single-cell whole-exome sequencing (WES) cancer datasets
