ScRNAbox: empowering single-cell RNA sequencing on high performance computing systems | BMC Bioinformatics

scRNAbox overviewThe scRNAbox pipeline consists of R scripts utilizing the Seurat framework, and other R packages including Doublet finder and SoupX for scRNAseq analysis, which are submitted to the SLURM workload manager (job scheduling system for Linux HPC clusters) using bash scripts from the command line [17]. Beginning with 10X Genomics expression data from raw sequencing files, the pipeline facilitates standard steps in scRNAseq processing through to differential gene expression between two different conditions. The scRNAbox framework consists of three main components: (i) R scripts, (ii) job submission scripts, and (iii) parameter and configuration files. The pipeline is separated into Steps, which correspond to analytical tasks in the scRNAseq analysis workflow (Fig. 1). Users can tailor their analysis by manipulating the parameters in the step-specific parameter files. The pipeline can analyze scRNAseq experiments where each sample is captured separately (standard track) or multiplexed experiments where samples are tagged with sample-specific oligonucleotide tagged Hashtag antibodies (HTO), pooled, and sequenced together (HTO track) [18, 19]. The results of each step are reported in intuitive tables, figures, and intermediate Seurat objects [9]. Upon submitting the bash script for a step, “Jobs”, or resource requests are created based on the parameters defined in the configuration file, including CPUs, memory, and time. Jobs are submitted to the HPC system using the SLURM “Scheduler” to execute the R scripts. A complete user guide and the code used in this manuscript can be found at the scRNAbox GitHub site: https://neurobioinfo.github.io/scrnabox/.Fig. 1ScRNAbox analysis workflow. The scRNAbox pipeline provides two analysis tracks: 1) standard scRNAseq and 2) HTO scRNAseq. A Standard scRNAseq data is prepared by sequencing each sample separately, resulting in distinct FASTQ files for each sample. B HTO scRNAseq data is produced by tagging the cells from each sample with unique oligonucleotide “Hashtag” conjugated antibodies (HTO). Tagged cells from each sample are then pooled and sequenced together to produce a single FASTQ file. Sample-specific HTOs are used to computationally demultiplex samples downstream. C Steps of the scRNAbox pipeline workflow. Steps are designed to run sequentially and are submitted using the provided bash scripts through the command line. scRNAbox takes FASTQ files as input into Step 1; however, the pipeline can be initiated at any step which takes the users processed data as inputInstallationScRNAbox can be installed by two different methods on any HPC Linux system. 1) Direct installation via the scrnabox.slurm package, which contains the Bash and R scripts, parameter files, and configuration file. The HPC system must have CellRanger (10 × Genomics) and R (v4.2.1 or higher) [20] installed and must use a SLURM scheduler. Users must also run a provided bash script which will automatically install all of the R libraries required for the scRNAbox pipeline. 2) Through a Singularity container, which provides the Bash and R scripts, R library, and Cell Ranger. The container can work with the SLURM scheduler or without job submissions directly on a local computer with sufficient memory. The Singularity container is available from Zenodo: https://zenodo.org/records/12751010.Step 0: Initiation and configurationFollowing installation, users run Step 0 to initiate the pipeline and specify if they will use the standard or HTO analysis track. Step 0 creates the job submission configuration files and the step-specific parameter files. The configuration file contains the time and memory usage settings for each step and must be edited to match the user’s needs. After Step 0, each subsequent Step can be run individually through separate commands or all together in a single command.Step 1: FASTQ to gene expression matrixFile structure and inputsPrior to running the CellRanger counts pipeline, a parent directory (“samples_info”) must be created in the working directory. The “samples_info” directory must contain a folder for each sample; the name of the sample-specific folders will eventually be used to name the samples in downstream steps. Each sample-specific folder must contain a library.csv file, which defines the information of the FASTQ files for the specific sample. The HTO analysis track also requires a feature_ref.csv file, which specifies the oligonucleotide sequences of the Hashtags. Step 1 runs a script to automatically generate these files based on the user input in the parameter file. However, users can manually generate the required files and structure.Running cellrangerScRNAbox deploys the CellRanger counts pipeline to perform alignment, filtering, barcode, and unique molecular identifier counting on the FASTQ files. Each sample is processed by the CellRanger counts pipeline in parallel. Although CellRanger is processed with default parameters, all relevant parameters can be adjusted (10X Genomics).Step 2: Create Seurat object and remove ambient RNAAmbient RNA detectionThe R package SoupX is used to account for ambient RNA, providing users the option to correct the gene expression matrices for RNA contamination [21]. SoupX quantifies the contamination fraction according to the expression profiles of empty droplets and cell clusters identified by the CellRanger counts pipeline. Marker genes used to estimate the contamination rate are automatically identified using the AutoEstCont function and the expression matrix is corrected per the estimated contamination rate using the adjustCounts function.Generation of the seurat object and quality control metricsThe Seurat function CreateSeuratObject is used to take in the CellRanger (if not removing ambient RNA) or SoupX (if removing ambient RNA) generated feature-barcode expression matrices, and create the list-type Seurat object [9]. The number of genes expressed per cell (number of unique RNA transcripts) and the total number of RNA transcripts are automatically computed. The proportion of RNA transcripts from mitochondrial DNA (gene symbols beginning with “MT”) and the proportion of ribosomal protein-related transcripts (gene symbols beginning with “RP”) are both calculated using the Seurat PercentageFeatureSet function. Following the Seurat workflow, the CellCycleScoring function with the Seurat S and G2/M cell cycle phase reference genes are used to calculate the cell cycle phase scores and generate a principal component analysis (PCA) plot [22].Step 3: Quality control and generation of filtered data objectsScRNAbox allows users to filter low quality cells by defining upper- and lower-bound thresholds in the parameter files based on unique transcripts, total transcripts, percentage of mitochondrial-encoded transcripts, and percentage of ribosome gene transcripts. Users can also remove or regress a custom gene list from the dataset. The filtered counts matrix is then normalized, the top variably expressed genes are identified, and the data are scaled using Seurat functions. Linear dimensional reduction is performed via PCA and an elbow plot is generated to visualize the dimensionality of the dataset and inform the number of principal components (PC) to be used for doublet detection in Step 4.Step 4: Demultiplexing and doublet removalDoublet detection and removal (Standard track)Barcodes that are composed of two or more cells are identified as doublets using DoubletFinder [23]. Doublets are predicted based on the proximity of each cell’s gene expression profile to that of artificial doublets created by averaging the transcriptional profiles of randomly chosen cell pairs. The default value of 0.25 for the number of artificial doublets is used. The neighbourhood size corresponding to the maximum bimodality coefficient is selected and the proportion of homotypic doublets is computed using the modelHomotypic function. Users can define the number of PCs to use for doublet detection and the expected doublet rate for each sample. Users have the option to remove doublets from downstream analyses or just calculate the doublet rate.Demultiplexing followed by doublet removal (HTO track)Pooled samples are demultiplexed based on their sample-specific HTO labels using Multi-seq [19]. The automatically detected inter-maxima quantile thresholds of the probability density functions for each barcode are used to classify cells. Cells surpassing one HTO threshold are classified as singlets; cells surpassing > 1 thresholds are classified as doublets; the remaining cells are assigned as “negative”. The counts observed for each barcode are reported in a summary file and plots are generated to visualize the enrichment of barcode labels across sample assignments. Users have the option to remove doublets and negatives from downstream analyses.Step 5: Creation of a single Seurat object from all samplesIntegration or merging samplesThe individual Seurat objects are integrated to enable the joint analysis across sequencing runs or samples by deploying Seurat’s integration algorithm [24]. The genes that are variable across all samples are detected by the SelectIntegrationFeatures function. Integration anchors (pairs of cells in a matched biological state across datasets) are selected by the FindIntegrationanchors function, and the IntegrateData function is used to integrate the datasets by taking the integration anchors as input. Alternatively, users may simply merge the normalized counts matrices using Seurat’s merge function without performing integration.Linear dimensional reductionSeurat functions are used to normalize the count matrix, find the most variably expressed genes, and scale the data. Linear dimensional reduction is then performed via PCA using the top variably expressed genes as input. An elbow plot to visualize the variance contained within each PC and jackstraw plot to visualize “significant” PCs are produced. These plots inform the number of PCs that should be retained for clustering in Step 6.Step 6: ClusteringClustering is performed to define groups of cells with similar expression profiles using the Seurat implementation of the Louvain network detection with PCA dimensionality reduction as input [9]. K-nearest neighbours are calculated and used to construct the shared nearest neighbour graph. The Jaccard similarity metric is used to adjust edge weights between pairs of cells, and the Louvain algorithm is used to iteratively group cells together based on the modularity optimization. To assist users in selecting the optimal clustering conditions, we include an option to compute the Louvain clustering N times at each clustering resolution, while shuffling the order of the nodes in the graph for each iteration. The average and standard deviation of the Adjusted Rand Index (ARI) between clustering pairs at each clustering resolution is then calculated [25]. A ClustTree plot [26] and uniform manifold approximation and projection (UMAP) plots are generated to visualize the effect of clustering parameters.Step 7: Cluster annotationCluster annotation is performed to define the cell types comprising the clusters identified in Step 6. ScRNAbox provides three tools to identify cell types comprising the clusters.Tool 1: Cluster marker gene identification and gene set enrichment analysisScRNAbox identifies genes that are significantly up regulated within each cluster by using the Seurat FindAllMarkers function, implementing the Wilcoxon rank-sum test [9] with a log2 fold-change (L2FC) threshold of 0.25. Differentially expressed genes (DEGs) are calculated by comparing each cluster against all other clusters. Only upregulated genes are considered for cluster marker genes. A heatmap is generated to visualize the expression of the top marker genes for each cluster at the cell level. All significant, upregulated DEGs are used as the input for gene set enrichment analysis (GSEA) across user-defined libraries that define cell types using the EnrichR tool [27]. Cluster-specific tables are generated to report all enriched cell types and bar plots visualize the most enriched terms.Tool 2: Expression profiling of cell type markers and module scoresScRNAseq allows users to visualize the expression of individual genes and the aggregated expression of multiple genes from user-defined cell type marker gene lists. For each gene in a user-defined list, a UMAP plot visualizes its expression at the cell level, while violin and dot plots visualize its expression at the cluster level. Aggregated expression of user-defined cell type marker gene lists is calculated using the Seurat AddModuleScore function [22]. The average expression of each cell for the gene set is subtracted from randomly selected control genes, resulting in cell-specific expression scores, with larger values indicating higher expression across the gene set.Tool 3: Cell type predictions based on reference dataScRNAbox utilizes the Seurat label transfer method: FindTransferAnchors and TransferData functions, to predict cell-type annotations from a reference Seurat object [24]. Predicted annotations are directly integrated into the query object’s metadata and a UMAP plot is generated to visualize the query dataset, annotated according to the predictions obtained from the reference.Adding annotationsScRNAbox uses the Seurat AddMetaData function and a user-defined list of cell types in the parameter file to add cluster annotations. The cluster annotations from each iteration of the step will be retained, allowing users to define broad cell types and subtypes. UMAP plots with the annotation labels are generated to visualize the clustering annotations at the cell level, allowing users to check the accuracy of their annotations.Step 8: Differential gene expression analysis (DGE)Metadata defining the groups to be compared are added to the Seurat object by submitting a.csv file containing sample information with phenotypic or experimental data. The additional metadata is used to define the variables to compare for the DGE. ScRNAbox allows DGE to be calculated between conditions using all cells or cell type groups using two different data preparations: cell-based or sample-based DGE.Cell-based DGECells are used as replicates and DGE is computed using the Seurat FindMarkers function to compare user-defined contrasts for a given variable [9]. While FindMarkers supports several statistical frameworks to compute DGE, we set the default method in our implementation to MAST, which is tailored for scRNAseq data [28]. MAST models both the discrete expression rate of all genes across cells and the conditional continuous expression level, which is dependent on the gene being expressed in the cell, by a two-part generalized linear model [28]. Regardless of the method used, P values are corrected for multiple hypothesis testing using the Bonferroni method. Users can perform their own p-value adjustments using the DEG files output from the pipeline.Sample-based DGETo calculate DGE using samples or subjects as replicates, scRNAbox applies an aggregate pseudo-bulk analysis [29]. First, the Seurat AggregateExpression function is used to compute the sum of RNA counts for each gene across all cells from a sample [30]. These values are then input into the DESeq2 framework, which uses gene dispersal to calculate DGE [31]. P-values are corrected for multiple hypothesis testing using the Bonferroni method, which can be recalculated from the pipeline output.Analysis of differentially expressed genesStep 8 produces data tables of the DEGs for each of the defined contrasts. These outputs can be used for gene enrichment pathway analysis using web-apps or though application program interfaces with reference libraries using a programming language, in our case, R. Further analysis of the results is experiment-dependent and must be completely tailored to the research questions. We used the ClusterProfiler R package to identify significantly enriched Gene Ontology (GO) terms with the gseGO function [32]. We utilized the ‘org.Hs.eg.db’ Bioconductor annotation package to access human (Homo sapiens) gene annotations for our analysis. The ggplot2 R package was used for data visualization [33].

ScRNAbox: empowering single-cell RNA sequencing on high performance computing systems | BMC Bioinformatics

Quantifying constraint in the human mitochondrial genome

Integrative biomarker discovery and immune profiling for ulcerative colitis: a multi-methodological approach

The dynamics of blaTEM resistance genes in Salmonella Typhi

Influence of climate on desert locust (Schistocerca gregaria Forskål, 1775) Plague and migration prediction in tropics

Global profiling of protein complex dynamics with an experimental library of protein interaction markers

Hot Topics

Quantifying constraint in the human mitochondrial genome

Integrative biomarker discovery and immune profiling for ulcerative colitis: a multi-methodological approach

The dynamics of blaTEM resistance genes in Salmonella Typhi

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Quantifying constraint in the human mitochondrial genome

Integrative biomarker discovery and immune profiling for ulcerative colitis: a multi-methodological approach

The dynamics of blaTEM resistance genes in Salmonella Typhi

Influence of climate on desert locust (Schistocerca gregaria Forskål, 1775) Plague and migration prediction in tropics

Popular Articles

Quantifying constraint in the human mitochondrial genome

Integrative biomarker discovery and immune profiling for ulcerative colitis: a multi-methodological approach

The dynamics of blaTEM resistance genes in Salmonella Typhi