Denoiseit: denoising gene expression data using rank based isolation trees | BMC Bioinformatics

DataOur analysis incorporates two datasets: The Cancer Genome Atlas (TCGA) public dataset [14] and the COVID-19 dataset [15]. The TCGA public dataset is a well-established and widely used resource for cancer genomics research. Here, we utilized the gene expression profiles of four cancer types, which are COAD (Colon adenocarcinoma, n = 222), STAD (Stomach adenocarcinoma, n = 305), BRCA (Breast invasive carcinoma, n = 595) and LUAD (Lung adenocarcinoma, n = 180) that were initially comprised of 56,000 genes. Genes with less than an average expression count of five were excluded, resulting in 24852, 24913, 25428 and 24913 genes in the COAD, STAD, BRCA and LUAD datasets, respectively.The COVID-19 dataset is a multi-omics dataset that includes gene expression profiles of peripheral blood mononuclear cells (PBMCs) from COVID-19 infected patients in South Korea. The COVID-19 dataset encompasses single-cell RNA-seq data from 456 samples comprised of 20212, 20875, 20541, 19541 and 16874 genes in each of the CD4 T, CD8 T, Monocyte, Natural Killer (NK) and B cell type, respectively. To objectively categorize the severity of each sample, the World Health Organization score (WHO [16]) was utilized as our metric. The maximum WHO score attained by each sample during their hospital stay served as the benchmark for labeling severity. Specifically, samples with a maximum WHO score exceeding 5 were classified as “Severe”(n = 92), while those with a maximum score of 5 or below were designated as “Moderate”(n = 364). This stratification methodology was implemented to ensure a uniform and fair assessment of sample severity across the cohort. The COVID-19 dataset was preprocessed by first annotating the cell type of each scRNA-seq sample using Azimuth [17], then aggregating each cell type’s single-cell RNA-seq sample into a single pseudobulk sample. At the cell type level 1 of the Azimuth reference dataset, six cell types are present: CD4 T, CD8 T, NK, B, Monocyte and dendritic cells. Among those, the dendritic cell type exhibited less than three cell counts in all the samples, and thus was not considered for further analysis. A total of 1,364,590 cells remained, that were distributed as follows: CD4 T = 160,646, CD8 T = 174,239, B = 145,866, NK = 157,844 and Monocyte = 599,344. Finally, each cell type specific pseudobulk sample was subject to trimmed mean of M-values (TMM) normalization. A pseudobulk sample was made by the aggregated sum of the gene expression counts for each cell type. For the further evaluation, simulated gene expression count data were generated. We created samples that follow the negative binomial distribution for 20,000 genes varying sample sizes of n = 25, 50, 75, 100, 150, 200, 300, and 400. All the simulated samples were structured into two groups to reflect two distinct conditions where genes were differentiated based on their ability to distinguish between these groups, characterized by a true log fold change (logFC) value. The average expression count of genes was set to 4 with a variance of 5. In addition, simulation data with no variance (i.e., no noise) between the samples within each group were generated to observe how many genes are retained by the various gene removal methods. For such purpose, the variance was set to 1, and the average expression count was set to 4.Performance evaluationTechnical evaluation was performed on the noise removed gene set output from DenoiseIt by comparing it with various unsupervised gene filtering methods, noisyR, PCAUFE, OutSingle, MGSACO and kVirtuals. For both TCGA and COVID-19 datasets, here the evaluation criteria was how well the noise removed gene set of each method was able to discriminate the cancer subtypes or the severity of COVID-19 patients per cell type, respectively. Below, a brief description of three gene selection methods are described that are used for the performance evaluation. The methods were further compared for biological correctness via DEG and pathway analysis. DEG analysis was performed on each method’s output gene set to investigate how well the DEGs from each gene set captured dataset related pathways. DESeq2 and edgeR were used for the DEG analysis.Threshold based gene filtering methodThe noisyR is a backward search method that reduces the impact of low-abundance genes on differential expression analysis. It uses read count values to determine the similarity between samples and applies a threshold to filter out genes that contribute to noise. kVirtuals is a forward search method selecting only small number of genes based on gene clustering using Normalized Mutual Information (NMI). Both method had their own threshold for filtering genes. PCAUFE is a PCA based unsupervised gene selection method. From the PCAUFE results, PC1 and PC2 were used to select genes with an adjusted p-value below 0.05.Gene ranking based filtering methodOutSingle provides gene selections based on p-value, while MGSACO outputs genes by specifying the desired number of genes to be retained. To ensure a fair comparison, we selected the same number of genes as DenoiseIt for each dataset.Wrapper methodWrapper methods iteratively add or remove features to optimize the model’s performance. Similarly, SAFS performs iterative clustering to enhance the clustering quality by adding or removing genes from the dataset. Unlike other methods, SVM-RFE is a supervised method, thus it is not specifically applicable for gene selection or filtering when sample labels are unknown. Nevertheless, it was included to observe its performance in comparison to the other unsupervised methods.No gene filteringTo evaluate the methods performance compared to the case without any gene filtering, we employed the full set of genes available in the datasets. Genes with expression values of zero across all samples were removed. Subsequently, the remaining data underwent \(log_2\) transformation, normalization and then min-max scaling to ensure consistency and to mitigate potential biases in the data. The TCGA expression data were quantile normalized, whereas TMM was used for single-cell data.Workflow of DenoiseItDenoiseIt is a novel approach for removing outlier genes from gene expression data. The rationale behind DenoiseIt is grounded in the hypothesis that samples within a same group should display similar expression patterns. Here, a group refers to a set of patients or samples with similar phenotypic background. Using NMF [18], we can easily observe whether the sample groups are well captured by the rank features. More importantly, by observing the rank features we can identify samples that deviate from their respective groups, along with the genes responsible for such discrepancies.Fig. 1DenoiseIt consists of three stages. First, it processes the gene expression data and decomposes it into basis and loading matrices using NMF. In the second step, each rank feature from the decomposed result are used to generate isolation trees to compute its outlier score. Finally, ranks are labeled as either inliers and outliers where genes associated to outlier ranks are removed. The remaining genes are used as input to various downstream analysisDenoiseIt is comprised of three stages: (1) NMF analysis, (2) outlier score computation and (3) outlier detection and removal (Fig. 1). In the first stage, the gene expression data is subject to NMF. The output of NMF are two matrices which serve as the input for the subsequent steps. In the second stage, outlier scores are computed using the loading and basis matrices from the NMF output. Here, isolation forests are generated on the decomposed ranks in H to compute the outlier scores, which effectively quantifies the likelihood of a rank being an outlier. In the last stage, genes associated to outlier ranks are identified using W and removed. This approach can also be applied for identifying outlier rank associated samples and prune any outlier samples instead of genes.Stage 1: NMF analysisFirst, genes with an average read count value of less than 5 were removed. The expression levels of the remaining genes were then \(log_2\) transformed and quantile normalized. Consider the gene expression dataset as a \(n \times m\) matrix K, where n and m refer to the number of genes and samples in matrix K respectively. Each column in K corresponds to a sample \(s_j\) for \(j=1, 2,…, m\), and each row corresponds to a gene \(g_i\) for \(i=1, 2,…, n\). After performing NMF on matrix K, two non-negative matrices W (\(n \times q\) matrix) and H (\(q \times m\) matrix) are obtained, where q is the number of ranks, which are denoted as \(r_t\) for \(t=1,2,…q\), such that \(K \approx WH\). Here, W and H represent the basis (gene component) and the coefficient (sample component) matrices, respectively.Stage 2: Outlier score computationFor each rank feature t, we utilize the Isolation Forest algorithm [11] to calculate an outlier score for each sample. Isolation Forest is an ensemble method that generates binary trees to isolate instances considered as outliers. The core idea behind the Isolation Forest is that anomalous data points are easier to isolate than inlier data points. The algorithm randomly selects a feature and randomly sets a split value between the maximum and minimum values of the selected feature. This process is repeated until the data points are isolated or the maximum number of tree depths is reached. The algorithm then computes an outlier score for each instance based on the average path length from the root node to the outlier node in all trees.The output of the Isolation Forest algorithm is the outlier score matrix \(S=q \times m\). Here, an isolation tree \(T_l\) is constructed for each rank t using H. A total of p number of trees, \(l=1,2,…,p\), are constructed for a single rank t to compute an outlier score for each sample j, which is denoted as \(S_{t,j}\). The computation of an outlier score of \(S_{r,s}\) can be encapsulated as follows:Let’s assume a randomly selected subset of samples \(s’\) is used to construct a tree \(T_l\) for rank t. The nodes of \(T_l\) are samples. Starting from the root, a split point x is randomly selected between the range of \(min(H_t)\) and \(max(H_t)\) to split the samples in tree \(T_l\). The selection of split point x is done for each split w. The splitting continues until each sample is a leaf node or the maximum tree depth d is reached. The maximum tree depth is set to \(d = \lceil \log _2(|s’|) \rceil \) to ensure that all samples can be isolated if they were the only ones in their leaves.$$\begin{aligned} SPLIT_w = {\left\{ \begin{array}{ll} s_{j} \in s’: H_{t,j} < x_w, & \text {if } s_{j} \text { in left child node}\\ s_{j} \in s’: H_{t,j} \ge x_w, & \text {if } s_{j} \text { in right child node}\\ \end{array}\right. } \end{aligned}$$
(1)
If a sample j is frequently located near to the root while other samples are evenly distributed within the tree in all trees, then the distance from sample j to the root against the average distance of other samples to the root will be statistically significant. Such property implies that samples such as j have very different rank values than the other samples and thus is a candidate for being an outlier. The distance of sample j to the root in tree l is defined as \(L_{s_{j}}^{(T_l)}\). Then, the average distance of \(s_{j}\) to the root in all trees is \(A_{t,j}\) and defined as follows:$$\begin{aligned} A_{t,j} = \frac{1}{p} \sum _{l=1}^{p} L_{s_{j}}^{(T_l)} \end{aligned}$$
(2)
The average distance is normalized in respect to the maximum distance from the root to a leaf node, which gives us the final outlier score of a sample j in rank t, \(S_{t,j}\), as below.$$\begin{aligned} S_{t,j} = \frac{A_{t,j} – \min (A)}{\max (A) – \min (A)} \end{aligned}$$
(3)
Here, \(\min (A)\) and \(\max (A)\) represent the minimum and maximum anomaly scores over all the samples, trees and ranks. Since we aim to identify outlier ranks, this normalization ensures that the outlier scores are comparable across the ranks.Stage 3: Outlier identification and gene removalOnce the outlier scores have been calculated, the next step is to identify and remove outlier ranks. Here, the one sample t-test is used to check whether the average outlier score of a sample j for a given rank t significantly deviates from the average outlier score across all rank features in S. For a rank t, the hypothesis and null hypothesis are$$\begin{aligned} H_0: \mu _{S_{t}}= & \mu _{S_{all}} \end{aligned}$$
(4)
$$\begin{aligned} H_a: \mu _{S_{t}}\ne & \mu _{S_{all}} \end{aligned}$$
(5)
where \(\mu _{S_{t}}\) is the average outlier score for the rank t, and \(\mu _{S_{all}}\) is the average outlier score across all ranks. If the p-value is significant (i.e,. < 0.05), we reject the null hypothesis and consider the rank t as an outlier. Once the outlier ranks are identified, we proceed with the gene removal process. Here, each gene in the W is assigned to a single rank with maximum rank value. The primary gene candidates for removal are those that are assigned to the outlier ranks. Given that the genes exhibit a unique expression pattern only within a particular set of samples, their removal could help improve the overall robustness and reliability of the analysis by reducing the gene expression variance within a sample group.

Hot Topics

Related Articles