starTracer is an accelerated approach for precise marker gene identification in single-cell RNA-Seq analysis

Rationale of starTracerSingle-cell sequencing data includes an expression matrix and an annotation matrix (Fig. 7A). The previous strategy for identifying marker genes within a cluster is largely dependent on comparing the expression levels of a gene in a particular cluster with the corresponding expression levels in the remaining clusters4,5. While generally effective, this approach may not always yield markers with superior specificity. Such a circumstance may arise when facing the “dilution” issue: a gene is relatively highly expressed not only in the target cluster but also in one or more additional clusters. Moreover, since this strategy entails a comparison of one cluster versus the remaining clusters for each cluster in question, there is a considerable degree of redundant computation. This redundancy can render the process of identifying marker genes markedly time-consuming.Fig. 7: Schematic of starTracer.A The structure of the expression matrix and annotation matrix of single-cell sequencing. B starTracer provides 2 options: de-novo “searchMarker” and in-conjunction “filterMarker”. “searchMarker” requires a cell annotation matrix and an expression matrix/averaged expression matrix from a single cell experiment or a Seurat object. “searchMarker” performs max-normalize the average expression matrix, calculates the molecular index for each gene passing a threshold set by the user and outputs a matrix with marker genes. C “filterMarker” takes an output matrix form “FindAllMarkers” function, assigns genes into clusters and re-arranges them by measuring the ${T}_{i}$ for each gene. Time elapsed of “searchMarker” is much shorter than that of “FindAllMarkers” and “filterMarker”.starTracer, with its “searchMarker” module, operating as an independent pipeline, receiving a gene expression matrix as input (Fig. 7B). Within starTracer, we employ max-normalization, a commonly used algorithm in machine learning, to extract features from genes while preserving the gene’s expression patterns across clusters. Utilizing a threshold value, starTracer binarily classifies clusters as high and low expression groups for each gene. Subsequently, starTracer calculates the gene’s potential to act as a positive marker in the high-expression group and as a negative marker in the low-expression group. Finally, the gene’s overall potential to serve as a marker gene is determined based on both its positive and negative marker capabilities. Notably, as no comparisons are made during this process, no statistical tests are performed, resulting in heightened efficiency. The output of starTracer includes a list of genes sorted by their potential to function as cluster-specific markers, with genes exhibiting the highest potential positioned at the top of the matrix for each respective cluster.The package starTracer includes an additional functional module “filterMarker”, which is designed as a complementary pipeline to the Seurat’s “FindAllMarkers” function, to automatically remove the redundant results in the output matrix and re-order the matrix according to ${T}_{i}$, representing the proportion of cells expressing gene i that are marked by gene i (Fig. 7C). For notations and meanings, please refer to Table 1.Table 1 Notations and MeaningsGenerating a maximum-scaled average expression matrixThe “searchMarker” module demonstrates great flexibility by accepting a variety of input file types. One essential input file is the expression matrix of the single-cell sequencing data. This can be acquired through three different ways, showcasing the module’s versatility: (i) importing a sparse expression matrix along with an annotation matrix into R; (ii) utilizing the “AverageExpression” function from Seurat for users who already have a Seurat object; and (iii) importing an average expression matrix along with an annotation matrix into R. Anyway, an average expression matrix will be generated (Supplementary Fig. 7 Step 1). This adaptability ensures seamless integration with various data processing workflows.After generating the average expression matrix, the mean expression value of gene i from the average expression matrix will be calculated, which is denoted as ${E}_{1}^{i}$ (Supplementary Fig. 1 Step 2):$${E}_{1}^{i}=\frac{{\sum }_{j=1}^{K}{x}_{{ij}}}{K}.$$${E}_{1}^{i}$ represents the mean expression of gene i in K clusters. ${x}_{{ij}}$ represents the average expression value of gene i in cluster j.In order to accentuate the differences in gene expression across samples, whilst preserving the inherent expression characteristics of each gene, such as the maximum and minimum expression values and zero values, we compute the maximum-scaled average expression matrix38 (hereafter referred to as the max-normalized matrix) for each gene (Supplementary Fig. 1 Step 3):$\widetilde{{x}_{{ij}}}=\frac{{x}_{{ij}}}{\max \left(\left\{{x}_{i1},{x}_{i2}\ldots ,{x}_{{iK}}\right\}\right)}.$$K$ represents the total number of clusters.$\widetilde{{x}_{{ij}}}$ represents the max-normalized average expression value of gene i in cluster j.By normalizing the average expression matrix, the highest expression value for each gene is scaled to 1:$$\max \{\widetilde{{x}_{i1}},\widetilde{{x}_{i2}},\ldots ,\widetilde{{x}_{{iK}}}\}=1.$$This facilitates the identification of the cluster with the maximum expression for each gene. Furthermore, this normalization method allows for the comparison of normalized results across genes within each cluster.Assigning genes to clusters and generating sub-matrixesFor the assignment of genes to their potential target clusters, we employ a selection strategy where a gene is assigned to the cluster only if it displays the highest average expression value in this particular cluster. This assumption is based on the premise that a gene may only have the potential to be a marker in a cluster where$\,\widetilde{{x}_{{ij}}}=1$. From the aspect of each cluster, the genes with $\widetilde{{x}_{{ij}}}=1$ will be retained and be regarded as the potential markers (Supplementary Fig. 1 Step 3). Thus, we define an index set ${M}_{j}\subset \{{{\mathrm{1,2}}},\ldots ,{L}\}$ for cluster j:$${M}_{j}=\{{i}:\widetilde{{x}_{{ij}}}=1,1\le {i}\le L\},$$where Mj is the set for cluster j, which indicates the rows from the original matrix that equals to 1, representing the maximum expressing values in cluster j.Let the original max-normalized matrix as V and$$V\subset {R}^{L\times K},$$we further divide V to sub-matrixes ${V}_{j}$ for each cluster j according to Mj:$${V}_{j}={\left({v}_{m,n}\right)}_{m\in {M}_{j},n\in \{1,2,\ldots ,K\}}.$$The rows of the sub-matrix ${V}_{j}$ is from the set Mj, while the columns are composed of the clusters from 1 to K. By doing so, the algorithm could successfully assign genes to clusters and avoid redundant calculations. We assume that there are ${l}_{j}$ genes in sub-matrix ${V}_{j}$, where ${l}_{j}=|{M}_{j}|$.Excluding low-expression genesIn the subset-matrix ${V}_{i}$, genes will be re-arranged in descending order according to ${E}_{1}^{i}$ (Supplementary Fig. 1 Step 5). In the majority of scenarios, genes with rather low expression levels may not considered optimal candidates for marker genes. To streamline computational efficiency by excluding these low-expression genes, we offer an optional threshold statistic, denoted as ${S}_{2}$, which ranges from 0 to 1. The statistic can be employed to filter out the lowest ${S}_{2}$ percentile of genes for each cluster after re-arranging according to their ${E}_{1}^{i}$ (e.g. Setting ${S}_{2}$ to 0.1 will filter out the genes with the lowest 10% of ${E}_{1}^{i}$) (Supplementary Fig. 1 Step 6). After filtering, ${{S}_{2}} * {l}_{j}$ genes will be filtered out while $({1-S}_{2})\,* \,{l}_{j}$ genes will be retained for further analysis.Binary classification of clusters and the calculation of
${{{{\boldsymbol{E}}}}}_{{{{\boldsymbol{2}}}}}^{{{{\boldsymbol{i}}}}}$
and
${{{{\boldsymbol{E}}}}}_{{{{\boldsymbol{3}}}}}^{{{{\boldsymbol{i}}}}}$
to avoid dilution issue
For each gene, researchers can set a parameter ${S}_{1}$, which defaults to 0.5 (Supplementary Fig. 2A) to segregate clusters into high and low expression groups.For gene i, it is presumed to serve as a positive marker in cluster j where ${\widetilde{x}}_{{ij}}$ exceeds ${S}_{1}$(high expression group), and a negative marker where $\widetilde{{x}_{{ij}}}$ is no bigger than ${S}_{1}$(low expression group).Thus, for gene i, there would be two vectors, one is the vector including clusters that gene i has the potential to be a positive marker (Supplementary Fig. 1 Step 7):$${q}_{i}=\{j:\widetilde{{x}_{{ij}}}\le {S}_{1}\},$$and the one including the clusters that gene i has the potential to be a negative marker:$${p}_{i}=\{j:\widetilde{{x}_{{ij}}} \, > \, {S}_{1}\},$$Thus, we can have two vectors for each row of Vj:$${u}_{i,j}=\left\{{v}_{m,n}{{{|}}}m={i},{n}\in {{p}}_{{i}}\right\}$$And$${w}_{j,i}=\left({v}_{m,n}{{{|}}}m={i},n\in {q}_{i}\right).$$We then, respectively, define:$${E}_{2}^{i}=\frac{{\sum }_{j=1}^{n}\widetilde{{x}_{{ij}}}}{\left|{p}_{i}\right|},$$and$${E}_{2}^{i}=\frac{{\sum }_{j=1}^{\left|{p}_{i}\right|}\widetilde{{x}_{{ij}}}}{\left|{p}_{i}\right|}.$$Here, ${E}_{2}^{i}$ represents the mean of $\widetilde{{x}_{{ij}}}$ values for these clusters from ${p}_{i}$, and ${E}_{3}^{i}$ represents the mean of $\widetilde{{x}_{{ij}}}$ values from ${q}_{i}$.In a scenario where ‘n’ clusters have max-normalized values greater than ${S}_{1}$. According to previous conclusions, it is certain that $\max \{\widetilde{{x}_{i1}},\widetilde{{x}_{i2}},\ldots ,\widetilde{{x}_{{iK}}}\}=1$. Indeed, it is observable that ${E}_{2}^{i}$ and ${E}_{3}^{i}$—which represent the average expression levels of a gene in clusters with values exceeding ${S}_{1}$ and not exceeding ${S}_{1}$, respectively—are each negatively correlated with ${\rho }_{i}$ and ${\eta }_{i}$, respectively (Supplementary Fig. 2B, C):$$\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} \, > \, 0,$$$$\frac{\partial {E}_{3}^{i}}{\partial {\eta }_{i}} \, < \, 0,$$where ${E}_{2}^{i}$ and ${E}_{3}^{i}$ for each gene will be compiled into a matrix for further analyses.Evaluating genes’ potential as positive and negative markers by ${{{\boldsymbol{\rho }}}}$ and ${{{\boldsymbol{\eta }}}}$
While marker genes are typically identified by their up-regulation in specific clusters, it is also crucial for an ideal marker gene to exhibit low expression in the remaining clusters at the same time, thereby ensuring its specificity (Supplementary Fig. 2A). We employ$$\rho \in R,0\le \rho \le 1,$$and$$\eta \in R,0\le \eta \le 1$$to quantify the potential of a gene serving as a positive or negative marker.We further define that an ideal marker gene should simultaneously exhibit high values for both $\rho$ and $\eta$. Then we should notice that as ${E}_{2}$ increases, the potential of a gene to be a marker gene will decrease (Supplementary Fig. 2B), while as ${E}_{3}$ increases, the potential of a gene to be a negative marker gene will also decrease (Supplementary Fig. 2C). So, for each gene i$$\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} \, < \, 0,$$$$\frac{\partial {E}_{3}^{i}}{\partial {\rho }_{i}} \, < \, 0.$$Measuring gene’s capability to be positive and negative markers with molecular indexTo quantify the potential of a gene to serve as a marker, we devised a novel metric termed the molecular index (MI). MI is defined as the subtraction of the positive molecular index (PMI) and the negative molecular index (NMI), which takes ${E}_{2}^{i}$ and ${E}_{3}^{i}$ into account, thereby offering a comprehensive measure of a gene’s marker potential (Supplementary Fig. 2D). The equations are as follows:$${{\rm {PM}{I}}}_{i}=\frac{1-{E}_{2}^{i}}{1-{S}_{1}},$$$${{\rm {NM}{I}}}_{i}=\frac{{E}_{3}^{i}}{{S}_{1}},$$$${\rm {M{I}}}_{i}={{\rm {PM}{I}}}_{i}-{{\rm {NM}{I}}}_{i}.$$In these equations, 1−${E}_{2}^{i}$ represents the range of the subtraction of actual ${E}_{2}^{i}$ and the maximum ${E}_{2}^{i}$, which equals to 1. Note that the range of this variable is influenced by $1-{S}_{1}$:$${1-E}_{2}^{i}\in \left[0,\frac{\left(1-{S}_{1}\right)\left({n}_{i}-1\right)}{{{{|}}}{p}_{i}{{{|}}}}\right).$$In contrast, ${S}_{1}$ represents the range of ${E}_{3}^{i}$ for samples with values less than ${S}_{1}$, which varies from 0 to ${S}_{1}$. The range of ${E}_{3}^{i}$ is also influenced by ${S}_{1}$:$${E}_{3}^{i}\in \left[0,{S}_{1}\right].$$To mitigate the potential impact of the range of ${1-E}_{2}^{i}$ and ${E}_{3}^{i}$ on the magnitude of these values when considering a given n, we normalize $({1-E}_{2}^{i})$ and ${E}_{3}^{i}$ by dividing them by (1−${S}_{1}$) and ${S}_{1}$, respectively. It is worth noting that PMI and NMI are positively and negatively correlated with ${\rho }_{i}$ and ${\eta }_{i}$, respectively (Supplementary Fig. 2E, F). MI is then defined as the difference between PMI and NMI. As such, we introduce PMI, NMI, and MI as our metrics.Assessing marker gene potential with ${{{\bf{M}}}}{{{{\bf{I}}}}}_{{{{\boldsymbol{i}}}}}$
Considering that MI is defined as the difference between PMIi and NMIi, it will positively correlate with $\rho$ due to the following relationships:$\frac{\partial {{{\rm {MI}}}}_{i}}{\partial {E}_{2}^{i}}=\,-\frac{1}{1-{S}_{1}} < \, 0$ and $\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} < \, 0$, therefore,$$\frac{\partial {\rm {M{I}}}_{i}}{\partial {\rho }_{i}}=\frac{\partial {\rm {M{I}}}_{i}}{\partial {E}_{2}^{i}}\,\frac{\partial {E}_{2}^{i}}{\partial {\rho }_{i}} \, > \, 0.$$This inequality shows that ${\rm {M{I}}}_{i}$ positively correlates with ${\rho }_{i}$, reflecting the potential of a gene to be a positive marker gene.Similarly, ${\eta }_{i}$ positively correlates with NMIi as indicated by the following inequality, where $\frac{\partial {\rm {M{I}}}_{i}}{\partial {E}_{3}^{i}}=\,-\frac{1}{{S}_{1}} < \, 0$ and $\frac{\partial {E}_{3}^{i}}{\partial {\rho }_{i}} < \, 0$, therefore:$$\frac{\partial {\rm {M{I}}}_{i}}{\partial {\eta }_{i}}=\frac{\partial {\rm {M{I}}}_{i}}{\partial {E}_{3}^{i}}\cdot \frac{\partial {E}_{3}^{i}}{\partial {\eta }_{i}} \, > \, 0.$$This demonstrates that ${\rm {M{I}}}_{i}$ positively correlates with ${\eta }_{i}$, indicating the potential of a gene to be a negative marker gene. Thus, the potential of a gene to be a marker for a cluster can be evaluated using ${\rm {M{I}}}_{i}$.Reordering genes in each cluster by n and ${\rm {M{I}}}_{i}$ to find optimal marker genesFollowing the calculation of the aforementioned statistics for gene i, including ${E}_{1}^{i}$, ${E}_{2}^{i}$, ${E}_{3}^{i}$, ${{{\rm {PMI}}}}_{i}$, ${{{\rm {NMI}}}}_{i}$ and ${{{\rm {MI}}}}_{i}$, it is worth noting that these calculations have been performed in the context of a given n. Genes with lower n values should have higher potential to serve as marker genes as there are fewer clusters passing the threshold ${S}_{1}$. Therefore, we rearrange the matrix based on the following principle:$$\left\{\begin{array}{c}{{\mbox{ascending}}}(|{p}_{i}|)\\ {{\mbox{decending}}}({{{\rm{M{I}}}}}_{i})\end{array}\right..$$The resulting matrix will be saved as an output of the “searchMarker” module.Measuring specificity with T
i
To measure the specificity level of potential marker genes, we introduce a statistic denoted as ${T}_{i}$ (Supplementary Fig. 2G)$${T}_{i}=\frac{\left|{{G}}_{i}\cap {G}_{{\mbox{clust}}}\right|}{\left|{G}\right|}.$$where${G}_{i}$ represents the set of the cells expressing gene i in-silico.${G}_{{{\rm {clust}}}}$ represents the set of cells where gene i serves as the in-silico marker gene.In the context of using marker gene i to label cells from a sample in-vivo/vitro, ${T}_{i}$ could be utilized to assess the extent to which cells genuinely belong to the cluster defined by the marker gene i, and can, therefore, be utilized to assess the specificity level of marker gene i.Rationale for selecting negative markersTo find negative markers, the overall process is similar to selecting positive markers except for the following adjustments: (1) we convert the original max-normalized matrix to a new matrix where the expression value of gene i in cluster j is replaced by $1-\widetilde{{x}_{{ij}}}$. In the following calculation, all the parameters will be calculated based on the new matrix. (2) An additional threshold ${S}_{3}$ is addressed during step 6, where the top ${S}_{3}$ (percentage, ${S}_{3}\in \left[{{\mathrm{0,1}}}\right)$) genes will be filtered out. (3) In step 8, a new PMIi based on the converted max-normalized matrix will be used to arrange the negative markers:$$\left\{\begin{array}{c}{{\mbox{ascending}}}(|{p}_{i}|)\\ \\ {{\mbox{decending}}}({\rm {PM}{I}}_{i})\end{array}\right..$$Rational of “filterMarker”“FilterMarker” is a useful tool for users who already have results from the Seurat “FindAllMarkers” function and want to refine their results. It provides an algorithm to re-sort the matrix and offer a more accurate list of marker genes for each cluster. Working in conjunction with Seurat, “FilterMarker” re-arranges the output matrix from Seurat to complement the “FindAllMarkers” function (Fig. 7C).Users can leverage this function by using “filterMarker” to sort the matrix from Seurat. The function takes the output matrix from “FindAllMarkers” as input and calculates ${T}_{i}$ based on the number of cells in each cluster and the values of “pct.1” and “pct.2” provided by the Seurat matrix. Then, genes are assigned to each cluster according to the cluster with the highest expression level, as measured by the average fold-change in log2 scale (“avg_log2FC”). The re-arrangement follows the principle of a descending ${T}_{i}$ for each cluster.Processing benchmark and testing dataProcessing single-cell/nuclear sequencing dataSequencing data and metadata for the human heart8 and mouse kidney9 were downloaded from the Single Cell Portal (SCP1303, SCP1245). Prefrontal cortex (PFC) data was obtained from the GEO database39, with the accession ID GSE1684087. R (v4.1.3) is used for the rest of the analysis. We created objects with Seurat (v4.3.0). Single-cell experiment data was normalized and scaled. We identified 3000 highly variable genes (HVG) and performed principal component analysis (PCA). The accumulated standard deviation of each principal component was calculated, and the principal component with an accumulated standard deviation >90% and a standard deviation <5% was recorded as n1. The subtraction of the standard deviation between each neighboring principal component was calculated, and the principal component with a subtraction of the standard deviation <0.1% was recorded as n2. The dimensions from the 1st to 1 plus the minimum of n1 and n2 were used for further analyses40. We performed uniform manifold approximation and projection (UMAP) with uwot umap for visualization.Benchmarking and parameter evaluationWe evaluated the running time and specificity level using three independent datasets and a series of 10 samples with a linear increase in cell population from around 50,000 to around 500,000. Tests were performed at different annotation levels. and the influence of ${S}_{2}$ on expression and specificity was evaluated. The specificity level improvements achieved by “filterMarker” were also evaluated. For manually generated rare cell types, we utilized the data of human PFC data and randomly generated 10 subsets of the inhibitory neurons.Simulating single-cell sequencing data and false discovery rateSplatter (version 3.18) package was used to simulate and generate a ground-truth single-cell RNA-Seq dataset. The simulation parameters were set as follows: de.faLoc = 3; de.facScale = 0.2; group.prob = c(0.2, 0.2, 0.2, 0.2, 0.2). To identify ground-truth topN marker genes, we employed a method outlined in a previously published benchmarking study to systematically select ground-truth marker genes20. Let i, ranging from 1 to L, denote the gene index, and k, ranging from 1 to K, denote the group index. The variable βij represents the differential expression (DE) indicator for gene i in group j within the splat model. Here, βij quantifies the differential expression level of gene i in group j compared to a baseline expression level. If βij equals 1, it suggests that gene i exhibits baseline expression in group j. To determine if a gene is a significant marker for a specific cluster, we use the score mi:$${m}_{i}=\frac{1}{K-1} \mathop {\sum }\limits_{r=1,r\ne i}\log \left(\frac{{{{{\rm{\beta }}}}}_{ij}}{{{{{\rm{\beta }}}}}_{ir}}\right).$$For a given cluster indexed as i, this score helps evaluate the marker potential of each gene.Steps for selecting simulated ground truth marker genes for a specific cluster from the simulated data consist of:1. Selecting genes that have simulated mean expression ≥0.1;2. Calculate the marker gene score mi for each gene;3. Rank genes by mi;4. Select the top n genes as marker genes;A gene was classified as a true positive if it appeared in both the selected and ground-truth lists, a false positive if it was only in the selected list, and a false negative if it was only in the ground-truth list. The false positive rate (FPR) was calculated as follows: FPR = false positives/(true positives + false positives).Data included for testing and benchmarkingWe utilized a series of single-cell/nucleus RNA-sequencing datasets to assess the performance of starTracer. We focused on the ability to efficiently identify high-specificity markers compared to the “FindAllMarkers” from Seurat. We included three sets of single-cell/nuclear RNA-sequencing data from different species and organs from Gene Expression Ominibus39 and Single Cell Portal (https://singlecell.broadinstitute.org/single_cell). The selection of data sets ensured the inclusion of diverse species, organs, varying sample sizes, and the utilization of both scRNA-seq and snRNA-seq techniques. We included objects created with Seurat4, annotations are provided by the authors. We conducted the basic analyses, including finding highly variable genes, normalizing the data, scaling the data, and running PCA and UMAP with each of the Seurat Object (refer to the “Methods” section for details).Statistics and reproducibilityStatistical analysis without indications was analyzed by t-test. p values < 0.05 were regarded as statistically significant. (Data graphics and statistical analysis were performed using R.) No representative results have been selected from the repeated experiments. Computational repeated experiments have been conducted in the same environment and hard wares.

starTracer is an accelerated approach for precise marker gene identification in single-cell RNA-Seq analysis

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Hot Topics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Popular Articles

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis