Robust identification of perturbed cell types in single-cell RNA-seq data

NormalizationOur method takes as input a normalized count matrix (with corresponding cell type annotations). We recommend using scTransform13 to normalize, although the method is compatible with any normalization approach. Let yijg be the UMI counts for gene 1 ≤ g ≤ G in cell i from sample j. scTransform fits the following model:$${y}_{ijg} \sim {{\rm{NB}}}({\mu }_{g},{\alpha }_{g})$$
(3)
$$\log {\mu }_{g}={\beta }_{0g}+{\beta }_{1g}\log {r}_{ij}$$
(4)
where rij is the total number of UMI counts for the particular cell. The normalized counts are given by the Pearson residuals of the above model:$${z}_{ijg}=\frac{{y}_{ijg}-{\hat{\mu }}_{g}}{\sqrt{{\hat{\mu }}_{g}+{\hat{\mu }}_{g}^{2}/{\hat{\alpha }}_{g}}}$$
(5)
Distance in normalized expression spaceIn this section, we describe the inferential procedure of scDist for cases without additional covariates. However, the procedure can be generalized to the full model (18) with arbitrary covariates (design matrix) incorporating random and fixed effects, as well as nested-effect mixed models. For a given cell type, we model the G-dimensional vector of normalized counts as$${\boldsymbol{z}}_{ij}={\boldsymbol{\alpha}}+{x}_{ij}\,{\boldsymbol{\beta}}+{\boldsymbol{\omega }}_{j}+{\boldsymbol{\varepsilon }}_{{ij}}$$
(6)
where ${\boldsymbol{\alpha}},{\boldsymbol{\beta}} \in {{\mathbb{R}}}^{G}$, xij is a binary indicator of condition, ${\boldsymbol{\omega }}_{j} \sim {{\mathcal{N}}}(0,{\tau }^{2}{I}_{G})$, and ${\boldsymbol{\varepsilon }}_{ij} \sim {{\mathcal{N}}}(0,{\sigma }^{2}{I}_{G})$. The quantity of interest is the Euclidean distance between condition means α and α + β:$$D:=\sqrt{{\boldsymbol{\beta }}^{T}{\boldsymbol{\beta}} }=\left| \left| {\boldsymbol{\beta}} \right| \right| _{2}$$
(7)
If $U\in {{\mathbb{R}}}^{G\times G}$ is an orthonormal matrix, we can apply U to equation (6) to obtain the transformed model:$$U{\boldsymbol{z}}_{ij}=U{\boldsymbol{\alpha}}+{x}_{ij}U{\boldsymbol{\beta}}+U{\boldsymbol{\omega} }_{j}+U{\boldsymbol{\varepsilon} }_{ij}$$
(8)
Since U is orthogonal, Uωj and Uεij still have spherical normal distributions. We also have that$${(U{\boldsymbol{\beta}} )}^{T}(U{\boldsymbol{\beta}} )={{\boldsymbol{\beta}} }^{T}{\boldsymbol{\beta}}={D}^{2}$$
(9)
This means that the distance in the transformed model is the same as in the original model. As mentioned earlier, our goal is to find U such that$${D}_{K}:=\sqrt{{\sum} _{k=1}^{K}{(U{\boldsymbol{\beta}} )}_{k}^{2}} \, \approx \, D$$
(10)
with K ≪ G.Let $Z\in {{\mathbb{R}}}^{n\times G}$ be the matrix with rows zij (where n is the total number of cells). Intuitively, we want to choose a U such that the projection of zij onto the first K rows of U (${u}_{1},\ldots,{u}_{K}\in {{\mathbb{R}}}^{G}$) minimizes the reconstruction error$$\sum _{i=1}^{n}| | {z}_{i}-(\mu+{v}_{i1}{u}_{1}+\cdots+{v}_{iK}{u}_{K})| {| }_{2}^{2}$$
(11)
where $\mu \in {{\mathbb{R}}}^{G}$ is a shift vector and $({v}_{ik})\in {{\mathbb{R}}}^{n\times K}$ is a matrix of coefficients. It can be shown that the PCA of Z yields the (orthornormal) u1, …, uK that minimizes this reconstruction error26.InferenceGiven an estimator $\widehat{{(U{\boldsymbol{\beta}} )}_{k}}$ of (Uβ)k, a naive estimator of DK is given by taking the square root of the sum of squared estimates:$$\sqrt{{\sum} _{k=1}^{K}{\widehat{{(U{\boldsymbol{\beta}} )}_{k}}}^{2}}.$$
(12)
However, this estimator can have significant upward bias due to sampling variability. For instance, even if the true distance is 0, $\widehat{{(U\beta )}_{k}}$ is unlikely to be exactly zero, and that noise becomes strictly positive when squaring.To account for this, we apply a post-hoc Bayesian procedure to the ${\widehat{U\beta }}_{k}$ to shrink them towards zero before computing the sum of squares. In particular, we adopt the spike slab model of14$$\widehat{{(U{\boldsymbol{\beta}} )}_{k}} \sim {{\mathcal{N}}}\left({(U{\boldsymbol{\beta}} )}_{k},{{\rm{Var}}}\left[\widehat{{(U{\boldsymbol{\beta}} )}_{k}}\right]\right)$$
(13)
$${(U{\boldsymbol{\beta}} )}_{k} \sim {\pi }_{0}{\delta }_{0}+\sum _{t=1}^{T}{\pi }_{t}{{\mathcal{N}}}(0,{\tau }_{t})$$
(14)
where ${{\rm{Var}}}[\widehat{{(U{\boldsymbol{\beta}} )}_{k}}]$ is the variance of the estimator $\widehat{{(U{\boldsymbol{\beta}} )}_{k}}$, δ0 is a point mass at 0, and π0, π1, …πT are mixing weights (that is, they are non-negative and sum to 1).14 provides a fast empirical Bayes approach to estimate the mixing weights and obtain posterior samples of (Uβ)k. Then samples from the posterior of DK are obtained by applying the formula (12) to the posterior samples of (Uβ)k. We then summarize the posterior distribution by reporting the median and other quantiles. Advantage of this particular specification is that the amount of shrinkage depends on the uncertainty in the initial estimate of (Uβ)k.We use the following procedure to obtain ${\widehat{U\beta }}_{k}$:

Use the matrix of PCA loadings as a plug in estimator for U. Then Uzij is the vector of PC scores for cell i in sample j.

Estimate (Uβ)k by using lme427 to fit the model (6) using the PC scores corresponding to the k-th loading (i.e., each dimension is fit independently).

Note that only the first K rows of U need to be stored.We are particularly interested in testing the null hypothesis of DK = 0 against the alternative Dd > 0. Because the null hypothesis corresponds to (Uβ)k = 0 for all 1 ≤ k ≤ d, we can use the sum of individual Wald statistics as our test statistic:$$W=\sum _{k=1}^{K}{W}_{k}=\sum _{k=1}^{K}{\left(\frac{{\widehat{(U{\boldsymbol{\beta}} )}}_{k}}{\widehat{{{\rm{se}}}}\left[{\widehat{(U{\boldsymbol{\beta}} )}}_{k}\right]}\right)}^{2}$$
(15)
Under the null hypothesis that (Uβ)k = 0, Wk can be approximated by a ${F}_{{\nu }_{k},1}$ distribution. νk is estimated using Satterthwaite’s approximation in lmerTest. This implies that$$W \sim \sum _{k=1}^{K}{F}_{{\nu }_{k},1}$$
(16)
under the null. Moreover, the Wk are independent because we have assumed that covariance matrices for the sample and cell-level noise are multiples of the identity. Equation (16) is not a known distribution but quantiles can be approximated using Monte Carlo samples. To make this precise, let W1, …, WM be draws from equation (16), where M = 105 and let W* be the value of equation (15) (i.e., the actual test statistic). Then the empirical p-value28 is computed as$$\frac{{\sum }_{i=1}^{M}I({W}_{i} \, > \, {W}^{*})+1}{M+1}$$
(17)
Controlling for additional covariatesBecause scDist is based on a linear model, it is straightforward to control for additional covariates such as age or sex of a patient in the analysis. In particular, model (18) can be replaced with$${\boldsymbol{z}}_{ij}={\boldsymbol{\alpha}}+{x}_{j}{\boldsymbol{\beta}}+\mathop{\sum }_{k=1}^{p}{w}_{ijk}{\boldsymbol{\gamma }}_{k}+{\boldsymbol{\omega }}_{j}+{\boldsymbol{\varepsilon }}_{ij}$$
(18)
where ${w}_{ijk}\in {\mathbb{R}}$ is the value of the kth covariate for cell i in sample j and ${\boldsymbol{\gamma }}_{k}\in {{\mathbb{R}}}^{G}$ is the corresponding gene-specific effect corresponding to the kth covariate.Choosing the number of principal componentsAn important choice in scDist is the number of principal components d. If d is chosen too small, then estimation accuracy may suffer as the first few PCs may not capture enough of the distance. On the other hand, if d is chosen too large then the power may suffer as a majority of the PCs will simply be capturing random noise (and adding to degrees of freedom to the Wald statistic). Moreover, it is important that d is chosen a priori, as choosing the d that produces the lowest p values is akin to p-hacking.If the model is correctly specified then it is reasonable to choose d = J − 1, where J is the number of samples (or patients). To see why, notice that the mean expression in sample 1 ≤ j ≤ J is$${x}_{\cdot j}{\boldsymbol{\beta}}+{\omega }_{j}\in {{\mathbb{R}}}^{G}$$
(19)
In particular, the J sample means lie on a (J − 1)-dimensional subspace in ${{\mathbb{R}}}^{G}$. Under the assumption that the condition difference and sample-level variability is larger than the error variance σ2, we should expect that the first J − 1 PC vectors capture all of the variance due to differences in sample means.In practice, however, the model can not be expected to be correctly specified. For this reason, we find that d = 20 is a reasonable choice when the number of samples is small (as is usually the case in scRNA-seq) and d = 50 for datasets with a large number of samples. This is line with other single-cell methods, where the number of PCs retained is usually between 20 and 50.Cell type annotation and “double dipping”scDist takes as input an annotated list of cells. A common approach to annotate cells is to cluster based on gene expression. Since scDist also uses the gene expression data to measure the condition difference there are concerns associated with “double-dipping” or using the data twice. In particular, if the condition difference is very large and all of the data is used to cluster it is possible that the cells in the two conditions would be assigned to different clusters. In this case scDist would be unable to estimate the inter-condition distance, leading to a false negative. In other words, the issue of double dipping could cause scDist to be more conservative. Note that the opposite problem occurs when performing differential expression between two estimated clusters; in this case, the p-values corresponding to genes will be anti-conservative29.To illustrate, we simulated a normalized count matrix with 4000 cells and 1000 genes in such a way that there are two “true” cell types and a true condition distance of 4 for both cell types (Fig. S23a). To cluster (annotate) the cells, we applied k-means with various choices of k and compared results by taking the median inter-condition distance across all clusters. As the number of clusters increases, the median distance decays towards 0, which demonstrates that scDist can produce false negatives when the data is over-clustered (Fig. S23b). To avoid this issue, one possible approach is to begin by clustering the data for only one condition and then to assign cells in the other condition by finding the nearest centroid in the existing clusters. When applied to the simulated data this approach is able to correctly estimate the condition distance even when the number of clusters k is larger than the true value.On real data, one approach to identify possible over-clustering is to apply scDist at various cluster resolutions. We used the expression data from the small COVID-19 data1 to construct a tree ${{\mathcal{T}}}$ with leaf nodes corresponding to the cell types in the original annotation provided by the authors (Fig. S24, see Appendix A for a description of how the tree is estimated). At each internal node $v\in {{\mathcal{T}}}$, we applied scDist to the cluster containing all children of v. We can then visualize the estimated distances by plotting the tree (Fig. S24). Situations where the child nodes have a small distance but the parent node has a large distance could be indicative of over-clustering. For example, PB cells are almost exclusiviely found in cases (1977 cells in cases and 86 cells in controls), suggesting that it is reasonable to consider PB and B cells as a single-cell type when applying scDist.Feature importanceTo better understand the genes that drive the observed difference in the CD14+ monocytes, we define a gene importance score. For 1 ≤ k ≤ d and 1 ≤ g ≤ G, the k-th importance score for gene g is ∣Ukg∣βg. In other words, the importance score is the absolute value of the gene’s k-th PC loading times its expression difference between the two conditions. Note that the gene importance score is 0 if and only if βg = 0 or Ukg = 0. Since the Ukg are fixed and known, significance can be assigned to the gene importance score using the differential expression method used to estimate βg.Simulated single-cell dataWe test the method on data generated from model equation (6). To ensure that the “true” distance is D, we use the R package uniformly30 to draw β from the surface of the sphere of radius D in ${{\mathbb{R}}}^{G}$. The data in Figs. 1C and 3C are obtained by setting β = 0 and σ2 = 1 and varying τ2 between 0 and 1.Weighted distanceBy default, scDist uses the Euclidean distance D which treats each gene equally. In cases where a priori information is available about the relevance of each gene, scDist provides the option to estimate a weighted distance Dw, where $w\in {{\mathbb{R}}}^{G}$ has non-negative components and$${D}_{w}=\sum _{g=1}^{G}{w}_{g}{\beta }_{g}^{2}$$
(20)
The weighted distance can be written in matrix form by letting $W\in {{\mathbb{R}}}^{G\times G}$ be a diagonal matrix with Wgg = wg, so that$${D}_{w}={\boldsymbol{\beta }}^{\top }W{\boldsymbol{\beta}}$$
(21)
Thus, the weighted distance can be estimated by instead considered the transformed model where $U\sqrt{W}$ is applied to each zij. After this different transformed model is obtained, estimation and inference of Dw proceeds in exactly the same way as the unweighted case.To test the accuracy of the weighted distance estimate, we considered a simulation where each gene had only a 10% chance of having βg ≠ 0 (otherwise ${\beta }_{g} \sim {{\mathcal{N}}}(0,1)$). We then considered three scenarios: wg = 1 if βg ≠ 0 and wg = 0 otherwise (correct weighting), wg = 1 for all g (unweighted), and wg = 1 randomly with probability 0.1 (incorrect weights). We then quantified the performance by taking the absolute value of the error between ${\sum }_{g}{\beta }_{g}^{2}$ and the estimated distance. Figure S3 shows that correct weighting slightly outperforms unweighted scDist but random weights are significantly worse. Thus, the unweighted version of scDist should be preferred unless strong a priori information is available.Robustness to model misspecificationThe scDist model assumes that the cell-specific variance σ2 and sample-specific variance τ2 are shared across genes. The purpose of this assumption is to ensure that the noise in the transformed model follows a spherical normal distribution. Violations of this assumption could lead to miscalibrated standard errors and hypothesis tests but should not effect estimation. To demonstrate this, we considered simulated data where each gene has σg ~ Gamma(r, r) and τg ~ Gamma(r/2, r). As r varies, the quality of the distance estimates does not change significantly (Fig. S26).Semi-simulated COVID-19 dataCOVID-19 patient data for the analysis was obtained from ref. 17, containing 1.4 million cells of 64 types from 284 PBMC samples collected from 196 individuals, including 171 COVID-19 patients and 25 healthy donors.Ground truthWe define the ground truth as the cell-type specific transcriptomic differences between the 171 COVID-19 patients and the 25 healthy controls. Specifically, we used the following approach to define a ground truth distance:

For each gene g, we computed the log fold changes Lg between COVID-19 cases and controls, with Lg = Eg(Covid) − Eg(Control), where Eg denotes the log-transformed expression data $\log (1+x)$.

The ground truth distance is then defined as $D={\sum }_{g}{L}_{g}^{2}$.

Subsequently, we excluded any cell types not present in more than 10% of the samples from further analysis. For true negative cell types, we identified the top 5 with the smallest fold change and a representation of over 20,000 cells within the entire dataset. When attempting similar filtering based on cell count alone, no cell types demonstrated a sufficiently large true distance. Consequently, we chose the top four cell types with over 5000 cells as our true positives Fig. S11.Using the ground truth, we performed two separate simulation analyses:

1: Simulation analyses I (Fig. 5A, B): Using one half of the dataset (712621 cells, 132 case samples, 20 control samples), we created 100 subsamples consisting of 5 cases and 5 controls. For each subsample, we applied both scDist and Augur to estimate perturbation/distance between cases and controls for each cell type. Then we computed the correlation between the ground truth ranking (ordering cells by sum of log fold changes on the whole dataset) and the ranking obtained by both methods. For scDist, we restricted to cell types that had a non-zero distance estimate in each subsample, and for Augur we restricted to cell types that had an AUC greater than 0.5 (Fig. 5A). For Fig. 5B, we took the mean estimated distance across subsamples for which the given cell type had a non-zero distance estimate. This is because in some subsamples a given cell type could be completely absent.

2: Simulation analyses II (Fig. 5C–F): We subsampled the COVID-19 cohort with 284 samples (284 PBMC samples from 196 individuals: 171 with COVID-19 infection and 25 healthy controls) to create 1,000 downsampled cohorts, each containing samples from 10 individuals (5 with COVID-19 and 5 healthy controls). We randomly selected each sample from the downsampled cohort, further downsampled the number of cells for each cell type, and selected them from the original COVID-19 cohort. This downsampling procedure increases both cohort variability and cell-number variations.

Performance Evaluation in Subsampled Cohorts : We applied scDist and Augur to each subsampled cohort, comparing the results for true positive and false positive cell types. We partitioned the sampled cohorts into 10 groups based on cell-number variation, defined as the number of cells in a sample with the highest number of cells for false-negative cell types divided by the average number of cells in cell types. This procedure highlights the vulnerability of computational methods to cell number variation, particularly in negative cell types.

Analysis of immunotherapy cohortsData collectionWe obtained single-cell data from four cohorts2,23,24,25, including expression counts and patient response information.Pre-processingTo ensure uniform processing and annotation across the four scRNA cohorts, we analyzed CD45+ cells (removing CD45− cells) in each cohort and annotated cells using Azimuth31 with reference provided for CD45+ cells.Model to account for cohort and sample varianceTo account for cohort-specific and sample-specific batch effects, scDist modeled the normalized gene expression as:$$Z \sim X+(1| \gamma :\omega )$$
(22)
Here, Z represents the normalized count matrix, X denotes the binary indicator of condition (responder = 1, non-responder = 0); γ and ω are cohort and sample-level random effects, and (1∣γ: ω) models nested effects of samples within cohorts. The inference procedure for distance, its variance, and significance for the model with multiple cohorts is analogous to the single-cohort model.SignatureWe estimated the signature in the NK-2 cell type using differential expression between responders and non-responders. To account for cohort-specific and patient-specific effects in differential expression estimation, we employed a linear mixed model described above for estimating distances, performing inference for each gene separately. The coefficient of X inferred from the linear mixed models was used as the estimate of differential expression:$$Z \sim X+(1| \gamma :\omega )$$
(23)
Here, Z represents the normalized count matrix, X denotes the binary indicator of condition (responder = 1, non-responder = 0); γ and ω are cohort and sample-level random effects, and (1∣γ: ω) models nested effects of samples within cohorts.Bulk RNA-seq cohortsWe obtained bulk RNA-seq data from seven cancer cohorts32,33,34,35,36,37,38, comprising a total of 789 patients. Within each cohort, we converted counts of each gene to TPM and normalized them to zero mean and unit standard deviation. We collected survival outcomes (both progression-free and overall) and radiologic-based responses (partial/complete responders and non-responders with stable/progressive disease) for each patient.Evaluation of signature in bulk RNA-seq cohortsWe scored each bulk transcriptome (sample) for the signature using the strategy described in ref. 39. Specifically, the score was defined as the Spearman correlation between the normalized expression and differential expression in the signature. We stratified patients into two groups using the median score for patient stratification. Kaplan–Meier plots were generated using these stratifications, and the significance of survival differences was assessed using the log-rank test. To demonstrate the association of signature levels with radiological response, we plotted signature levels separately for non-responders, partial-responders, and responders.Evaluating Augur Signature in Bulk RNA-Seq CohortsA differential signature was derived for Augur’s top prediction, plasma cells, using a procedure analogous to the one described above for scDist. This plasma signature was then assessed in bulk RNA-seq cohorts following the same evaluation strategy as applied to the scDist signature.Statistics and reproducibilityNo statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The Investigators were not blinded to allocation during experiments and outcome assessment.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Robust identification of perturbed cell types in single-cell RNA-seq data

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Hot Topics

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Popular Articles

Turbocharging protein binding site prediction with geometric attention, inter-resolution transfer learning, and homology-based augmentation | BMC Bioinformatics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models