A unified model for interpretable latent embedding of multi-sample, multi-condition single-cell data

The GEDI frameworkConsider a dataset with measurements for G genes/events in N cells. Let the column vector yn∈ℝG denote the expression values for G genes in cell n∈{1,…,N}. The vectors yn (n∈{1,…,N}) together form the matrix Y∈ℝG×N, in which each column n can be considered an observation in a G-dimensional space. We further assume that these observations lie near a lower-dimensional manifold, so that some function ψ can reconstruct each yn from a lower-dimensional column vector bn∈ℝK (K < G):$${{{{\rm{\psi }}}}}_{\theta }:{{\mathbb{R}}}^{K}\to {{\mathbb{R}}}^{G}$$
(1)
$${{{{\bf{y}}}}}_{n}\cong {{{{\rm{\psi }}}}}_{\theta }\left({{{{\bf{b}}}}}_{n}\right)$$
(2)
Here, θ is the set of parameters that define the manifold, and bn represents the embedding of the n’th observation on this manifold.Furthermore, the N cells in the dataset may belong to different samples. Each sample i∈{1,…,Q} may have a different (distorted) manifold, defined by the parameter set θi = θr + Δθi. Therefore, when the cells are derived from multiple samples, each observation yn can be modeled as:$$\theta={\theta }_{r}+\triangle {\theta }_{i\left(n\right)}$$
(3)
$${{{{\bf{y}}}}}_{n}\cong {{{{\rm{\psi }}}}}_{\theta }\left({{{{\bf{b}}}}}_{n}\right)$$
(4)
Here, Δθi(n) represents the difference between θr, the parameter set that defines a “reference” manifold, and θi(n), the parameter set that defines the manifold of the sample to which cell n belongs (we denote the sample from which the cell n is derived as i(n)). This formulation allows direct mapping between the cells (each defined by a constant embedding b) across multiple samples (through sample-specific manifold parameterization).The general concept above can potentially be adapted to various parametric manifold learning methods; GEDI represents a specific case, in which the gene expression manifold is modeled as a K-dimensional hyperplane (with the option to further restrict the manifold shape, as described later). This is achieved by defining the function ψ as:$$\theta=\left\{{{{\bf{o}}}},{{{\bf{Z}}}}\right\}$$
(5)
$${{{{\rm{\psi }}}}}_{\theta }\left({{{{\bf{b}}}}}_{n}\right)={{{\bf{o}}}}{{{\boldsymbol{+}}}}{{{\bf{Z}}}}{{{{\bf{b}}}}}_{n}$$
(6)
In a multi-sample analysis, sample-specific parameter sets are defined as:$${\theta }_{i}=\left\{{{{{\bf{o}}}}}_{r}+\triangle {{{{\bf{o}}}}}_{i},{{{{\bf{Z}}}}}_{r}+\triangle {{{{\bf{Z}}}}}_{i}\right\}$$
(7)
$$\Theta=\left\{{\theta }_{1},\ldots,{\theta }_{Q}\right\}$$
(8)
Here, the column vector or∈ℝG represents the origin point (center) on a reference hyperplane, and Δoi represents the sample-specific translation of the origin point. Each of the columns of the matrix Zr∈ℝG×K represents a vector that originates from point or and lies on the reference hyperplane. By default, GEDI restricts these K vectors to be orthogonal to each other, effectively forming the orthogonal axes of a coordinate frame in which the position of each point on the reference manifold can be specified as bn. ΔZi represents the sample-specific transformations of this coordinate frame (excluding translation, which is specified by Δoi).Thus, GEDI approximates each observation yn as:$${{{{\bf{y}}}}}_{n}\cong {{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}$$
(9)
More precisely, yn is modeled as an observation drawn from a spherical multivariate normal distribution whose mean is located on the manifold of sample i(n):$${{{{\bf{y}}}}}_{n}\left|{{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right),{\sigma }^{2}{{{\bf{I}}}}\right)$$
(10)
$${{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)={{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}+{s}_{n}{{{{\bf{1}}}}}_{G}$$
(11)
Note the addition of the term sn1G here, which serves as a cell-specific intercept; sn is a scalar (representing library size), and 1 G is a column vector of 1’s; 1 G = {1,…,1}∈ℝG.The column vectors bn∈{1,…,N} together form the matrix B∈ℝK×N, in which each column n can be considered the embedding of the cell n in the manifold. Since the scales of Zr + ΔZi and B are redundant (each column of Zr + ΔZi can be scaled by some constant c and the corresponding row of B can be scaled by c–1 without changing the model likelihood), GEDI restricts B such that each row forms a unit vector:$${{{\bf{B}}}}\in \left\{{{\mathbb{R}}}^{K\times N},\Big|,\,\forall k\in \left\{1,\ldots,K\right\}\,{{\sum}_{n=1}^{N}}{\left({b}_{k,n}\right)}^{2}=1\,\right\}$$
(12)
Other constraints may also be added to further limit the shape of the manifold. For example, B may be restricted to the points on a ellipsoid; in other words:$$ {{{\bf{B}}}}\in \left\{{{\mathbb{R}}}^{K\times N},|,\,\forall k\in \left\{1,\ldots,K\right\}\,{{\sum}_{n=1}^{N}}{\left({b}_{k,n}\right)}^{2}=1\,\right\} \\ \qquad \cap \left\{{{\mathbb{R}}}^{K\times N},|,\exists {{{\bf{d}}}}{{{\boldsymbol{\in }}}}{{\mathbb{R}}}_{ > 0}^{K}\,s.t.\,\forall n\in \left\{1,\ldots,N\right\}\,{{\sum}_{k=1}^{K}}{\left({b}_{k,n}/{d}_{k}\right)}^{2}=1\right\}$$
(13)
Here, the vector d contains the lengths of the semi-axes of the ellipsoid, with dk representing the kth element of d.Modeling the reference manifold as a function of gene-level variablesThe reference manifold itself may be approximated as a function of gene-level prior knowledge, such as gene regulatory networks or pathway memberships, by expressing the parameter set θr as a function of gene-level variables. To this end, GEDI expresses Zr as a probabilistic function of C∈ℝG×P, where C is a matrix representing gene-level prior information matrix:$${{{{\bf{z}}}}}_{r,k}\left|{{{{\bf{a}}}}}_{k}\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{C}}}}{{{{\bf{a}}}}}_{k},{\sigma }^{2}{S}_{Z}{{{\bf{I}}}}\right)$$
(14)
$${{{{\bf{a}}}}}_{k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{A}{{{\bf{I}}}}\right)$$
(15)
Here, zr,k∈ℝG is the k’th column of Zr, and ak∈ℝP is a vector column whose P elements represent the contribution of each of the P columns of C toward determining the direction of the axis vector zr,k—this formulation is similar to (and inspired by) that used by PLIER8 for pathway-level information extraction from gene expression data. SZ is a hyperparameter that determines the variance of zr,k relative to the model variance σ2. Similarly, SA is a hyperparameter that determines the variance of the prior distribution of ak relative to the model variance σ2 (see Supplementary Methods for the choice of SZ, SA, and other hyperparameters).The column vectors ak∈{1,…,K} together form the matrix A∈ℝP×K. For example, C may represent a (weighted) regulatory network connecting P transcription factors to G genes, in which case the element ap,k of the matrix A corresponds to the contribution of the network of the transcription factor p toward the axis vector zr,k. Abn can be considered as the projected “activity” of the P transcription factors in the cell n (Abn∈ℝP). In other words, αp(b)=ap,.b is the function that provides the projected activity of the transcription factor p at coordinate b in the Zr coordinate system, where ap,. is p’th row of A. Consequently, the gradient of the activity of transcript factor p in the Zr coordinate system (∇αp) is ap,.⊤, which can be transformed to the gene expression coordinate system as Zr∇αp= Zrap,.⊤.In the absence of gene-level prior information, the following prior is used for zr,k:$${{{{\bf{z}}}}}_{r,k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{Z}{{{\bf{I}}}}\right)$$
(16)
Modeling sample-specific distortions of the manifold as a function of sample-level variablesThe difference between the manifold of each sample i and the reference manifold can be expressed using the difference of the parameter sets that define these manifolds, i.e., Δθi. In the case of GEDI, we have: Δθi ={Δoi,Δzi,1,…, Δzi,K}. Each of the components can in turn be expressed as a function of sample-level variables:$${{{{\boldsymbol{\triangle }}}}{{{\bf{o}}}}}_{i}\left|{{{{\bf{R}}}}}_{o}\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{R}}}}}_{o}{{{{\bf{h}}}}}_{i},{\sigma }^{2}{S}_{\triangle {o}_{i}}{{{\bf{I}}}}\right)$$
(17)
$${{{{\bf{R}}}}}_{o}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{{R}_{o}}{{{\bf{I}}}}\right)$$
(18)
$${{{{\boldsymbol{\triangle }}}}{{{\bf{z}}}}}_{i,k}\left|{{{{\bf{R}}}}}_{k}\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{R}}}}}_{k}{{{{\bf{h}}}}}_{i},{\sigma }^{2}{S}_{\triangle {Z}_{i}}{{{\bf{I}}}}\right)$$
(19)
$${{{{\bf{R}}}}}_{k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{{R}_{k}}{{{\bf{I}}}}\right)$$
(20)
Here, hi∈ℝL is a column vector whose elements represent the values of L variables for sample i. Ro∈ℝG×L and Rk∈ℝG×L are matrices that represent the effects of the L variables on Δoi and Δzi,k, respectively. SΔoi and SΔzi are sample-specific hyperparameters that specify the variance of the Δoi and Δzi,k relative to the model variance σ2. Similarly, SRo and SRk are hyperparameters that determine the variance of the prior distributions of Ro and Rk relative to the model variance σ2.In the absence of sample-level variables, Δoi and Δzi,k are modeled using the following prior distributions:$${{{{\boldsymbol{\triangle }}}}{{{\bf{o}}}}}_{i}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{\triangle {o}_{i}}{{{\bf{I}}}}\right)$$
(21)
$${{{{\boldsymbol{\triangle }}}}{{{\bf{z}}}}}_{i,k}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{\bf{0}}}},{\sigma }^{2}{S}_{\triangle {Z}_{i}}{{{\bf{I}}}}\right)$$
(22)
Direct inference from count dataConsider the count matrix M∈ℤG×N, with each element mg,n generated from a Poisson distribution with mean λg,n. GEDI models each λg,n as a latent variable drawn from a log-normal distribution, so that yn = (logλ1,n,…,logλG,n)T∈ℝG follows a spherical multivariate normal distribution as in the previous sections. Thus:$${m}_{g,n} \sim {{{\rm{Pois}}}}\left({e}^{{y}_{g,n}}\right)$$
(23)
$${{{{\bf{y}}}}}_{n}\left|{{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right),{\sigma }^{2}{{{\bf{I}}}}\right)$$
(24)
$${{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)={{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}+{s}_{n}{{{{\bf{1}}}}}_{G}$$
(25)
Inference from paired UMI countsConsider the count matrices M1∈ℤG×N and M2∈ℤG×N, with each pair of elements m1,g,n and m2,g,n corresponding to success and failure counts, respectively, in mg,n = m1,g,n + m2,g,n Bernoulli trials with success probability pg,n. GEDI models yn = (logitp1,n,…,logitpG,n)T∈ℝG as a latent variable:$${m}_{1,g,n} \sim {{{\rm{B}}}}\left({m}_{g,n},{{{\rm{S}}}}\left({y}_{g,n}\right)\right)$$
(26)
$${{{{\bf{y}}}}}_{n}\left|{{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)\right.{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right),{\sigma }^{2}{{{\bf{I}}}}\right)$$
(27)
$${{{{\boldsymbol{\mu }}}}}_{n}\left(\Theta \right)={{{{\bf{o}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{o}}}}}_{i\left(n\right)}{{{\boldsymbol{+}}}}\left({{{{\bf{Z}}}}}_{r}{{{\boldsymbol{+}}}}\triangle {{{{\bf{Z}}}}}_{i\left(n\right)}\right){{{{\bf{b}}}}}_{n}+{s}_{n}{{{{\bf{1}}}}}_{G}$$
(28)
Here, S is the sigmoid function.Obtaining maximum a posteriori estimates of model parametersWhen the gene expression matrix Y is provided and no gene-level or sample-level prior information is available, GEDI uses a block coordinate descent algorithm to obtain maximum a posteriori estimates for Zr, ΔZi, or, Δoi and B. When Y is latent (i.e., when M or the pair M/M′ is provided), Zr is latent (i.e., gene-level variables are provided), and/or ΔZi and Δoi are latent (i.e., sample-level variables are provided), GEDI uses expectation-maximization to obtain maximum a posteriori estimates (see Supplementary Methods for details).Datasets and preprocessingAll datasets used in the article were downloaded from the publication access codes, from the publication’s online data repositories, or from public data repositories. Quality control and preprocessing steps were performed with the scuttle package45 (v 1.0.4). For further details, see Supplementary Methods.Integration methods and benchmarksWe compared the integration performance of GEDI against several other methods, including Seurat46, LIGER40, Harmony41, BBKNN47, scVI39, Scanorama48, and CSS38, as well as PCA (no integration) as a baseline. We ran each method following the documentation obtained from available tutorials or paper methods. Unless specified, we used the default parameters established by each package. For further details, see Supplementary Methods.We measured the ability of each method to remove technical variability while preserving biological variation associated with the cell types. To do this, we followed the approach established by previous integration benchmark efforts16,17, which used metrics that can be grouped into two broad categories: (a) removal of batch effects and (b) conservation of biological variance. Metrics from group (a) included alignment score (batch), iLISI, kBET, and ASW (batch), while group (b) metrics included alignment score (cell type), cLISI, ARI, NMI and ASW (cell type). For further details on each individual metric, see the Supplementary Methods section.For each method, we defined an overall score that summarized the performance of the multiple metrics, following the approach established by Luecken et al.16. Briefly, we first rescaled the output of every metric to range from 0 to 1, which ensures that each metric is equally weighted within a partial score and has the same discriminative power. The rescaling was done using the transformation y’=[y-min(Y)]/ [max(Y)-min(Y)]. Then, we defined the ‘Batch’ score’ and the ‘Bio’ score, representing the average for the metrics belonging to the groups (a) and (b) above, respectively. For a given integration method, we calculated the overall score as previously defined by Luecken et al.16, where a weighted average of the two Bio and Batch scores was used, with a weight of 0.6 for the Bio score and 0.4 for the Batch score (integration metrics using different weights can be found in Supplementary Fig. 3). Additional details are found in Supplementary Methods.Differential expression analysisWe performed pseudobulk differential expression (DE) analysis on the COVID-19 data using DESeq249 (v.30.1). For each cell type and donor combination, we created a pseudobulk using the ‘aggregateAcrossCells’ function from scuttle45. Only cell types that were present in all three conditions (control, mild and severe COVID-19) were considered for DE analysis.Cluster-free differential expression benchmarkWe conducted a systematic analysis to evaluate the performance of GEDI in clustering-free DGE analysis. Our assessment included the comparison of GEDI to two clustering-free DGE methods: LEMUR24 and miloDE25, as well as a comparison to DESeq2-based pseudobulk analysis. First, we developed a generative model that simulates cohort-level scRNA-seq data while preserving characteristics such as gene-gene and cell-cell correlations observed in real data, as discussed in detail in Supplementary Methods. Briefly, for each sample i, we simulate K archetypes representing a set of vertices on the gene expression manifold. If we represent the expression of each gene g across the K archetypes using the column vector γg,i∈ℝK, then we have:$${{{{\boldsymbol{\gamma }}}}}_{g,i}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{X}}}}}_{g}{{{{\bf{h}}}}}_{i}+{{{{\bf{x}}}}}_{g}^{{\prime} }{{{{\bf{h}}}}}_{i}{{{{\bf{1}}}}}_{K},{{{{\boldsymbol{\Sigma }}}}}_{g}\right)$$
(29)
where Xg is a matrix of coefficients (Xg∈ℝK×L) that shows, for each of the K archetypes, how the mean expression of gene g is determined by each of the L sample-level variables represented by the column vector hi. x’g∈ℝL is a row vector that represents the global effects of sample-level variables on the observed abundance of gene g, e.g., through affecting ambient RNA abundances or other artifacts50. Σg∈ℝK×K is a covariance matrix for gene g.Each cell n is then modeled as a probabilistic function of the weighted average of the archetypes of its sample:$${y}_{g,n}{{{\mathscr{ \sim }}}}{{{\mathcal{N}}}}\left({{{{\bf{w}}}}}_{n}{{{\boldsymbol{\cdot }}}}{{{{\boldsymbol{\gamma }}}}}_{g,i\left(n\right)},{\sigma }_{g}^{2}\right)$$
(30)
where wn∈ℝ+K is a row vector of weights connecting cell n to each of the K archetypes (Σkwn,k = 1), and σ2g is a gene-specific variance. Thus, the ground-truth DE effect of the L sample-level variables on gene g in cell n can be derived as the row-vector δn,g=wnXg.Finally, for each cell n, the UMI counts are simulated as:$${p}_{g,n}=\frac{{c}^{{y}_{g,n}}}{{\sum }_{g=1}^{G}{c}^{{y}_{g,n}}}$$
(31)
$${{{{\bf{m}}}}}_{n} \sim {{{{\rm{Multinom}}}}}_{G}\left({M}_{n},{{{{\bf{p}}}}}_{n}\right)$$
(32)
where c is the logarithmic basis for the simulated log-scale profiles and Mn is the total UMI count for cell n. This framework allows us to simulate UMI counts starting from a given set of Xg, Σg, and σg for all g∈{1,…,G}, and wn and Mn for all n∈{1,…,N}; we call these the simulation parameters, which can be modified to obtain different simulated datasets with varying numbers of cells, cell-cell or gene-gene correlation structures, cellular diversities, or ground-truth DE effects. In this work, we used the COVID-19 cohort 1 dataset as template to derive realistic simulation parameters, as described in Supplementary Methods, and generated a simulated dataset with the same number of cells and samples as those of the COVID-19 dataset. We then applied GEDI, LEMUR, miloDE, and DESeq2-based pseudobulk analysis to this dataset (see Supplementary Methods), and compared the differential expression estimates from each method to the ground truth DE vectors at the single-cell (GEDI and LEMUR), neighborhood (GEDI, LEMUR and miloDE) and cell-type levels (GEDI, LEMUR and DEseq2). For further details, see Supplementary Methods.Cell type signature analysisFor the generation of cell-type signatures in PBMC data, we performed DE analysis using DESeq2 for each donor. Our model compared the mean expression of a given cell type versus the average of other cell types, including the batch variable (different sequencing technologies) as a covariate in the model. We considered genes with FDR adjusted p- value < 0.05 and log2 fold-change >1 to define cell type markers. For a given donor, we defined a binary matrix that contained the cell-type markers, which was used as input for applying GEDI on the other donor.To perform the enrichment of cell type signatures and TF activities, we applied the scoreMarkers function from scran51 (v.1.30.0) using the inferred cell type/TF activities by GEDI.Gene regulatory network analysisFor analysis of transcription factor (TF) networks (in PBMCs), we downloaded the human DoRothEA gene-regulatory network52 and restricted the TF-gene interactions to the high-confidence sets A, B, and C, as previously defined. We then refined this TF-gene interaction matrix, using the whole blood data from GTEx53, to obtain a generative gene regulatory model in which the expression profile of each gene is a function of the expression of the TFs that are linked to it. Specifically, the target gene expression matrix Y∈ℝG×N (which is TMM-normalized54 and converted to log-scale) is modeled as Y ~ CA, where A∈ℝP×N is the matrix of the expression of P transcription factors across N whole blood samples from GTEx, and C∈ℝG×P is a weighted regulatory network whose elements are restricted to have the same sign as the regulatory interactions obtained from DoRothEA (cg,p ≥ 0 for an activating interaction between transcription factor p and gene g, cg,p ≤ 0 for an inhibitory interaction, and cg,p = 0 for no interaction). C was further filtered to include only TFs (columns) that have at least 10 “substantial” regulatory interactions, where we define a substantial effect as having an absolute value > 0.1.For analysis of post-transcriptional regulatory networks, we defined a regulatory network of RNA binding proteins (RBPs) and microRNAs (miRNAs) based on their potential interactions with mRNA 3’ UTRs. For the RBPs, we downloaded the set of known motifs for all human RBPs from CISBP-RNA55, and filtered them to include only the non-redundant set of motifs described in ref. 56. Then, for each human gene, the transcript with the longest 3’ UTR was identified. Genes whose longest 3’ UTR was shorter than 650nt were removed, and the first 650nt of the 3’ UTR (i.e., the 650nt region immediately downstream of the stop codon) of the remaining genes was used for motif scanning, using AffiMx57. The miRNA regulons were obtained from ref. 34.Assessment of estimated TF activitiesWe evaluated the accuracy of GEDI at estimating changes in TF activity by analyzing a previously published single-cell perturbation dataset28. Our evaluation also included a comparison to decoupleR58 (v.2.8.0), an ensemble of computational frameworks to infer biological activities. The methods we included in our evaluation were “aucell”, “gsva”, “mlm”, “ora”, “ulm”, “viper”, “wmean”, and “wsum”. For wmean and wsum, we utilized the normalized outputs “norm_wmean” and “norm_wsum”.The dataset consisted of two technical batches which were split and analyzed separately. One batch was used to generate a gene signature for each TF and also measure the degree of association of each TF with the principal axes of heterogeneity of the data, while the other was used to estimate TF activities using the learned gene regulatory effects. We opted to learn the gene signature of each TF from the data (as opposed to using an external gene regulatory network) to ensure that our results purely reflected the performance of the activity inference methods without the confounding effect of the quality of the external gene network. At the same time, by using one batch for learning gene-TF associations and the other for evaluating activity inferences, we aimed to avoid circularity. These steps were repeated by swapping the batches. To generate the gene signature of each TF, we performed differential expression analysis for each TF between the cells with perturbed TF expression versus unperturbed cells, using the scoreMarkers function from scran. The mean standardized Cohen’s d log-fold-change was used to define the regulatory effect of a TF on each gene. To identify the TFs whose perturbation is associated with a significant change in the principal axes of heterogeneity of the single-cell RNA-seq data, we ran PCA on the normalized expression data and retrieved the first 40 principal components. Next, for each TF, we fitted a logistic regression model to predict the TF perturbation status of a cell using its the principal component scores, and used a likelihood-ratio test to compare this model to a reduced model with only an intercept.Single-cell TF activities were estimated for each method using the learned gene-regulatory signatures. For methods that only accept a gene list for each TF (aucell, fgsea, gsva and ora), we defined the downstream targets of each TF as the set of genes with log2 fold-change <–0.3 in TF-perturbed cells in comparison to unperturbed cells. For the methods from decoupleR, normalized expression values were used as input, while for GEDI we used raw counts.Differential analysis of cell type-specific cassette exon splicingTo perform DE between GABAergic and Glutamatergic cells in the Tasic data, we applied limma59 (v.3.46) using the imputed value of the latent splicing matrix from GEDI, which represents the logit of PSI (percent-spliced-in). We applied the lmFit and eBayes functions to obtain DE estimates using the default parameters.GSEA analysis was performed using fgsea60 (v.1.16), with the following arguments: minSize=15, eps=0, and maxSize=500. For the analysis of cell-type exon inclusion events using the Tasic data, we used the M5 ontology gene sets from MSigDB61 (m5.all.v2022.1.Mm.symbols.gmt), using the t-statistics from limma as the gene ranks.Sashimi plots were generated using sashimipy62 (v 0.0.6). For the Tasic dataset, we selected a subset of 100 cells for each cell type and dataset combination, generated a merged BAM file, and provided it as input to sashimipy.Statistics and ReproducibilityUnless specified, cells were removed from the downloaded datasets based on standard quality control criteria (% of mitochondrial reads <20%; number of UMIs <1000, and number of detected genes <1000). No samples were collected in this study. No statistical method was used to predetermine sample size. Computational experiments and analysis are reproducible using the notebooks and code provided.Reporting summaryFurther information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Hot Topics

Related Articles