scConfluence: single-cell diagonal integration with regularized Inverse Optimal Transport on weakly connected features

Notations

For two vectors \({{\bf{u}}}\in {{\mathbb{R}}}^{{n}_{u}}\) and \({{\bf{v}}}\in {{\mathbb{R}}}^{{n}_{v}}\), we use the notations \({\left({{\bf{u}}}\otimes {{\bf{v}}}\right)}_{{ij}}={u}_{i}{v}_{j}\) and \({\left({{\bf{u}}}\oplus {{\bf{v}}}\right)}_{{ij}}={u}_{i}+{v}_{j}\). For two matrices \({{\bf{U}}}\in {{\mathbb{R}}}^{n\times d}\) and \({{\bf{V}}}\in {{\mathbb{R}}}^{n\times d}\) of identical dimensions, we use the scalar product notation \(\left\langle \cdot,\cdot \right\rangle\) to denote the Frobenius inner product \(\left\langle {{\bf{U}}},{{\bf{V}}}\right\rangle={\sum }_{i,j}{U}_{{ij}}{V}_{{ij}}\).

Optimal transport

Optimal Transport (OT), as defined by Monge85 and Kantorovich86, aims at comparing two probability distributions by computing the plan that transports one distribution onto the other with minimal cost. While OT theory has been developed in the general case of positive measures, our application only involves point clouds, which are uniform discrete measures \({\sum }_{i=1}^{n}\frac{1}{n}{{{\rm{\delta }}}}_{{{{\bf{a}}}}_{{{\rm{i}}}}}\) where the set of \({{{\bf{a}}}}_{i}\) is the support of the point cloud. Therefore, to avoid unnecessarily heavy notations, we denote a probability measure simply by its set of positions \({{\bf{a}}}\). The classical OT distance, also known as the Wasserstein distance, between two point clouds \({{\bf{a}}}\in {{\mathbb{R}}}^{{n}_{{\mathbb{1}}}\times d}\) and \({{\bf{b}}}\in {{\mathbb{R}}}^{{n}_{{\mathbb{2}}}\times d}\) is defined as:$${\mbox{OT}}\left({{\bf{a}}},\, {{\bf{b}}},\, c\right)={\min }_{{{\bf{P}}}\in \Pi \left({n}_{1},{n}_{2}\right)}\left\langle {{\bf{P}}},\, {{\bf{c}}}\left({{\bf{a}}},\, {{\bf{b}}}\right)\right\rangle$$
(2)
Where \(\Pi \left({n}_{1},{n}_{2}\right)=\{{{\bf{P}}}\in {{\mathbb{R}}}_{+}^{{n}_{{\mathbb{1}}}\times {n}_{{\mathbb{2}}}}\ {{\rm{s}}}.{{\rm{t}}}.\ {{\bf{P}}}{{\bf{1}}}=\frac{1}{{n}_{1}}{{\bf{1}}},\ {{{\bf{P}}}}^{{{\rm{T}}}}{{\bf{1}}}=\frac{1}{{n}_{2}}{{\bf{1}}}\}\) and \(c\) is a ground cost function used to compute the pairwise dissimilarity matrix \({{\bf{c}}}\left({{\bf{a}}},{{\bf{b}}}\right)={\left(c\left({{{\bf{a}}}}_{i},{{{\bf{b}}}}_{j}\right)\right)}_{1\le i\le {n}_{1},1\le j\le {n}_{2}}\in {{\mathbb{R}}}_{+}^{{n}_{{\mathbb{1}}}\times {n}_{{\mathbb{2}}}}\), which encodes the cost of transporting mass from one point (e.g. cell) to another. In this uniform discrete case, the coupling \({{\bf{P}}}\in \Pi \left({n}_{1},{n}_{2}\right)\) is a matrix that describes how the mass of the point cloud \({{\bf{a}}}\) is moved from one point to another in order to transform \({{\bf{a}}}\) into \({{\bf{b}}}\). As real data often contain outliers to which OT is highly sensitive, a more robust extension of OT called unbalanced OT33 has been developed, in which the hard marginal constraints are relaxed:$${{\mathrm{OT}}}^{\tau }\left({{\bf{a}}},\, {{\bf{b}}},\, c\right)={\min }_{{{\bf{P}}}\in {{\mathbb{R}}}_{+}^{{n}_{1}\times {n}_{2}}}\left\langle {{\bf{P}}},\, {{\bf{c}}}\left({{\bf{a}}},\, {{\bf{b}}}\right)\right\rangle+\tau {{\rm{D}}}\left({{\bf{P}}}{{\bf{1}}}\left|\frac{1}{{n}_{1}}\right.{{\bf{1}}}\right)+\tau {{\rm{D}}}\left({{{\bf{P}}}}^{{{\rm{T}}}}{{\bf{1}}}\left|\frac{1}{{n}_{2}}\right.{{\bf{1}}}\right)$$
(3)
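To make these definitions concrete, below is a minimal sketch using the POT Python package (the same library used later in "Optimal transport solvers"). The toy point clouds, their sizes, and the transported-mass fraction are illustrative assumptions, and the TV-type relaxation of (Eq. 3) is approximated here through POT's partial OT solver.

```python
# Toy illustration of Eq. (2) and Eq. (3) with POT; point clouds, sizes and the
# transported-mass fraction m are made up for the example.
import numpy as np
import ot          # POT: Python Optimal Transport
import ot.partial

rng = np.random.default_rng(0)
a_pts = rng.normal(size=(100, 2))            # point cloud a (n1 = 100)
b_pts = rng.normal(size=(80, 2)) + 2.0       # point cloud b (n2 = 80), shifted

a_w, b_w = ot.unif(100), ot.unif(80)         # uniform weights 1/n1 and 1/n2
C = ot.dist(a_pts, b_pts)                    # squared L2 cost matrix c(a, b)

# Balanced OT (Eq. 2): hard constraints on both marginals.
P_balanced = ot.emd(a_w, b_w, C)

# Unbalanced OT with a TV-type relaxation (Eq. 3), here via partial OT: only a
# fraction m of the mass is transported, so outliers can be discarded.
P_partial = ot.partial.partial_wasserstein(a_w, b_w, C, m=0.5)

print(P_balanced.sum(), P_partial.sum())     # ~1.0 vs ~0.5
```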
In (Eq. 3), \(\tau\) is a positive parameter controlling the looseness of the relaxation. In this formulation, the hard constraint on the marginals of the optimal plan is replaced with a soft penalization \({\mbox{D}}\), which measures the discrepancy between the marginals of the transport plan \({\bf{P}}\) and the uniform distributions on \({{\bf{a}}}\) and \({{\bf{b}}}\). While setting \(\tau=+ \infty\) recovers the balanced OT problem (Eq. 2), using \(\tau < +\infty\) allows the transport plan to discard outliers and deal with unbalanced populations. Indeed, in (Eq. 3), unbalanced OT achieves a tradeoff between the constraint to conserve mass by transporting all of \({{\bf{a}}}\) onto \({{\bf{b}}}\) and the aim of minimizing the cost of transport. When an outlier is too costly to transport, it is therefore discarded from the plan. A classical choice for \({\mbox{D}}\) is the Kullback–Leibler divergence, defined for two discrete probability distributions represented as vectors of probabilities \({\bf{p}}\) and \({\bf{q}}\) as \({{\rm{KL}}}\left({{\bf{p}}}|{{\bf{q}}}\right)={\sum }_{i}{p}_{i}\log \left(\frac{{p}_{i}}{{q}_{i}}\right)\). The Total Variation (TV) distance, defined as \({{\rm{TV}}}\left({{\bf{p}}},{{\bf{q}}}\right)={\sum }_{i}\left|{p}_{i}-{q}_{i}\right|\), is also frequently used. The main difference between those two options is that when using TV, each point is either fully transported or discarded, while using KL leads to transporting for each point a fraction of the mass which smoothly decreases as the cost of transport increases. We use both in different parts of our method (see “Optimal transport solvers”). Adding an entropic regularization to the objective function of (Eq. 2) results in a new optimization problem denoted \({\mbox{O}}{{\mbox{T}}}_{\varepsilon }\left({{\bf{a}}},{{\bf{b}}},c\right)\), where \(\varepsilon\) is a positive parameter quantifying the strength of the regularization.$${{\mathrm{OT}}}_{\varepsilon }\left({{\bf{a}}},\, {{\bf{b}}},\, c\right)={\min }_{{{\bf{P}}}\in \Pi \left({n}_{1},\, {n}_{2}\right)}\left\langle {{\bf{P}}},\, {{\bf{c}}}\left({{\bf{a}}},\, {{\bf{b}}}\right)\right\rangle+\varepsilon {{\rm{KL}}}\left({{\bf{P}}}|{{\bf{a}}}\otimes {{\bf{b}}}\right)$$
(4)
While setting \(\varepsilon=0\) recovers the unregularized OT problem (Eq. 2), using \(\varepsilon > 0\) makes the problem \(\varepsilon\)-strongly convex. It can then be solved computationally much faster than its unregularized counterpart with the GPU-enabled Sinkhorn algorithm87. This entropic regularization can be used in the same fashion in (Eq. 3) to obtain the following problem:$${{\mbox{O}}}{{\mbox{T}}}_{\varepsilon }^{\tau }\left({{\bf{a}}},\, {{\bf{b}}},\, c\right)= {\min }_{{{\bf{P}}}\in {{\mathbb{R}}}_{+}^{{n}_{1}\times {n}_{2}}}\left\langle {{\bf{P}}},\, {{\bf{c}}}\left({{\bf{a}}},\, {{\bf{b}}}\right)\right\rangle+\varepsilon {{\rm{KL}}}\left({{\bf{P}}}{{|}}{{\bf{a}}}\otimes {{\bf{b}}}\right)+\tau {{\rm{D}}}\left({{\bf{P}}}{{\bf{1}}}|\frac{1}{{n}_{1}}{{\mathbf{1}}}\right)\\ +\tau {{\rm{D}}}\left({{{\bf{P}}}}^{{{\rm{T}}}}{{\mathbf{1}}}{{|}}\frac{1}{{n}_{2}}{{\mathbf{1}}}\right)$$
(5)
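As a rough illustration of (Eq. 4) and (Eq. 5), the sketch below uses POT's Sinkhorn solvers. The \(\varepsilon\) and \(\tau\) values are arbitrary, and note that POT penalizes the entropy of \({{\bf{P}}}\) rather than \({{\rm{KL}}}\left({{\bf{P}}}|{{\bf{a}}}\otimes {{\bf{b}}}\right)\); in the balanced case the two differ only by a constant.

```python
# Entropic OT (Eq. 4) and entropic unbalanced OT with KL marginal relaxation
# (Eq. 5), solved with POT's Sinkhorn algorithms; eps and tau are illustrative.
import numpy as np
import ot
import ot.unbalanced

rng = np.random.default_rng(0)
a_pts = rng.normal(size=(100, 2))
b_pts = rng.normal(size=(80, 2)) + 2.0
a_w, b_w = ot.unif(100), ot.unif(80)
C = ot.dist(a_pts, b_pts)                 # squared L2 ground cost

eps, tau = 0.05, 1.0                      # regularization / relaxation strengths

# Eq. (4): balanced entropic OT, solved with the Sinkhorn algorithm.
P_entropic = ot.sinkhorn(a_w, b_w, C, reg=eps)

# Eq. (5): entropic unbalanced OT, with KL penalties on both marginals.
P_unbalanced = ot.unbalanced.sinkhorn_unbalanced(a_w, b_w, C, reg=eps, reg_m=tau)

print(P_entropic.sum(), P_unbalanced.sum())   # the second sum can deviate from 1
```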
While \({\mbox{O}}{{\mbox{T}}}_{\varepsilon }^{\tau }\) provides a scalable (thanks to the Sinkhorn algorithm) and robust (thanks to the unbalanced relaxation) way to estimate the distance between point clouds, it shouldn’t be used as is for machine learning applications. Indeed, it suffers from a bias when \(\varepsilon > 0\) and is not a proper metric for measures. In particular, \({\mbox{O}}{{\mbox{T}}}_{\varepsilon }^{\tau }\left({{\bf{a}}},{{\bf{a}}},c\right) > 0\). To solve this issue, a debiased version of (Eq. 5) has been introduced as the unbalanced Sinkhorn divergence44:$${{\mbox{S}}}_{\varepsilon }^{\tau }\left({{\bf{a}}},{{\bf{b}}},c\right)={{\mathrm{OT}}}_{\varepsilon }^{\tau }\left({{\bf{a}}},{{\bf{b}}},c\right)-\frac{1}{2}{{\mathrm{OT}}}_{\varepsilon }^{\tau }\left({{\bf{a}}},{{\bf{a}}},c\right)-\frac{1}{2}{{\mathrm{OT}}}_{\varepsilon }^{\tau }\left({{\bf{b}}},{{\bf{b}}},c\right)$$
(6)
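In practice this divergence can be evaluated with the GeomLoss package (see "Optimal transport solvers" below). A minimal sketch, with illustrative tensors and the hyperparameter values reported later in the Methods (in GeomLoss conventions, \(\varepsilon\) = blur**p and \(\tau\) = reach**p):

```python
# Debiased unbalanced Sinkhorn divergence (Eq. 6) with GeomLoss; the embeddings
# and the blur/reach/scaling values are illustrative.
import torch
from geomloss import SamplesLoss

x = torch.randn(100, 16)          # point cloud a
y = torch.randn(80, 16) + 0.5     # point cloud b

S_eps_tau = SamplesLoss(
    loss="sinkhorn",   # entropic-OT-based divergence
    p=2,               # squared L2 ground cost
    blur=0.01,         # entropic regularization scale (eps = blur**p)
    reach=0.3,         # unbalanced relaxation scale (omit for the balanced case)
    scaling=0.8,       # accuracy of the multiscale Sinkhorn solver
    debias=True,       # subtract the self-transport terms as in Eq. (6)
)
value = S_eps_tau(x, y)   # differentiable scalar, usable as a training loss
```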
The Sinkhorn divergence \({{\mbox{S}}}_{\varepsilon }^{\tau }\), on the other hand, is very well suited to define geometric loss functions for fitting parametric models in machine learning applications. Not only is it robust and scalable, but it also satisfies crucial theoretical properties such as being positive, definite, convex, and metrizing convergence in law. To designate optimal transport problems, we use the unified notations \({\mbox{O}}{{\mbox{T}}}_{\varepsilon }^{\tau }\) and \({{\mbox{S}}}_{\varepsilon }^{\tau }\) for all cases, with \(\tau=+ \infty\) referring to the balanced case and \(\varepsilon=0\) referring to the unregularized case.

scConfluence

scConfluence takes as input data from \(M\) modalities with \(M\ge 2\), where each modality’s data comes in the form of a matrix \({{{\bf{X}}}}^{\left(p\right)}\in {{\mathbb{R}}}^{{n}^{\left(p\right)}\times {d}^{\left(p\right)}}\) whose \({n}^{\left(p\right)}\) rows correspond to cells and whose \({d}^{\left(p\right)}\) columns are the features measured in the \(p\)th modality (e.g. genes, chromatin peaks, proteins). For each modality, the vector \({{{\bf{s}}}}^{\left(p\right)}\), whose entries are the batch indexes of the cells in \({{{\bf{X}}}}^{\left(p\right)}\), is also available. Additionally, for all pairs of modalities \(\left(p,{p}^{{\prime} }\right)\), we have access to \({{{\bf{Y}}}}^{\left(p,{p}^{{\prime} }\right)}\in {{\mathbb{R}}}^{{n}^{\left(p\right)}\times {d}^{\left(p,{p}^{{\prime} }\right)}}\) and \({{{\bf{Y}}}}^{\left({p}^{{\prime} },p\right)}\in {{\mathbb{R}}}^{{n}^{\left({p}^{{\prime} }\right)}\times {d}^{\left(p,{p}^{{\prime} }\right)}}\), which correspond to \({{{\bf{X}}}}^{\left(p\right)}\) and \({{{\bf{X}}}}^{\left({p}^{{\prime} }\right)}\) translated to a common feature space. The method used to obtain \({{{\bf{Y}}}}^{\left(p,{p}^{{\prime} }\right)}\) for each modality is detailed in the “Building the common features matrix” section. scConfluence leverages all these inputs simultaneously, but in different components, to learn low-dimensional cell embeddings \({{{\bf{Z}}}}^{\left(p\right)}\in {{\mathbb{R}}}^{{n}^{\left(p\right)}\times {d}_{z}}\) in a shared latent space of dimension \({d}_{z}\). For each modality \(p\), we use one autoencoder \({\rm AE}^{\left(p\right)}\) on \({{{\bf{X}}}}^{\left(p\right)}\) with modality-specific architectures and reconstruction losses \({{{\mathcal{L}}}}_{{\rm AE}_{p}}\) (see the “Training details” section). While variational autoencoders have become widely popular in single-cell representation learning, we decided not to use them. Indeed, variational autoencoders are trained by optimizing the ELBO, which contains two terms: one for the reconstruction of the data and one which is the Kullback–Leibler divergence between the variational posterior and the prior distribution. This second term has been found to conflict with the reconstruction objective and to lead to worse inference abilities88. With this in mind, we used classical autoencoders with an additional regularization. In our architecture, the encoder still outputs the parameters of a Gaussian with diagonal covariance, as a variational model would, but instead of forcing this distribution to be close to an uninformative Gaussian prior, we simply add a constant (0.0001) to the output standard deviation of the posterior distribution so that our model does not converge to a deterministic encoder during training.
This stochasticity in the encoder acts as a regularization against overfitting, as it forces the decoder to learn a mapping that is robust to small deviations around latent embeddings. To handle batch effects within modalities, the batch information \({{{\bf{s}}}}^{\left(p\right)}\) is used as a covariate of the decoder, as done in existing autoencoder-based methods for omics data35. Conditioning the decoding of the latent code on its batch index allows our AEs to decouple the biological signal from the sample-level nuisance factors captured in different batches. Meanwhile, the \({{{\bf{Y}}}}^{\left(p,{p}^{{\prime} }\right)}\) matrices are leveraged to align cells across modalities using Optimal Transport. For each pair of modalities \(\left(p,{p}^{{\prime} }\right)\), we use the Pearson similarity (see Implementation details) to compute the cost matrix \({{{\bf{c}}}}_{{corr}}\left({{{\bf{Y}}}}^{\left(p,{p}^{{\prime} }\right)},{{{\bf{Y}}}}^{\left({p}^{{\prime} },p\right)}\right)\). Indeed, while the squared \({L}_{2}\) distance is classically used in OT, the Pearson similarity has been shown to better reflect differences between genomic measurements89. Using this cost matrix, we derive the unbalanced Optimal Transport plan \({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\) which reaches the optimum in \({\mbox{O}}{{\mbox{T}}}_{\tilde{\varepsilon }}^{\tilde{\tau }}\left({{{\bf{Y}}}}^{\left(p,{p}^{{\prime} }\right)},{{{\bf{Y}}}}^{\left({p}^{{\prime} },p\right)},{c}_{{corr}}\right)\). \({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\) thus provides a partial plan to match corresponding cells from different modalities in the latent space. Using the unbalanced relaxation of OT to compute \({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\) enables scConfluence to efficiently deal with cell populations present only in one modality. Indeed, cell populations that are not shared across modalities have a higher transport cost and are more likely to be part of the mass discarded by the unbalanced OT plan. Once obtained, \({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\) provides a correspondence map between modalities which determines which embeddings should be brought closer in the latent space. Since the goal of diagonal integration is to embed biologically similar cells close to one another, we introduce a loss term dedicated to this objective:$${{{\mathcal{L}}}}_{{IOT}}^{\left(p,{p}^{{\prime} }\right)}=\left\langle {{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)},\, {{{\bf{c}}}}_{{L}_{2}}\left({{{\bf{Z}}}}^{\left(p\right)},\, {{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)\right\rangle$$
(7)
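For concreteness, the term in (Eq. 7) is simply a Frobenius inner product between the precomputed plan and the matrix of pairwise squared distances between embeddings, with \({c}_{{L}_{2}}\) defined just below. A minimal PyTorch sketch with illustrative shapes, where a uniform matrix stands in for the actual unbalanced plan:

```python
# Sketch of the IOT alignment term (Eq. 7): the (fixed) plan P weights the
# squared L2 distances between embeddings; shapes and tensors are illustrative.
import torch

Zp = torch.randn(100, 16, requires_grad=True)   # embeddings Z^(p)
Zq = torch.randn(80, 16, requires_grad=True)    # embeddings Z^(p')
P = torch.full((100, 80), 1.0 / (100 * 80))     # stand-in for the plan P^(p,p')

cost = torch.cdist(Zp, Zq, p=2) ** 2            # c_L2(Z^(p), Z^(p'))
loss_iot = (P * cost).sum()                     # Frobenius inner product <P, c_L2>
loss_iot.backward()                             # gradients flow to the embeddings
```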
Here, \({c}_{{L}_{2}}\) is the squared \({L}_{2}\) distance, such that \({{{\bf{c}}}}_{{L}_{2}}\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)={\left({{||}{{{\bf{Z}}}}_{i}^{\left(p\right)}-{{{\bf{Z}}}}_{j}^{\left({p}^{{\prime} }\right)}{||}}_{2}^{2}\right)}_{1\le i\le {n}^{\left(p\right)},1\le j\le {n}^{\left({p}^{{\prime} }\right)}}\). Minimizing \({{{\mathcal{L}}}}_{{IOT}}^{\left(p,{p}^{{\prime} }\right)}\) reduces the distance only between the cell embeddings which are matched by \({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\). We add to this loss a regularization term which reduces the global distance between the set of embeddings in \({{{\bf{Z}}}}^{\left(p\right)}\) and those in \({{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\). This ensures that we do not merely juxtapose corresponding cell populations from different modalities, but that they overlap in the shared latent space. To enforce this regularization, we use the unbalanced Sinkhorn divergence (Eq. 6), as both its computational and theoretical properties make it an ideal regularization function for our goal. Together, these objectives form the following final loss, which we optimize over the parameters of the neural networks \({{\rm AE}}^{\left(p\right)}\) with stochastic gradient descent:$${\mathcal L}= {\sum }_{p \!=\! 1}^{M}{\lambda }_{p}{ {\mathcal L} }_{{{\mathrm{AE}}}({{\mathrm{p}}})}+{\sum}_{1 \! \le \! p < p^{\prime}\! \le\! M}{\lambda }_{IOT}\left\langle {{{\bf{P}}}}^{(p,p^{\prime} )},\, {{{\bf{c}}}}_{L_2}\left({{{\bf{Z}}}}^{(p)},\, {{{\bf{Z}}}}^{(p^{\prime} )}\right)\right\rangle \\ +{\lambda }_{r}{S}_{\varepsilon }^{\tau }\left({{{\bf{Z}}}}^{(p)},\, {{{\bf{Z}}}}^{(p^{\prime} )},\, {c}_{L_2}\right)$$
(8)
In (Eq. 8), \({\lambda }_{p}\), \({\lambda }_{{IOT}}\) and \({\lambda }_{r}\) are positive weights controlling the contribution of the different loss terms.

Connection to regularized Inverse Optimal Transport

Our final loss (Eq. 8) can be decomposed into two main objectives: on one side, the reconstruction losses, whose goal is to extract the maximum amount of information out of each modality; on the other side, the alignment loss \({{{\mathcal{L}}}}_{{{\rm{align}}}}\left({{{\bf{Z}}}}^{\left(p\right)},\, {{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)\), whose goal is to align cells across modalities in the shared latent space.$${{{\mathcal{L}}}}_{{{\rm{align}}}}\left({{{\bf{Z}}}}^{\left(p\right)},\, {{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)={\lambda }_{{IOT}}\left\langle {{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)},\, {{{\bf{c}}}}_{{L}_{2}}\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)\right\rangle+{\lambda }_{r}{{{\rm{S}}}}_{\varepsilon }^{\tau }\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)},{c}_{{L}_{2}}\right)$$
(9)
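As a minimal sketch (not the actual scConfluence implementation), the pairwise alignment loss of (Eq. 9) can be assembled in PyTorch from the two ingredients above, with the plan \({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\) precomputed on the common features and treated as a constant, and with the default weights reported in "Training details":

```python
# Sketch of the alignment loss (Eq. 9): IOT term + Sinkhorn-divergence
# regularization; the weights, embeddings, and plan are illustrative.
import torch
from geomloss import SamplesLoss

lambda_iot, lambda_r = 0.01, 0.1        # defaults reported in "Training details"
sinkhorn_div = SamplesLoss(loss="sinkhorn", p=2, blur=0.01, reach=0.3, scaling=0.8)

def alignment_loss(P, Zp, Zq):
    """lambda_IOT * <P, c_L2(Zp, Zq)> + lambda_r * S_eps^tau(Zp, Zq)."""
    iot_term = (P * (torch.cdist(Zp, Zq, p=2) ** 2)).sum()
    return lambda_iot * iot_term + lambda_r * sinkhorn_div(Zp, Zq)

# In the full objective (Eq. 8), this pairwise term is summed over all pairs of
# modalities and added to the weighted reconstruction losses of the autoencoders.
Zp = torch.randn(100, 16, requires_grad=True)
Zq = torch.randn(80, 16, requires_grad=True)
P = torch.full((100, 80), 1.0 / (100 * 80))      # stand-in for P^(p,p')
alignment_loss(P, Zp, Zq).backward()
```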
There is an intimate connection between \({{{\mathcal{L}}}}_{{{\rm{align}}}}\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)\) and the theory of Inverse Optimal Transport (IOT). Regularized Inverse Optimal Transport (rIOT)32 refers to the problem of learning a pairwise dissimilarity matrix \({{\bf{C}}}\) from a given transport plan \({{\bf{P}}}\in \Pi \left({n}_{1},{n}_{2}\right)\), with a certain regularization on \({{\bf{C}}}\). In our case, it can be formalized as the following convex optimization problem:$${{\rm{rIO}}}{{{\rm{T}}}}_{\varepsilon }\left({{\bf{P}}}\right)={\min }_{{{\bf{a}}},{{\bf{b}}}}{{\rm{KL}}}\left({{\bf{P}}}{{|}}{{{\bf{Q}}}}_{\varepsilon }\left({{\bf{a}}},\, {{\bf{b}}}\right)\right)+{{R}}\left({{\bf{a}}},\, {{\bf{b}}}\right)$$
(10)
where \({{{\bf{Q}}}}_{\varepsilon }\left({{\bf{a}}},\, {{\bf{b}}}\right)\) is the balanced optimal transport plan achieving the optimum in \({{\rm{O}}}{{{\rm{T}}}}_{\varepsilon }^{+\infty }\left({{\bf{a}}},\, {{\bf{b}}},\, {c}_{{L}_{2}}\right)\) and \(R\) is a user-defined regularization. In our case, we want this regularization to force points coupled by \({{\bf{P}}}\) to completely overlap. We prove that in the particular case of balanced plans, which corresponds to setting \(\tilde{\tau }=+ \infty\) and \(\tau=+ \infty\) in our method, and with the regularizing function \(R\left({{\bf{a}}},{{\bf{b}}}\right)=\frac{1}{\varepsilon }{{\rm{O}}}{{{\rm{T}}}}_{\varepsilon }^{+\infty }\left({{\bf{a}}},{{\bf{b}}},{c}_{{L}_{2}}\right)+\frac{{\lambda }_{r}}{\varepsilon {\lambda }_{{IOT}}}{{{\rm{S}}}}_{\varepsilon }^{+\infty }\left({{\bf{a}}},{{\bf{b}}},{c}_{{L}_{2}}\right)\), minimizing \({{{\mathcal{L}}}}_{{{\rm{align}}}}\) with respect to \({{{\bf{Z}}}}^{\left(p\right)}\) and \({{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\) is equivalent to solving \({{\rm{rIO}}}{{{\rm{T}}}}_{\varepsilon }\left({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\right)\). More formally, we prove that:$${{\rm{argmin}}}_{{{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}}{{{\mathcal{L}}}}_{{{\rm{align}}}}\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)=\, {{\rm{argmin}}}_{{{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}}{{\rm{KL}}}\left({{\bf{P}}}{{|}}{{{\bf{Q}}}}_{\varepsilon }\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)\right) \\ +R\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)$$
(11)
The proof of Eq. (11) uses the following lemma (see Supplementary Note 1).

Lemma: Let \({{\bf{a}}}\) and \({{\bf{b}}}\) be two point clouds of size \({n}_{1}\) and \({n}_{2}\) respectively. Given \({{\bf{P}}}\in \Pi \left({n}_{1},{n}_{2}\right)\) and denoting as \({{{\bf{Q}}}}_{\varepsilon }\left({{\bf{a}}},{{\bf{b}}}\right)\) the balanced entropic optimal transport plan achieving the optimum in \(O{T}_{\varepsilon }^{+\infty }\left({{\bf{a}}},{{\bf{b}}},{c}_{{L}_{2}}\right)\), the following equality holds:$${{\rm{KL}}}\left({{\bf{P}}}|{{{\bf{Q}}}}_{\varepsilon }\left({{\bf{a}}},{{\bf{b}}}\right)\right)=\left\langle {{\bf{P}}},\log \left({{\bf{P}}}\right)\right\rangle+\frac{1}{\varepsilon }\left\langle {{\bf{P}}},{{{\bf{c}}}}_{{L}_{2}}\left({{\bf{a}}}{{,}}{{\bf{b}}}\right)\right\rangle -\frac{1}{\varepsilon }{{\rm{O}}}{{{\rm{T}}}}_{\varepsilon }^{+\infty }\left({{\bf{a}}},{{\bf{b}}},{c}_{{L}_{2}}\right)-1$$
(12)
Using the lemma (Eq. 12) and the definition of \(R\) we prove (Eq. 11) by rewriting \({rIO}{T}_{\varepsilon }\left({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\right)\) as:$${{\rm{rIO}}}{{{\rm{T}}}}_{\varepsilon }\left({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\right)= {\min }_{{{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}}\left\langle {{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)},\log {{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\right\rangle+\frac{1}{\varepsilon }\left\langle {{\bf{P}}},{{{\bf{c}}}}_{{L}_{2}}\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)\right\rangle \\ -\frac{1}{\varepsilon }{{\rm{O}}}{{{\rm{T}}}}_{\varepsilon }^{+\infty }\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)},{c}_{{L}_{2}}\right)-1 \\ +\left(\frac{1}{\varepsilon }{{\rm{O}}}{{{\rm{T}}}}_{\varepsilon }^{+\infty }\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)},{c}_{{L}_{2}}\right)+\frac{{\lambda }_{r}}{\varepsilon {\lambda }_{{IOT}}}{{{\rm{S}}}}_{\varepsilon }^{+\infty }\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)},{c}_{{L}_{2}}\right)\right)\\= {\min }_{{{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}}\left\langle {{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)},\log {{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\right\rangle+\frac{1}{\varepsilon }\left\langle {{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)},{{{\bf{c}}}}_{{L}_{2}}\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)\right\rangle \\ +\frac{{\lambda }_{r}}{\varepsilon {\lambda }_{{IOT}}}{{{\rm{S}}}}_{\varepsilon }^{+\infty }\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)},{c}_{{L}_{2}}\right)-1\\= {\min }_{{{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}}\left\langle {{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)},\log {{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\right\rangle+\frac{1}{\varepsilon {\lambda }_{{IOT}}}{{{\mathcal{L}}}}_{{{\rm{align}}}}\left({{{\bf{Z}}}}^{\left(p\right)},{{{\bf{Z}}}}^{\left({p}^{{\prime} }\right)}\right)-1$$
(13)
By noticing in (Eq. 13) that neither \(\langle {{{\bf{P}}}}^{(p,{p}^{{\prime} })},\log {{{\bf{P}}}}^{(p,{p}^{{\prime} })}\rangle\) nor the scaling factor \(\frac{1}{\varepsilon {\lambda }_{{IOT}}}\) depends on \(({{{\bf{Z}}}}^{(p)},{{{\bf{Z}}}}^{({p}^{{\prime} })})\), we obtain (Eq. 11).

Training details

Neural network architectures

The encoders and decoders are three-layer neural networks with ReLU activation functions, inspired by the architecture of the scVI VAE. We used a latent dimension of 16 for all datasets but adapted the number of neurons in hidden layers to the dimensionality of the datasets (see Supplementary Table 5). On scATAC and scRNA datasets, which contained thousands of features, we first reduced the dimension with PCA and used the 100 principal components as inputs of the encoder, while the decoder output a reconstruction in the original feature space, which was compared with the data prior to the PCA projection. For proteomic and smFISH modalities, which contained far fewer features, we reduced the number of layers of both encoders and decoders to two. We used the same decoder architecture as scVI, with the Zero Inflated Negative Binomial (ZINB) likelihood as the reconstruction loss, on scRNA data. For other modalities, however, we replaced the scVI decoder with a simple fully connected multi-layer perceptron and used the squared \({L}_{2}\) distance as the reconstruction loss.

Optimal transport solvers

We used the Python package POT to compute the plans \({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\) with the function ot.partial.partial_wasserstein. This implementation of unbalanced optimal transport uses the Total Variation distance for the penalization of marginals. It is parameterized by the Lagrangian multiplier \(m\) associated with \(\tilde{\tau }\) to control the unbalancedness of the plan. \(m\) is a parameter between 0 and 1 which quantifies how much mass is transported by the optimal plan. The use of TV to penalize the unbalanced relaxation allows \({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\) to completely ignore cell populations that are identified to have no equivalent in other modalities. We set \(m=0.5\) by default, which produces robust performance when prior information about the level of unbalancedness of the data is not available (see Supplementary Fig. 13). If additional information is available, \(m\) can be set accordingly to obtain better results, a higher value being better for situations where the data is more balanced across modalities. We use no entropic regularization in the computation of \({{{\bf{P}}}}^{\left(p,{p}^{{\prime} }\right)}\) (\(\tilde{\varepsilon }=0\)), as POT’s CPU implementation was already fast enough on our mini-batches for us to afford an exact solution rather than an approximation. For the unbalanced Sinkhorn divergence we used the Python package Geomloss90, which provides very efficient GPU implementations with a linear memory footprint. Indeed, while it cannot take a custom cost matrix as input as POT does, when the cost function is the squared \({L}_{2}\) distance (as is the case for our regularization term) Geomloss uses KeOps91 to implement efficient operations with a small memory footprint and automatic differentiation. Geomloss uses the KL to penalize the unbalanced relaxation.
We used the following hyperparameters: “p” = 2, “blur” = 0.01 (which corresponds to \(\varepsilon=0.0001\)), “scaling” = 0.8, “reach” = 0.3 (which corresponds to \(\tau=0.09\)).

Training hyperparameters

All models were optimized using the PyTorch Lightning library. We used the AdamW optimizer92 with a learning rate of 0.003. The batch size was set to 256 times the number of modalities. 20% of the dataset was held out for validation and early stopping was triggered when the validation loss did not improve for 40 epochs. As commonly done in the state-of-the-art29,31,35, we then use all samples (both train and validation) after training to compute cell embeddings and evaluation metrics. In our task, the goal is to encode the whole given dataset on which the model was trained. Unseen samples and generalizability are not relevant for this problem since the models we train are not meant to be used on different datasets at inference time. There is no information leakage either, since the ground-truth information (i.e. cell type labels and pairing information) used to evaluate the methods is not used during training. \({\lambda }_{p}\) was set to 1.0 for all modalities except ATAC, where it was set to 5.0 owing to the larger amount of information measured in the ATAC modality; this setting was used for all datasets without further tuning. For \({\lambda }_{{IOT}}\), which controls the impact of the IOT term inside the full loss, the default value was set to 0.01. Nonetheless, this term can be tuned depending on the reliability and quality of the connected features used to compute the IOT loss term. Indeed, in situations where stronger connections were available across modalities, e.g. when comparing gene expression measurements to gene expression measurements in the mouse cortex datasets, we increased the value of this parameter to 0.05, but we would not recommend using values outside of the [0.01, 0.1] range. The \({\lambda }_{r}\) hyperparameter controls the importance of the Sinkhorn regularization term, whose goal is to force corresponding populations across modalities to overlap in the latent space. Theoretically, the best value for \({\lambda }_{r}\) is the lowest one which allows a complete overlap of cells across modalities. We found 0.1 to be a good default value for this hyperparameter but decreased it to 0.03 when integrating 3 modalities at a time, since we are summing three regularization terms in this case. We would not recommend using a value outside of the [0, 0.5] range. See our documentation for more details on those hyperparameters at https://scconfluence.readthedocs.io/en/latest/.

Computational runtime

See Supplementary Table for a comparison of the computation times of the different methods on the PBMC 10X dataset (where approximately 10k cells were profiled for both modalities). Overall, scConfluence’s running time is comparable with the state-of-the-art. It is indeed slower than the three CPU methods (Seurat, Liger, and MultiMAP). Nonetheless, these methods operate on the full dataset in one go and therefore can’t be applied to very large datasets. On the other hand, scConfluence is faster than the other two neural network-based methods, all three being much more scalable due to their mini-batch approach.

Outputs of scConfluence

Once the scConfluence model is trained, it can be used to obtain two main kinds of output. Firstly, the trained encoders from each modality are used to encode all cells in the integrated latent space.
Those cell embeddings can then be used for any downstream analysis, such as clustering. Secondly, the model can be used to impute features across modalities by composing the encoder of one modality with the decoder of another modality. While this can be very useful, users should keep in mind that this technique does not provide exact “translations” and, because of that, this approach is not cycle-consistent, as shown in Supplementary Fig. 14.

Data preprocessing

Details of the different datasets used are available in Supplementary Table 1. Additionally, the proportions of cell types present in each modality for unpaired datasets are available in Supplementary Figs. 15–18.

scRNA preprocessing

We performed quality control filtering of cells on the proportion of mitochondrial gene expression, the number of expressed genes, and the total number of counts (using Muon’s filter_obs). Quality control filtering of genes was performed on the number of cells expressing the gene (using Muon’s filter_var). We then kept a copy of the raw counts before applying the log-normalization, which consists of normalizing counts for each cell so that they sum to 10,000 (using Scanpy’s normalize_total) and then log transforming them (using Scanpy’s log1p). To subselect genes, we took the union of the set of 3000 most variable genes in the normalized counts (using Scanpy’s highly_variable_genes with flavor = ‘seurat’) and the set of 3000 most variable genes in the raw counts (using Scanpy’s highly_variable_genes with flavor = ‘seurat_v3’). Finally, the log-normalized counts were used to compute the first 100 principal components, which served as the input of the encoder, while we kept a copy of the raw counts to evaluate the output of the decoder using the ZINB likelihood (except for the Patch-seq dataset, where we used a fully connected decoder with the squared \({L}_{2}\) loss on the log-normalized counts).

scATAC preprocessing

We performed quality control filtering of cells on the number of open peaks and the total number of counts (using Muon’s filter_obs). Quality control filtering of peaks was performed on the number of cells where the peak is open (using Muon’s filter_var). We did not apply any further subselection of the peaks after the quality control. Cells were normalized using the TF-IDF normalization (using Muon’s tfidf). Finally, the first 100 principal components of the normalized data were used as input to the encoder, while the unreduced TF-IDF normalized data was used to evaluate the output of the decoder with a squared \({L}_{2}\) loss.

Protein preprocessing (in CITE-seq and CyTOF)

Since the number of measured proteins is small and this data is less noisy than scRNA or scATAC, no quality control or feature selection was performed. We normalized the data using Muon’s implementation of the Center Log Ratio technique. This processed data was used for both the encoder and the decoder (with a squared \({L}_{2}\) loss).

smFISH preprocessing

We performed quality control filtering of cells on the proportion of mitochondrial gene expression, the number of expressed genes, and the total number of counts (using Muon’s filter_obs). Quality control filtering of genes was performed on the number of cells expressing the gene (using Muon’s filter_var). For the smFISH gene counts we used the same normalization technique as in the original study: we normalized by both the total number of molecules of all genes in each cell and the sum of each gene over all cells.
This processed data was used for both the encoder and the decoder (with a squared \({L}_{2}\) loss).

Patch-seq morphologies preprocessing

We retrieved the neuronal morphologies as 3D point clouds stored in .SWC files and did not have to perform any quality control since only high-quality morphologies could be reconstructed. We then used the NeuroM package93 to load the morphologies and project them onto the xy-plane (which is actually the xz plane since y and z were switched in the raw files), while coloring each point according to its neuronal compartment type (dendrites in red, axons in blue and soma in black). We then input those images into Google’s Inception v3 pre-trained deep neural network to extract features by retrieving the output of the last layer (with 2,048 dimensions). We then concatenated all these feature vectors into a matrix. This processed data was used for both the encoder and the decoder (with a squared \({L}_{2}\) loss).

Building the common features matrix

The first step to construct the cross-modality cost matrices consisted in obtaining the \({{{\bf{Y}}}}^{\left(p,{p}^{{\prime} }\right)}\) and \({{{\bf{Y}}}}^{\left({p}^{{\prime} },p\right)}\) matrices.

With scRNA and scATAC data, this consisted in obtaining the gene activity matrix and subsetting the two matrices to the set of common genes (a short sketch of this step is given after this list). We obtained the gene activities using different techniques depending on the metadata available for each dataset: for the cell lines data we used Maestro26; for the Multiome PBMC data we used Signac25; the gene activities for the Open Problems Multiome dataset had already been computed by the authors with Signac; and for the tri-omics PBMC dataset we ran the R script provided by the authors on the GitHub repository of their study (https://github.com/GreenleafLab/10x-scATAC-2019/blob/master/code/04_Run_Cicero_v2.R) using Cicero94.

With scRNA and Protein data, this consisted in manually inspecting the GeneCards website to find, for each protein, its associated coding gene, and then subsetting the RNA and Protein data to the pairs available in both modalities’ features.

With scATAC and Protein, we did the same as with RNA and Protein after obtaining the gene activities from ATAC.

With RNA and smFISH, since all genes measured in the smFISH experiment were also measured in the scRNA dataset we simply subset the scRNA genes to keep only the common genes.

With RNA and Patch-seq morphologies, since for both groups of cells we had access to the scRNA measurements we could directly use those as common features.
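As a small illustration of the scRNA + scATAC case above, the sketch below subsets an RNA matrix and a precomputed gene-activity matrix to their shared genes and then computes the correlation-distance cost used in the next subsection; the AnnData objects are random stand-ins, and the gene-activity computation itself (Signac, Maestro or Cicero) is assumed to have been done beforehand.

```python
# Build Y^(p,p') / Y^(p',p) for scRNA + scATAC by intersecting gene names, then
# compute the cross-modality correlation-distance cost (see next subsection).
import numpy as np
import anndata as ad
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
rna = ad.AnnData(rng.gamma(2.0, 1.0, size=(50, 6)))            # stand-in scRNA data
rna.var_names = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"]
gene_activity = ad.AnnData(rng.gamma(2.0, 1.0, size=(40, 5)))  # stand-in ATAC gene activities
gene_activity.var_names = ["GeneB", "GeneC", "GeneD", "GeneF", "GeneG"]

common_genes = sorted(set(rna.var_names) & set(gene_activity.var_names))
Y_rna = rna[:, common_genes].copy()             # Y^(p, p')
Y_atac = gene_activity[:, common_genes].copy()  # Y^(p', p)

# After the preprocessing/scaling described below, the cross-modality cost is
# the correlation distance between every pair of cells from the two modalities.
C_corr = cdist(Y_rna.X, Y_atac.X, metric="correlation")        # shape (50, 40)
```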

Building the biological cost matrix

Having obtained the converted data matrices \({{{\bf{Y}}}}^{\left(p,{p}^{{\prime} }\right)}\) and \({{{\bf{Y}}}}^{\left({p}^{{\prime} },p\right)}\), we then applied to each modality’s data (ATAC gene activities were treated as RNA) the same preprocessing as described earlier. We then scaled both \({{\bf{Y}}}\) matrices (except for the Patch-seq, since we were comparing scRNA data from the same dataset) and computed the cost matrix using the correlation distance between each pair of cells from the two modalities with scipy’s cdist.

Baselines

Seurat

We compare scConfluence to Seurat v3, the version of Seurat aimed at tackling diagonal integration. In practice, we used the R package Seurat v4.3.0, which finds anchor pairs between cells from different modalities by searching for Mutual Nearest Neighbors after having reduced the dimension of the data with Canonical Correlation Analysis (CCA). Before running the CCA, all modalities are converted to the same features, so we followed the same protocol as described above in the “Building the common features matrix” section, as it coincides with the indications described in the tutorials available in the Seurat documentation. We ran the Seurat method with default parameters, except for the Protein and smFISH datasets, where we set the latent dimension to 15 since the default number of latent dimensions was close to or even higher than the number of features measured. For gene imputation in the scRNA-smFISH experiment, we used the TransferData function as indicated in the documentation.

LIGER

We compare scConfluence to Liger using the R package rliger v1.0.0. Liger relies on integrative non-negative matrix factorization (NMF) to perform diagonal integration and also requires, as a first step, converting all modalities to common features. We did this step in the same way as for Seurat. For all datasets except the cell lines, we ran Liger with default parameters. On the cell lines simulated experiment, using the default setting of 30 latent dimensions resulted in the embeddings from different modalities being completely separated. Since the latent dimensions can be interpreted as clusters in NMF, we used this to set the number of latent dimensions to 3, which greatly improved Liger’s results. We could not tune other baselines similarly for this experiment, as the dimension of their latent space cannot be interpreted in the same way, and this did provide a competitive advantage to Liger since we used the knowledge that there were 3 main clusters in the dataset (which usually cannot be known when integrating new datasets). For the Protein and smFISH datasets we set the latent dimension to 15, since the default number of latent dimensions was close to or even higher than the number of features measured. For gene imputation in the scRNA-smFISH experiment, we used a knn regression with the scRNA embeddings serving as reference to predict the expression levels of held-out genes for smFISH embeddings.

MultiMAP

We compare scConfluence to MultiMAP using the Python package MultiMAP v0.0.1. MultiMAP is a generalization of the popular UMAP method95 to the unpaired multimodal setting. MultiMAP combines intra-modality distances with prior knowledge-based cross-modality distances to recover geodesic distances between all cells on a single latent manifold, which can then be projected onto \({{\mathbb{R}}}^{2}\) for visualization. Intra-modality distances are computed based on low-dimensional projections of the data after preprocessing.
We followed the documentation for this step, although dimension reduction was not necessary for the smFISH and proteomic datasets, where the number of features was already lower than a hundred. To compute distances across modalities, we converted pairs of modalities to a common feature space as described above in the “Building the common features matrix” section, as it coincides with the indications described in the tutorials available in the MultiMAP documentation. MultiMAP was run with default parameters on all datasets. For gene imputation in the scRNA-smFISH experiment, we used knn regression with the scRNA embeddings serving as a reference to predict the expression levels of held-out genes for smFISH embeddings.

Uniport

We compare scConfluence to Uniport using the Python package Uniport v1.2.2. Uniport uses one encoder which takes as input cells from all modalities converted to common features, while using modality-specific decoders to reconstruct each modality’s features. It also leverages unbalanced Optimal Transport in the latent space to force different modalities to mix in the latent space. For feature conversion, we proceed as described above in the “Building the common features matrix” section, as it coincides with the indications described in the tutorials available in the Uniport documentation. We ran Uniport with default parameters on all datasets. For gene imputation in the scRNA-smFISH experiment, we used the scRNA decoder to map the embeddings of smFISH cells to the scRNA domain as described in the Uniport documentation.

scGLUE

We compare scConfluence to scGLUE using the Python package scglue v0.3.2. scGLUE simultaneously trains one variational autoencoder per modality and one graph variational autoencoder which learns feature embeddings based on a prior knowledge-based guidance graph containing connections between features from different modalities. We followed scGLUE’s documentation to construct the guidance graph for scRNA and scATAC integration. For scRNA and Protein integration, where no documentation was available, we created a graph where each coding gene was linked to its associated protein. For scRNA and smFISH integration, we created a graph with links between each smFISH-measured gene and the same gene in the scRNA data. We ran scGLUE with default parameters on all datasets, except for the Protein and smFISH datasets, where we set the latent dimension to 15 since the default number of latent dimensions was close to or even higher than the number of features measured. For gene imputation in the scRNA-smFISH experiment, we used the scRNA decoder to map the embeddings of smFISH cells to the scRNA domain, as in other autoencoder-based methods.

GimVI

We compare scConfluence to GimVI using the Python package scvi-tools v0.16.4. GimVI is only applicable to scRNA and smFISH integration and simultaneously trains one autoencoder per modality while enforcing mixing between modalities in the latent space with a discriminative neural network trained in an adversarial way. We ran GimVI with default parameters and performed gene imputation as described in its documentation: we used the scRNA decoder to map the embeddings of smFISH cells to the scRNA domain.

Evaluation metrics

We used several scoring functions to assess the quality of the embeddings provided by each method throughout the benchmarking. All methods were run with five different random seeds and we reported the median score, except for Seurat, which contains no randomness and could therefore be run with one seed only.
Apart from FOSCTTM, all metrics are based on the k-nearest neighbor graph of embeddings. To give a complete overview of the performance of the methods, we computed those metrics with \(k\) taking all values in \(\{{\mathrm{5,10,15,20,35,50}}\}\). Those metrics are therefore displayed as curves whose x-axis corresponds to the values of \(k\). For the MultiMAP method, whose output is not an embedding but a graph whose edge weights represent similarities between integrated cells, we used this graph to compute nearest neighbors. Additionally, for the OP Multiome and OP Cite datasets, which contained more than 60,000 cells per modality, we evaluated the methods after training using a subset of 20,000 cell embeddings, as the metrics were too expensive to compute on the full results of each method. We suppose in the following that only two modalities are being integrated, as is the case for all benchmarked datasets.

Notations

For each cell \(i\) from the \(p\)th modality, we denote as \({N}_{k}^{\left(p\right)}\left(i\right)\) the \(k\) nearest neighbors of the cell’s embedding \({{{\bf{z}}}}_{{{\rm{i}}}}^{\left(p\right)}\) in the integrated latent space. \({c}_{i}\) denotes the cell type label of cell \(i\).

Purity

The purity score measures the average proportion of an integrated cell’s k-nearest neighbors that share the sample’s cell type annotation20. It thus varies between 0 and 1, with a higher score indicating a stronger performance. The score can be written as \(\frac{1}{{n}_{1}+{n}_{2}}\left({\sum }_{i=1}^{{n}_{1}}\frac{|\{t\in {N}_{k}^{(1)}(i)|{c}_{t}={c}_{i}\}|}{k}+\right. \left.{\sum }_{j=1}^{{n}_{2}}\frac{|\{t\in {N}_{k}^{(2)}(j)|{c}_{t}={c}_{j}\}|}{k}\right)\).

Transfer accuracy

The transfer accuracy is the accuracy of a k-nearest neighbor classifier using one modality as a reference and the other modality as a query. Since both modalities can serve as the reference and the query, we compute the results of both classifications and report the average of the two scores (a short sketch of the purity and transfer accuracy computations is given below).

Graph connectivity

The graph connectivity metric assesses how well cells with the same cell type label are connected in the kNN graph representation of the embeddings47. This score can be used to detect whether there exist discrepancies in the integrated latent space between cells from different modalities or experimental batches. For each different cell type \(c\), we denote as \({G}_{k}\left(c\right)\) the subset of the integrated kNN graph containing only cells with label \(c\). We compute for each cell type \(c\) the score \({s}_{c}\), equal to the size of the largest connected component in \({G}_{k}\left(c\right)\) divided by the number of cells with label \(c\). The final graph connectivity score is the average of the cell type scores \({s}_{c}\).

FOSCTTM

The Fraction Of Samples Closer Than the True Match (FOSCTTM) metric has been used before to evaluate diagonal integration methods on paired multimodal datasets where both modalities are measured in the same cells31,38,46. Since this metric is only designed for paired datasets, we can suppose that there are exactly \(n\) cells for each modality and that they are ordered such that the \(i\)th cell in the first modality is the true match of the \(i\)th cell in the second modality.
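A minimal sketch of the purity and transfer accuracy scores defined above (the FOSCTTM variants are described next). The embeddings, labels, and value of \(k\) are illustrative stand-ins; graph connectivity follows the same kNN-graph logic, with one connected-component computation per cell type.

```python
# Purity: fraction of same-label neighbours in the joint kNN graph.
# Transfer accuracy: average cross-modality kNN classification accuracy.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

rng = np.random.default_rng(0)
Z1, Z2 = rng.normal(size=(100, 16)), rng.normal(size=(80, 16))   # per-modality embeddings
y1, y2 = rng.integers(0, 3, 100), rng.integers(0, 3, 80)         # cell type labels
k = 15

def purity(Z1, y1, Z2, y2, k):
    Z, y = np.vstack([Z1, Z2]), np.concatenate([y1, y2])
    # k + 1 neighbours because each cell is its own nearest neighbour
    idx = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z, return_distance=False)[:, 1:]
    return (y[idx] == y[:, None]).mean()

def transfer_accuracy(Z1, y1, Z2, y2, k):
    acc12 = KNeighborsClassifier(n_neighbors=k).fit(Z1, y1).score(Z2, y2)
    acc21 = KNeighborsClassifier(n_neighbors=k).fit(Z2, y2).score(Z1, y1)
    return 0.5 * (acc12 + acc21)

print(purity(Z1, y1, Z2, y2, k), transfer_accuracy(Z1, y1, Z2, y2, k))
```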
For every cell from modality \(p\), FOSCTTM compares the distance to its true match with the distances to all other cells in the opposite modality, which we denote as \({p}^{{\prime} }\) (since \(p\in \{1,2\}\), \({p}^{{\prime} }=3-p\)). It is classically defined as:$${{\rm{FOSCTTM}}}=\frac{1}{2n}{\sum }_{i=1}^{n}\frac{{r}_{i}^{(1)}}{n}+\frac{{r}_{i}^{(2)}}{n}$$
(14)
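A minimal sketch of the classical FOSCTTM of (Eq. 14), with \({r}_{i}^{\left(p\right)}\) as defined just below; the paired embeddings are simulated, and the modified variant used in this work is described in the paragraph that follows.

```python
# Classical FOSCTTM (Eq. 14): for each cell, count how many cells of the
# opposite modality are closer than its true match; embeddings are simulated.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
Z1 = rng.normal(size=(200, 16))               # modality 1 embeddings
Z2 = Z1 + 0.1 * rng.normal(size=(200, 16))    # modality 2, true match = same row index

def foscttm(Z1, Z2):
    D = cdist(Z1, Z2)                          # cross-modality distances
    d_true = np.diag(D)                        # distance of each cell to its true match
    r1 = (D < d_true[:, None]).sum(axis=1)     # r_i^(1): closer cells in modality 2
    r2 = (D < d_true[None, :]).sum(axis=0)     # r_i^(2): closer cells in modality 1
    n = Z1.shape[0]
    return ((r1 + r2) / n).mean() / 2          # Eq. (14); 0 for a perfect pairing

print(foscttm(Z1, Z2))                         # close to 0 for well-aligned embeddings
```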
Here, \({r}_{i}^{\left(p\right)}=\left|\left\{j\in \left[1..n\right] \, s.t. \, d\left({{{\bf{z}}}}_{i}^{\left(p\right)},\, {{{\bf{z}}}}_{j}^{\left({p}^{{\prime} }\right)}\right) < \, d\left({{{\bf{z}}}}_{i}^{\left(p\right)},\, {{{\bf{z}}}}_{i}^{\left({p}^{{\prime} }\right)}\right)\right\}\right|\). However, in this paper, we compute it in a slightly different way. In the previous formula (Eq. 14), we replace \({r}_{i}^{\left(p\right)}\) with \({\tilde{r}}_{i}^{\left(p\right)}=\left|\left\{j\in \left[1..n\right]\setminus \left\{i\right\} \, s.t. \, d\left({{{\bf{z}}}}_{i}^{\left(p\right)},{{{\bf{z}}}}_{j}^{\left(p\right)}\right) < \, d\left({{{\bf{z}}}}_{i}^{\left(p\right)},\, {{{\bf{z}}}}_{i}^{\left({p}^{{\prime} }\right)}\right)\right\}\right|\). This means that rather than only assessing how close the true match of a cell’s embedding is compared to cells from the opposite modality, we assess how close the true match of the cell’s embedding is compared to all cells from both modalities. With this formula, we can simultaneously evaluate whether the mixing of the two modalities in the shared latent space is complete and verify that corresponding populations are accurately matched across modalities. Other metrics originally designed to assess batch effect correction, such as the batch entropy of mixing96, are often used to evaluate the mixing of modalities, but these do not penalize artificial alignments. The complete overlapping of cells from different modalities only matters if those cells are biologically equivalent, and this is assessed by our modified formulation of the FOSCTTM.

Cell type FOSCTTM

Additionally, we reported in Supplementary Fig. 6 another version of the FOSCTTM metric called “cell type FOSCTTM”, which only considers cells from the same cell type. This additional metric allows us to measure whether paired embeddings are closer to each other than other cells from the same cell type, thus enabling us to find out whether the integration learns sub-cell type structures. More formally, the cell type FOSCTTM is defined as:$${{\mbox{cell type FOSCTTM}}}=\frac{1}{2n}\sum \limits_{i=1}^{n}\frac{{\hat{r}}_{i}^{\left(1\right)}+{\hat{r}}_{i}^{\left(2\right)}}{2{\sum }_{k=1}^{n}{{{\bf{1}}}}_{\left({c}_{i}={c}_{k}\right)}}$$
(15)
where \({\hat{r}}_{i}^{\left(p\right)}={r}_{i}^{\left(p\right)}+\left|\left\{j\in \left[1..n\right]\setminus \left\{i\right\} \, s.t. \, d\left({{{\bf{z}}}}_{i}^{\left(p\right)},{{{\bf{z}}}}_{j}^{\left(p\right)}\right)\, < \, d\left({{{\bf{z}}}}_{i}^{\left(p\right)},{{{\bf{z}}}}_{i}^{\left({p}^{{\prime} }\right)}\right)\,{{\rm{and}}}\,{c}_{i}={c}_{j}\right\}\right|\).

Gene imputation

Both the scRNA-seq and smFISH datasets were downloaded using the scvi-tools helper functions load_cortex() and load_smfish(), which select only the overlapping cell types to ensure the consistency of the imputation task. We then divided the 33 genes measured in the smFISH experiment into eleven disjoint groups of 3 genes. Each of these groups corresponded to a different scenario where the three genes were removed from the smFISH data, held out, and then imputed by each method. Considering that the prior knowledge used to connect features across modalities was very reliable (as we could map genes in the smFISH experiment to themselves in the scRNA-seq data without any errors), we increased the weight of the IOT loss \({\lambda }_{{IOT}}\) from 0.01 to 0.05 in this experiment. For the two other autoencoder-based methods, Uniport and scGLUE, we used the same imputation technique as for scConfluence. For all other methods, we used knn regression using the scRNA-seq embeddings as a reference to predict the expression levels of held-out genes in smFISH embeddings. We used the Spearman correlation to quantify the similarity between the imputed values of a gene across all cells and the ground-truth held-out values. It is defined as the Pearson correlation between the rank values of those two vectors. As in the benchmarking section, we ran each method on each scenario with five initialization seeds (except Seurat, which contains no stochasticity). For each gene in each scenario, we kept the median Spearman correlation across the five seeds. We then reported one score per imputation scenario and plotted the eleven scores as a violin plot. The Spearman correlations of the three genes forming each scenario can be aggregated using either the average or the median, so we report both the average and median Spearman correlations (aSCC and mSCC). For the visualization of the imputations, we made use of the recorded 2D positions of the smFISH cells to plot the cells as they are located in the tissue. To better visualize spatial patterns of the imputations, we used the histogram equalization technique on the imputed values.

Tri-omics integration

We removed very rare cell types (containing less than 0.5% of the whole dataset) from all three datasets. This resulted in the removal of the ATAC “Immature NK”, “Basophil”, “Naive CD8 T2” and “Naive CD4 T2” cells as well as the CyTOF “Plasmablasts”, “cDC1”, “CLA + HLADR + NK cells”, “Activated gd T cells”, “Unclassified”, “HLADR + CD38− CD4 T cells”, “Cytotoxic CD4 T cells”, “DN T cells” and “Activated CD4 T cells”. We clustered the cell embeddings from all modalities using Scanpy’s Louvain with a resolution of 0.5. We then focused on the B cell and monocyte clusters, which we reclustered with a resolution of 0.2. To assess whether the subclusters we found were correctly aligned across modalities, we used the same method as described in scGLUE31, except that we only used scRNA and scATAC-derived gene activities. Indeed, including the CyTOF data in this analysis would have resulted in removing too many features to be able to design a statistical test with sufficient power.
For each of the two populations we subclustered (B cells and monocytes), we tested for significant overlap in cell type marker genes. For both gene expression and gene activities, the cell type markers were identified using Scanpy’s one-versus-rest Wilcoxon rank-sum test with the following criteria: FDR < 0.05 and log fold change > 0. The significance of marker overlap was determined by Fisher’s exact test.

Patch-seq

We removed the cells that were labeled as “unclassified” or which belonged to rare cell types (there were fewer than 15 cells labeled either as “Scng” or “NP”). For the scRNA modality, we did not use the scVI decoder with a ZINB loss but rather a fully connected decoder with an \({L}_{2}\) loss on the log-normalized counts, as it fitted the data better. Similarly to the smFISH/scRNA experiment, we increased the weight of the IOT loss \({\lambda }_{{IOT}}\) to 0.05. Indeed, our prior knowledge about connections between features across modalities consisted in connecting each gene with itself, as scRNA measurements were available for both modalities. However, in contrast with the smFISH experiment, where only a few dozen genes had been measured in both modalities, here all genes could be connected, making the prior information much stronger than in previous cases. This resulted in the Sinkhorn regularization not being necessary to obtain a good mixing of the two modalities, hence we set \({\lambda }_{r}\) to 0. Moreover, since the sets of cells from the two modalities are actually two independent subsets of the exact same dataset, we could expect very little heterogeneity between the cell populations present in each modality, and we increased the transported mass parameter \(m\) from 0.5 to 0.75 in this experiment.

Statistics and reproducibility

The detailed statistical tests were indicated in figures or associated legends where applicable. No statistical method was used to predetermine sample size. No data were excluded from the analyses. Complete randomization was performed for allocating groups. Our study does not involve group allocation that requires blinding.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
