Gene-level alignment of single-cell trajectories

Genes2Genes: a new alignment framework for single-cell trajectories
DP4 remains central to many sequence alignment algorithms. G2G performs gene-level pseudotime trajectory alignment between a single-cell reference and query by running DP alignment independently for all genes of interest. It aims to generate an optimal sequence of matched and mismatched time points for each gene. There are five possible alignment states between a pair of reference and query time points (Fig. 1b). For each time point in any gene trajectory, there is a respective expression distribution as observed via scRNA-seq measurements. G2G evaluates the distances between these distributions in reference and query to infer an optimal gene alignment.
Pairwise time series alignment for trajectory comparison
A trajectory is a continuous path of change through some feature space, along an axis of progression (such as time)59. In single-cell transcriptomics, this feature space is often defined by genes. A trajectory through a high-dimensional gene space describes the state of a cell as a function of time. The pseudotime of cells represents a discretization of their cell-state trajectory. Their genes form a multivariate time series of expression, with each gene as a univariate series. In this work, we consider pairwise alignment of univariate time series, which enables gene-level trajectory alignment.
Given two discrete time series (sequences), reference \(R\) and query \(Q\) of lengths (finite numbers of time points) \({|R|}\) and \({|Q|}\), their pairwise alignment describes an optimal sequential mapping between their time points. As an optimization problem, this has two key properties: (1) an optimal substructure; and (2) an overlapping set of subproblems, which make it solvable by DP. Property (1) means that the optimal alignment of any two prefixes \({R}_{1:j}\) and \({Q}_{1:i}\) depends on the optimality of three subalignments: (i) \({R}_{1:j-1}\) and \({Q}_{1:i-1}\); (ii) \({R}_{1:j-1}\) and \({Q}_{1:i}\); and (iii) \({R}_{1:j}\) and \({Q}_{1:i-1}\). Property (2) means that there exist prefix alignments that overlap. DP begins by optimizing prefix alignments, starting from the null (\(\varPhi\)) sequences, until it completes the alignment of the two entire sequences. This process computes each overlapping subproblem only once and reuses it through a memoization (history) matrix \({Hist}\). In standard DP alignment, \({Hist}(i,j)\) stores the optimal alignment cost of the two prefixes \({R}_{1:j}\) and \({Q}_{1:i}\), obtained by optimizing an objective function that quantifies the alignment through a set of recurrence relations. Once \({Hist}\) is computed, the optimal alignment can be retrieved by backtracking from \({Hist}({|Q|},{|R|})\) until reaching \({Hist}(0,0)\).
Preprocessing a trajectory time series by distributional interpolation
Interpolation of the time series is necessary to ensure smoothly changing and (at least approximately) uniformly distributed data, because non-interpolated data cannot guarantee a reliable alignment7,13.
We interpolate all reference and query gene expression trajectories before alignment by extending CellAlign's7 mean-based interpolation method to distributional interpolation. Given the pseudotime series \(t\) of (log1p-normalized) expression of gene \({g}_{j}\) in a single-cell dataset, we first transform the pseudotime axis to the [0,1] range using min–max normalization. Then, \(m\) equispaced artificial (interpolation) time points are defined, and for each interpolation time point \(t{\prime}\), we estimate a Gaussian distribution (of mean \({g}_{j}(t^{\prime} )_{{\mathrm {mean}}}\) and s.d. \({g}_{j}(t{\prime} )_{\mathrm {{s.d.}}}\)) using a Gaussian kernel-based weighted approach (a code sketch follows the extreme-case steps below). For each cell \(i\) annotated with pseudotime \({t}_{i}\), a weight is computed with respect to each \(t{\prime}\) as:$${w}_{i}=\exp\left(-\frac{{({t}_{i}-t{\prime} )}^{2}}{{\mathrm{window}}\_{\mathrm{size}}^{2}}\right),$$where \({\mathrm{window}}\_{\mathrm{size}}=0.1\). The equations below estimate \({g}_{j}(t{\prime} )_{{\mathrm {mean}}}\) and \({g}_{j}(t{\prime} )_{\mathrm {{s.d.}}}\):$${g}_{j}(t{\prime} )_{{\mathrm {mean}}}=\frac{1}{\varSigma {w}_{i}}\mathop{\sum }\limits_{i=1}^{n}{w}_{i}{g}_{j}({t}_{i})$$$${g}_{j}(t{\prime} )_{\mathrm {{s.d.}}}={c}_{t{\prime} }\sqrt{n\frac{\mathop{\sum }\nolimits_{i=1}^{n}{w}_{i}{[{g}_{j{\rm{\_}}{\mathrm {mean}}}-{g}_{j}({t}_{i})]}^{2}}{(n-1)\varSigma {w}_{i}}}$$where \({g}_{{j\_mean}}=\frac{{\sum }_{i=1}^{n}{g}_{j}({t}_{i})}{n}\), \(n\) is the total number of cells, and \({c}_{t{\prime} }\) is the expected weighted cell density at \(t{\prime}\), that is, \({c}_{t{\prime} }=\frac{{\sum }_{i=1}^{n}{w}_{i}}{n}\), used to account for cell abundance when estimating variance (otherwise very few cells may give very high variance). Next, we generate 50 random points from the Gaussian distribution \(N({g}_{j}(t{\prime} )_{\mathrm {{mean}}},{g}_{j}(t{\prime} )_{\mathrm {{s.d.}}})\) for each \(t{\prime}\), representing the interpolated distribution of single-cell gene expression. Note that we use the same predefined \(m\) for both reference and query. The interpolation has O\(({nm})\) time complexity because every cell contributes a weighted term at each \(t{\prime}\). For efficiency, one could subsample the datasets and/or restrict the contributing cells to the nearest neighborhood.
Extreme cases
When the smooth trajectory assumption breaks (for example, pluripotency genes suddenly dropping to zero mid-way after being highly expressed in early development), the interpolated variance might not reflect the true observation. Also, when a gene is (almost) zero-expressed, there is no distribution to model. To handle such extreme cases when interpolating reference and query gene trajectories, G2G applies the following steps:

For either trajectory, check for regions (adjacent interpolation points) of abrupt zero-expression (<3 cells expressed) of considerable length (exceeding 0.2) by sliding-window scanning; apply a common \(\sigma\) (10% of the minimum \(\sigma\) estimated across all interpolation points) to those regions so that they have a very low \(\sigma\) with zero mean.

If the gene is zero-expressed (<3 cells expressed) across pseudotime in either trajectory, apply 10% of the minimum \(\sigma\) estimated across all interpolation points of the other trajectory as the \(\sigma\) for the zero-expressed trajectory, and vice versa.
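Returning to the interpolation itself, the following is a minimal NumPy sketch of the kernel-weighted mean and s.d. at a single interpolation point (function and variable names are ours, not the G2G API; the toy data are for illustration only):

```python
import numpy as np

def interpolate_point(expr, pseudotime, t_prime, window_size=0.1):
    """Gaussian-kernel weighted mean and s.d. of one gene's log1p-normalized
    expression at interpolation point t_prime (pseudotime assumed min-max
    normalized to [0, 1]); a sketch of the formulas above, not the G2G code."""
    w = np.exp(-((pseudotime - t_prime) ** 2) / window_size**2)  # cell weights
    mean = np.sum(w * expr) / np.sum(w)                          # weighted mean
    n = len(expr)
    c = np.sum(w) / n                                            # expected weighted cell density
    sd = c * np.sqrt(n * np.sum(w * (expr.mean() - expr) ** 2) / ((n - 1) * np.sum(w)))
    return mean, sd

rng = np.random.default_rng(0)
pseudotime = rng.uniform(0, 1, 300)                  # toy pseudotime for 300 cells
expr = np.log1p(rng.poisson(2, 300).astype(float))   # toy log1p-normalized expression
m = 14                                               # number of interpolation points
for t_prime in np.linspace(0, 1, m):
    mu, sd = interpolate_point(expr, pseudotime, t_prime)
    samples = rng.normal(mu, sd, size=50)            # 50 draws represent the distribution
```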

A new DP algorithm for gene-level trajectory alignment
Our DP algorithm is inspired by the biological sequence alignment approaches discussed in the related literature22,23,24. It generates an alignment between the \(R\) and \(Q\) expression time series of a specified gene by adapting Gotoh's algorithm16 and DTW5 to accommodate five alignment states (Fig. 1b), that is, one-to-one match (M), one-to-many match (V), many-to-one match (W), insertion (I) and deletion (D), between a pair of \(R\) and \(Q\) time points. The five-state space is denoted by \(\varOmega\) = [M, V, W, D, I]; V and W represent warps. DTW5 is extensively used to align time series. Sankoff and Kruskal (1983)13 previously discussed how to capture both warps and indels within a single algorithm and provided DP recurrences for evaluating all states in \(\varOmega\) to assign an optimal state to each pair of \(R\) and \(Q\) time points. Extending this further, we implemented Gotoh's algorithm (of O\(({|R||Q|})\) time complexity) to generate an optimal five-state alignment string for \(R\) and \(Q\) using:

a Bayesian information-theoretic distance measure between two expression distributions under the minimum message length inference criterion13,26.

a five-state machine that models state transitions along an alignment.

The DP scoring scheme evaluates every pair of \(R\) time point \(j\) (\({R}_{j}\)) and \(Q\) time point \(i\) (\({Q}_{i}\)) by computing two costs: (1) the cost of matching them (denoted by \({\mathrm{Cost}}_{{\mathrm {match}}}(i,j)\)), based on their interpolated gene expression distributions; and (2) the cost of assigning an alignment state \(x\in \varOmega\) to them.
The DP scoring scheme
The cost of match between \({R}_{j}\) and \({Q}_{i}\)
\({R}_{j}\) and \({Q}_{i}\) are expected to match if they have similar expression distributions. To score their match likelihood, we define a cost (distance measure) between the two expression distributions of \({R}_{j}\) and \({Q}_{i}\), modeled as two Gaussians. We compute this over the interpolated single-cell expression data at \({R}_{j}\) (denoted by \(R(j\;)\) under \(N({\mu }_{R(j)},{\sigma }_{R(j)})\)) and \({Q}_{i}\) (denoted by \(Q(i)\) under \(N({\mu }_{Q(i)},{\sigma }_{Q(i)})\)) with respective mean (\(\mu\)) and s.d. (\(\sigma\)) statistics. Accordingly, if \({D}_{R(j)}={{\{d}_{k}\}}_{k=1}^{{|R}(j)|}\) and \({D}_{Q(i)}={{\{d}_{k}\}}_{k=1}^{{|Q}(i)|}\) are their expression data vectors, then: \({d}_{k} \sim N({\mu }_{R(j)},{\sigma }_{R(j)})\) \({\forall\;{d}_{k}\in D}_{R(j)}\) and \({d}_{k} \sim N({\mu }_{Q(i)},{\sigma }_{Q(i)})\;{\forall\; {{d}}_{k}\in D}_{Q(i)}\).Hereafter, we denote \(N({\mu }_{R(j)},{\sigma }_{R(j)})\) by \({N}_{R(j)}\) and \(N({\mu }_{Q(i)},{\sigma }_{Q(i)})\) by \({N}_{Q(i)}\).We implement the cost function, \({\mathrm{Cost}}_{\rm{match}}(i,j)\), to consider both data (\({D}_{R(j)}\) and \({D}_{Q(i)}\)) and models (\({N}_{R(j)}\) and \({N}_{Q(i)}\)) when computing the distance between \(R(j)\) and \(Q(i)\), using the MML criterion26,27. See Fig. 2 (top left) and Supplementary Fig. 1a for illustrations of our MML framework.
Primer on MML
MML is an inductive inference paradigm for model comparison and selection, grounded on Bayesian statistics and information theory. Given a hypothesis (model) \(H\) and data \(D\), it frames inference as an imaginary message transmission in which a sender jointly encodes \(H\) and \(D\) for lossless decoding at a recipient's side. Bayes' theorem defines their joint probability as:$$\Pr (H,D)=\Pr (H)\Pr (D{\rm{|}}H)=\Pr (D) \Pr (H{\rm{|}}D)$$
Separately, Shannon information28 (\(I\)) defines the optimal encoding length of an event \(E\) with probability \(\Pr (E)\) as:$$I(E)=-{\log }_{e}(\Pr (E))$$measured in nits. Applying this to Bayes' theorem gives the information needed to encode \(H\) and \(D\) jointly as:$$I(H,D)=I(H)+I(D{\rm{|}}H)$$
(1)

This gives a two-part total encoding length for \(H\) and \(D\), where \(I(H)\) quantifies the information of \(H\), and \(I({D|H})\) quantifies the information of \(D\) given \(H\). When two hypotheses, \({H}_{1}\) and \({H}_{2}\), describe \(D\), MML enables selecting the better hypothesis with a tradeoff between model complexity and model fit, by evaluating the compression statistic \(\varDelta =I({H}_{1},D)-I({H}_{2},D)\), which gives the log posterior odds ratio between them:$$\varDelta =\log\left(\frac{\Pr ({H}_{2},D)}{\Pr ({H}_{1},D)}\right)=\log\left(\frac{\Pr (D)\Pr ({H}_{2}{\rm{|}}D)}{\Pr (D)\Pr ({H}_{1}{\rm{|}}D)}\right)=\log\left(\frac{\Pr ({H}_{2}{\rm{|}}D)}{\Pr ({H}_{1}{\rm{|}}D)}\right)$$
(2)

\(\varDelta\)>0 implies that \({H}_{2}\) is \({e}^{\varDelta }\) times more likely than \({H}_{1}\) and vice versa.
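As a toy numerical illustration (the message lengths below are made up for demonstration only):

```python
import numpy as np

I_H1_D = 1000.0                 # hypothetical total encoding length under H1 (nits)
I_H2_D = 995.0                  # hypothetical total encoding length under H2 (nits)

delta = I_H1_D - I_H2_D         # compression statistic, here 5 nits
posterior_odds = np.exp(delta)  # H2 is ~148 times more likely a posteriori than H1
```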

Casting \({\mathrm{Cost}}_{\mathrm{match}}(i,j)\) under MML
Given the data \(D\) (\({D}_{R(j)}\) and \({D}_{Q(i)}\)) and Gaussian models (\({N}_{R(j)}\) and \({N}_{Q(i)}\)), we formulate two hypotheses:

Hypothesis A (\({R}_{j}\) and \({Q}_{i}\) match): explains \(D\) with a single, representative model \(N(\;{\mu }_{* },{\sigma }_{* })\) denoted by \({N}_{* }\) (= either \(\,{N}_{R(j)}\) or \({N}_{Q(i)}\)),

Hypothesis \(\varPhi\) (\({R}_{j}\) and \({Q}_{i}\) mismatch): explains \({D}_{R(j)}\) with \({N}_{R(j)}\) and \({D}_{Q(i)}\) with \({N}_{Q(i)}\), independently,

and compute, \(I(A,D)\) and \(I(\varPhi ,D)\) according to equation (1):$$I(A,D)=I(A)+I(D{\rm{|}}A)$$
(3)
$$I(\varPhi ,D)=I(\varPhi )+I(D{\rm{|}}\varPhi )$$
(4)
where, \(A\) = \({[N}_{* }]\) and \(\varPhi =[{N}_{R(j)},{N}_{Q(i)}]\). Accordingly, equation (3) becomes:$$I(A,D)=I({N}_{* })+I(D{\rm{|}}{N}_{* })=I(\;{\mu }_{* },{\sigma }_{* })+I(D{\rm{|}}{\mu }_{* },{\sigma }_{* })$$
Similarly, equation (4) becomes:$$\begin{array}{l}I(\varPhi ,D)=I\left({N}_{R(j)},{N}_{Q(i)}\right)+I\left(D{{\big|}}{N}_{R(j)},{N}_{Q(i)}\right)\\\qquad\quad\;=I\left({N}_{R(j)}\right)+I\left({N}_{Q(i)}\right)+I\left(D{{\big|}}{N}_{R(j)},{N}_{Q(i)}\right)\\\qquad\quad\,=I\left(\;{\mu }_{R(j)},{\sigma }_{R(j)}\right)+I\left(\;{\mu }_{Q(i)},{\sigma }_{Q(i)}\right)\\\qquad\qquad+I\left({D}_{R(j)}{{\big|}}{\mu }_{R(j)},{\sigma }_{R(j)}\right)+I\left({D}_{Q(i)}{{\big|}}{\mu }_{Q(i)},{\sigma }_{Q(i)}\right)\end{array}$$
The next section describes how the terms of equations (3) and (4) are calculated. We normalize \(I(\varPhi ,D)\) and \(I(A,D)\) to compute per-datum information (entropy):$$I{\left(A,D\right)}_{\rm{entropy}}=\frac{I(A,D)}{{\rm{|}}{D}_{R(j)}{\rm{|}}+{\rm{|}}{D}_{Q(i)}{\rm{|}}}$$$$I{\left(\varPhi ,D\right)}_{\rm{entropy}}=\frac{I(\varPhi ,D)}{{\rm{|}}{D}_{R(j)}{\rm{|}}+{\rm{|}}{D}_{Q(i)}{\rm{|}}}$$
Note that the \(I{\left(A,D\right)}_{\rm{entropy}}\) measure is made symmetric as:$$I(A,D)_{\rm{entropy}}=\frac{I({N}_{R\left(j\right)},D)_{\rm{entropy}}+I({N}_{Q\left(i\right)},D)_{\rm{entropy}}}{2}{\rm{nits}}\; {\rm{per}}\;{\rm{datum}}$$
We then define the compression statistic \(\varDelta\) as our \({\rm{Cost}}_{\rm{match}}(i,j)\):$$\varDelta =I{\left(A,D\right)}_{\rm{entropy}}-I{\left(\varPhi ,D\right)}_{\rm{entropy}}$$
When \(R(j)\) and \(Q(i)\) are significantly dissimilar, \(I(A,D)_{\rm{entropy}}\) \(> I(\varPhi ,D)_{\rm{entropy}}\). Thus, \({\rm{Cost}}_{\rm{match}}(i,j)\) increases when distributions diverge (Extended Data Fig. 1b and Supplementary Fig. 1b,c).

Computing the Shannon information of any Gaussian model and data
The \({\mathrm{Cost}}_{\mathrm{match}}(i,j)\) computation uses the MML Wallace–Freeman approximation26,28 defined for Gaussian distributions27,60. As in equation (1), for any dataset \(D={{\{x}_{k}\}}_{k=1}^{X}\) and hypothesis \(H\) describing \(D\) under \(N(\mu ,\sigma )\) with parameters \(\vec{\theta }=(\mu ,\sigma )\), the information of \(H\) and \(D\) is:$$I(H,D)=I(\vec{\theta },D)=I(\vec{\theta })+I(D{\rm{|}}\vec{\theta })$$expanding to:$$I(\vec{\theta },D)=\frac{d}{2}\log ({\kappa }_{d})-\log [h(\vec{\theta })]+\frac{1}{2}\log (\det [{\rm{Fisher}}(\vec{\theta })])+{\rm{L}}(\vec{\theta })+\frac{d}{2},$$where \(d\) is the number of free parameters (\(d=2\) for a Gaussian) and \({\kappa }_{d}\) is the Conway lattice constant61 (\({\kappa }_{d}=\frac{5}{36\sqrt{3}}\) for \(d=2\)); \(h(\vec{\theta })\) is the prior over \(\vec{\theta }\). \(\mu\) and \(\log (\sigma )\) are defined with uniform priors over predefined ranges of length \({R}_{\mu }\) and \({R}_{\sigma }\), respectively:$$h(\vec{\theta })=h(\mu )\;h(\sigma )=\left(\frac{1}{{R}_{\mu }}\right)\left(\frac{1}{\sigma {R}_{\sigma }}\right)$$$$\Rightarrow I[h(\vec{\theta })]=\log (\sigma )+\log ({R}_{\mu }{R}_{\sigma })$$
We use \({R}_{\mu }\) = 15.0 and \({R}_{\sigma }\,\)= 3.0 as reasonable for log-normalized expression (for example, across 20,240 genes, we observe ~8.1 maximum expression and ~1.7 maximum \(\sigma\) in the pan fetal reference).
\(L(\vec{\theta })\) is the negative log-likelihood:$$L(\vec{\theta })=X\log (\sigma )+\frac{X}{2}\log (2\pi )+\frac{1}{2{\sigma }^{2}}\mathop{\sum }\limits_{i=1}^{X}{\left({x}_{i}-\mu \right)}^{2}-\mathop{\sum }\limits_{i=1}^{X}\log (\epsilon )$$where \(\epsilon\) is the precision of a datum measurement (taken as \(\epsilon\) = 0.001). \(\det [{\rm{Fisher}}(\vec{\theta })]\) is the determinant of the expected Fisher information matrix (the matrix of the expected second derivatives of the negative log-likelihood function), which has the closed form \(\frac{{2X}^{\,2}}{{\sigma }^{4}}\).
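The pieces above can be assembled into a sketch of the match cost. This is a hedged illustration only (function names and structure are ours; the default range and precision constants are taken from the text; the actual G2G implementation may differ in detail):

```python
import numpy as np

LOG_KAPPA_2 = np.log(5.0 / (36.0 * np.sqrt(3.0)))   # Conway lattice constant, d = 2

def mml_gaussian_info(data, mu, sigma, R_mu=15.0, R_sigma=3.0, eps=1e-3):
    """Two-part Wallace-Freeman message length (nits) of a Gaussian model
    (mu, sigma) plus the data encoded with it, following the formulas above.
    Assumes sigma > 0 (see the extreme-case handling described earlier)."""
    X = len(data)
    neg_log_prior = np.log(sigma) + np.log(R_mu * R_sigma)          # -log h(theta)
    half_log_fisher = 0.5 * np.log(2.0 * X**2 / sigma**4)           # 0.5 log det Fisher
    nll = (X * np.log(sigma) + 0.5 * X * np.log(2.0 * np.pi)
           + np.sum((data - mu) ** 2) / (2.0 * sigma**2)
           - X * np.log(eps))                                       # L(theta)
    return LOG_KAPPA_2 + neg_log_prior + half_log_fisher + nll + 1.0  # +d/2 with d=2

def cost_match(D_R, D_Q):
    """Delta = I(A, D)_entropy - I(Phi, D)_entropy for two interpolated
    expression vectors, each modeled as a Gaussian (a sketch, not the G2G API)."""
    mu_R, sd_R = D_R.mean(), D_R.std(ddof=1)
    mu_Q, sd_Q = D_Q.mean(), D_Q.std(ddof=1)
    n = len(D_R) + len(D_Q)
    D = np.concatenate([D_R, D_Q])
    # Hypothesis Phi (mismatch): each dataset encoded with its own Gaussian
    I_phi = (mml_gaussian_info(D_R, mu_R, sd_R)
             + mml_gaussian_info(D_Q, mu_Q, sd_Q)) / n
    # Hypothesis A (match): all data encoded with one representative Gaussian,
    # symmetrized over the two candidate models
    I_A = (mml_gaussian_info(D, mu_R, sd_R)
           + mml_gaussian_info(D, mu_Q, sd_Q)) / (2 * n)
    return I_A - I_phi
```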
The cost of alignment state assignment for \({{\boldsymbol{R}}}_{{\boldsymbol{j}}}\) and \({{\boldsymbol{Q}}}_{{\boldsymbol{i}}}\)
The DP scoring scheme also involves a cost of assigning an alignment state \(x\in \varOmega\) = [M, W, V, D, I] to \({R}_{j}\) and \({Q}_{i}\). This is computed as the Shannon information28 required to encode state \(x\) given the previous state \(y\) assigned to the preceding time points, \(I({x|\;y})=-{\log }_{e}(\Pr ({x|\;y}))\). We define a five-state machine (middle left of Fig. 2) to specify these conditional probabilities (also called state transitions), by extending the three-state machine22,23 of [M, I, D] to accommodate the [W, V] warp states (Fig. 1b). We enforce symmetry by treating I and D, and W and V, equivalently, and we prohibit the transitions I → W and D → V, as they imply a single M; however, D → W and I → V are allowed, as a warp match can follow an insertion or deletion. All outgoing transitions of each state sum to a probability of 1. Overall, there are 23 state transitions in this machine, yet only three free transition probability parameters \([\Pr\)(M | M), \(\Pr\)(I | I) and \(\Pr\)(M | I)\(]\) due to its symmetry and characteristics. These probabilities control the expected lengths of matches and mismatches (reflecting an affine gap scheme). In this work, we chose \([\Pr\)(M | M) = 0.99, \(\Pr\)(I | I) = 0.1, \(\Pr\)(M | I) = 0.7\(]\) as the default in G2G, based on a grid search that minimized the alignment inaccuracy rate in our simulated dataset 1. An interesting future direction would be to infer them using an added optimization layer on top of the DP optimization. Altogether, the G2G DP scoring scheme utilizes \({\mathrm{Cost}}_{\rm{match}}(i,j\;)\) and the five-state machine (with state-assignment costs evaluated as \(I\left(x|y\right)\,\forall\;{x},y\in \varOmega\)) to define the DP recurrence relations.
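A minimal sketch of evaluating state-assignment costs as Shannon information over a transition probability table follows. The table below is an illustrative subset only; the full 23-transition machine and the derivation of its probabilities from the three free parameters follow the symmetry constraints described above:

```python
import numpy as np

# Default free parameters of the five-state machine
P_mm, P_ii, P_mi = 0.99, 0.1, 0.7

# Illustrative subset of the transition table Pr(x | y); the full machine has
# 23 transitions, with I/D and W/V treated equivalently by symmetry.
P = {
    ("M", "M"): P_mm,
    ("I", "I"): P_ii,   # equals Pr(D | D) by symmetry
    ("M", "I"): P_mi,   # equals Pr(M | D) by symmetry
}

def state_cost(x, y):
    """Shannon information (nits) of assigning state x given previous state y."""
    return -np.log(P[(x, y)])

print(state_cost("M", "M"))  # ~0.01 nits: continuing a match is cheap
print(state_cost("I", "I"))  # ~2.30 nits: continuing an insertion costs more
```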
DP recurrence relations
We define the \(\{\rm{Hist}_{x}\}_{\forall x\in \varOmega }\) matrices corresponding to the alignment states in \(\varOmega\). Every \({\rm{Hist}}_{x}\) has \(({|Q|}+1)\times ({|R|}+1)\) dimensions, where the columns and rows correspond to \(R\) and \(Q\) time points, respectively. \({\rm{Hist}}_{x}(i,j)\) stores the optimal alignment cost of the prefixes \({R}_{1:j}\) and \({Q}_{1:i}\) ending in state \(x\). The DP recurrences to compute \({\rm{Hist}}_{x}(i,j)\) for \(i > 0,j > 0\) are:$${\rm{Hist}}_{\rm{{M}}}\left(i,j\right)=\min \left\{\begin{array}{c}{\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\rm{{M}}}\left(i-1,\;j-1\right)+I\left(\rm{{M}}|{\rm{{M}}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\rm{{W}}}\left(i-1,\;j-1\right)+I\left({\rm{{M}}}|{\rm{{W}}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\rm{{V}}}\left(i-1,\;j-1\right)+I\left({\rm{{M}}}|{\rm{{V}}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\rm{{D}}}\left(i-1,\;j-1\right)+I\left({\rm{{M}}}|{\rm{{D}}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\rm{{I}}}\left(i-1,\;j-1\right)+I\left({\rm{{M}}}|\rm{{I}}\right)\end{array}\right.$$$${\rm{Hist}}_{\rm{{W}}}\left(i,j\right)=\min \left\{\begin{array}{c}{\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\,\rm{{M}}}\left(i,\;j-1\right)+I\left({\rm{{W}}}|{\rm{{M}}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\,{\rm{{W}}}}\left(i,\;j-1\right)+I\left({\rm{{W}}}|{\rm{{W}}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\,{\rm{{V}}}}\left(i,\;j-1\right)+I\left({\rm{{W}}}|\rm{{V}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\,{\rm{{D}}}}\left(i,\;j-1\right)+I\left({\rm{{W}}}|{\rm{{D}}}\right)\end{array}\right.$$$${\rm{Hist}}_{\rm{{V}}}\left(i,j\right)=\min \left\{\begin{array}{c}{\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\rm{{M}}}\left(i-1,\;j\right)+I\left({\rm{{V}}}|{\rm{{M}}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\rm{{W}}}\left(i-1,\;j\right)+I\left({\rm{{V}}}|{\rm{{W}}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\rm{{V}}}\left(i-1,\;j\right)+I\left({\rm{{V}}}|{\rm{{V}}}\right)\\ {\rm{Cost}}_{\rm{match}}\left(i,\;j\right)+{\rm{Hist}}_{\rm{{I}}}\left(i-1,\;j\right)+I\left({\rm{{V}}}|{\rm{{I}}}\right)\end{array}\right.$$$${\rm{Hist}}_{\rm{{D}}}\left(i,j\right)=\min \left\{\begin{array}{c}{\rm{Hist}}_{\rm{{M}}}\left(i,\;j-1\right)+I\left({\rm{{D}}}|{\rm{{M}}}\right)\\ {\rm{Hist}}_{\rm{{W}}}\left(i,\;j-1\right)+I\left({\rm{{D}}}|{\rm{{W}}}\right)\\ {\rm{Hist}}_{\rm{{V}}}\left(i,\;j-1\right)+I\left({\rm{{D}}}|{\rm{{V}}}\right)\\ {\rm{Hist}}_{\rm{{D}}}\left(i,\;j-1\right)+I\left({\rm{{D}}}|{\rm{{D}}}\right)\\ {\rm{Hist}}_{\rm{{I}}}\left(i,\;j-1\right)+I\left({\rm{{D}}}|{\rm{{I}}}\right)\end{array}\right.$$$${\rm{Hist}}_{\rm{{I}}}\left(i,j\right)=\min \left\{\begin{array}{c}{\rm{Hist}}_{\rm{{M}}}\left(i-1,\;j\right)+I\left({\rm{{I}}}|{\rm{{M}}}\right)\\ {\rm{Hist}}_{\rm{{W}}}\left(i-1,\;j\right)+I\left({\rm{{I}}}|{\rm{{W}}}\right)\\ {\rm{Hist}}_{\rm{{V}}}\left(i-1,\;j\right)+I\left({\rm{{I}}}|{\rm{{V}}}\right)\\ {\rm{Hist}}_{\rm{{D}}}\left(i-1,\;j\right)+I\left({\rm{{I}}}|{\rm{{D}}}\right)\\ {\rm{Hist}}_{\rm{{I}}}\left(i-1,\;j\right)+I({\rm{{I}}|{\rm{{I}}}})\end{array}\right.$$
They are initialized as:$${\rm{Hist}}_{x}{(i,j)}_{x\in \{{\rm{M}},{\rm{W}},{\rm{V}}\}}=\left\{\begin{array}{cc}0; & i > 0,\;j > 0\\ \infty ; & {\rm{otherwise}}\end{array}\right.$$$${\rm{Hist}}_{\rm{I}}\left(i,0\right)={\rm{Hist}}_{\rm{I}}\left(i-1,0\right)+I\left({\rm{I}}{\rm{|}}{\rm{I}}\right)$$$${\rm{Hist}}_{\rm{D}}\left(0,j\right)={\rm{Hist}}_{\rm{D}}\left(0,j-1\right)+I\left({\rm{D}}{\rm{|}}{\rm{D}}\right)$$
Note that for the case of \(i=1\) and \(j=1\) (before the first state transition), either a uniform transition cost (\(I\)(M) = \(I\)(I) = \(I\)(D) \(=-{\log }_{e}(1/3)\)) or a setting with a lower cost for M can be assigned.
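A condensed sketch of filling the five matrices under these recurrences follows. This is a simplified illustration (names are ours; the handling of the very first transition is glossed over):

```python
import numpy as np

STATES = ["M", "W", "V", "D", "I"]
# Allowed predecessor states for each state (I -> W and D -> V are prohibited)
PREDS = {"M": STATES, "W": ["M", "W", "V", "D"], "V": ["M", "W", "V", "I"],
         "D": STATES, "I": STATES}
# (Q step, R step) consumed by each state: M is diagonal, W/D consume an R
# time point, V/I consume a Q time point
STEP = {"M": (1, 1), "W": (0, 1), "V": (1, 0), "D": (0, 1), "I": (1, 0)}

def dp_fill(cost_match, trans_cost, nQ, nR):
    """Fill the five (nQ+1) x (nR+1) Hist matrices. cost_match[i, j] holds
    Cost_match(i, j) for i, j >= 1; trans_cost(x, y) returns I(x | y)."""
    H = {x: np.full((nQ + 1, nR + 1), np.inf) for x in STATES}
    for x in STATES:                     # simplified origin cell; G2G treats the
        H[x][0, 0] = 0.0                 # first transition as noted above
    for i in range(1, nQ + 1):           # leading insertions along the Q axis
        H["I"][i, 0] = H["I"][i - 1, 0] + trans_cost("I", "I")
    for j in range(1, nR + 1):           # leading deletions along the R axis
        H["D"][0, j] = H["D"][0, j - 1] + trans_cost("D", "D")
    for i in range(1, nQ + 1):
        for j in range(1, nR + 1):
            for x in STATES:
                di, dj = STEP[x]
                match = cost_match[i, j] if x in ("M", "W", "V") else 0.0
                H[x][i, j] = match + min(H[y][i - di, j - dj] + trans_cost(x, y)
                                         for y in PREDS[x])
    return H   # backtrack from the minimum of H[x][nQ, nR], as described next
```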
Once matrices are complete, G2G generates the optimal alignment \({Y}^{*}\) for \(R\) and \(Q\) as a five-state string \({Y}_{{str}}^{*}\) (where character \({Y}_{{str}}^{*}\left[k\right]\in \varOmega \,\forall k\in {{\mathbb{Z}}}_{[0,|{Y}^{*}|]}\)), by backtracking from:$$\min \left\{\begin{array}{c}{\rm{Hist}}_{\rm{{M}}}\left({\rm{|}}Q{\rm{|}},{\rm{|}}R{\rm{|}}\right)\\ {\rm{Hist}}_{\rm{{W}}}\left({\rm{|}}Q{\rm{|}},{\rm{|}}R{\rm{|}}\right)\\ {\rm{Hist}}_{\rm{{V}}}\left({\rm{|}}Q{\rm{|}},{\rm{|}}R{\rm{|}}\right)\\ {\rm{Hist}}_{\rm{{D}}}\left({\rm{|}}Q{\rm{|}},{\rm{|}}R{\rm{|}}\right)\\ {\rm{Hist}}_{{{I}}}\left({{|}}Q{{|}},{{|}}R{{|}}\right)\end{array}\right.$$
The optimal cost landscape matrix \(L\) can be constructed as:$$L(i,j)={\min }_{\forall x\in \varOmega }\{{\rm{Hist}}_{x}(i,j)\}$$
\({Y}^{\;* }\) describes the set of \(R\) and \(Q\) time point pairs matched and the set of \(R\) and \(Q\) time points mismatched, sequentially. Let \({T}_{\mathrm {{matched}}}\) be the set of matched time point pairs \((i,j)\) in \({Y}^{\,* }\). The total alignment cost of \({Y}^{\,* }\) is the sum of the total match cost (\({C}_{{\mathrm {match}}}\)) and the total state-assignment cost (\({C}_{{\mathrm {state}}}\)), where:$${C}_{\rm{match}}({Y}^{\,* })=\sum _{\forall (i,\;j)\in {T}_{\rm{matched}}}{\rm{Cost}}_{\rm{match}}(i,j)$$$${C}_{\rm{state}}({Y}^{\,* })=\mathop{\sum }\limits_{k=1}^{{{\rm{|}}Y}^{\,* }{\rm{|}}}I({Y}^{\,* }[k]{\rm{|}}{Y}^{\,* }[k-1])$$
Overall, \({Y}^{\,* }=\arg\min_{\;\forall Y \in \mathbf{Y}} \{{C}_{\rm{match}}(Y)+{C}_{\rm{state}}(Y)\}\), where Y is the space of all possible five-state alignments.
Note that \(\mathrm{Cost}_{{\rm{match}}}(i,j)\) can be any cost function (for example, KL divergence) that can measure the distance between two expression distributions; however, MML distance enables defining complete descriptions for hypotheses, considering both model complexity and data fit, unlike KL divergence, which computes the expected log-likelihood ratio, disregarding model complexity.
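For instance, a symmetrized closed-form Gaussian KL divergence could be swapped in as the match cost. The sketch below is purely illustrative (it is not what G2G uses by default; the closed form is the standard KL divergence between two Gaussians):

```python
import numpy as np

def kl_gaussian(mu0, sd0, mu1, sd1):
    """KL(N(mu0, sd0) || N(mu1, sd1)) in nats (closed form for Gaussians)."""
    return np.log(sd1 / sd0) + (sd0**2 + (mu0 - mu1) ** 2) / (2 * sd1**2) - 0.5

def cost_match_kl(mu_R, sd_R, mu_Q, sd_Q):
    """A symmetrized KL-based alternative to the MML match cost (illustrative)."""
    return 0.5 * (kl_gaussian(mu_R, sd_R, mu_Q, sd_Q) +
                  kl_gaussian(mu_Q, sd_Q, mu_R, sd_R))
```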
Reporting alignment statistics over gene-level alignments
Distribution of alignment similarities
The distribution of 'alignment similarity' statistics (the percentage of [M, V, W] states in the five-state string generated by G2G for each gene) and the average 'alignment similarity' across all genes quantify the degree of concordance between the reference and query. Genes are ranked from the temporally most distant to the most similar using these alignment similarities.
Aggregate alignment
G2G generates a single, cell-level (average) alignment across all genes (or any subset of genes) using their optimal alignment landscapes (\(L\) matrices). \(L(i,j)\) gives the optimal ending state of the prefixes \({R}_{1:j}\) and \({Q}_{1:i}\). Across all gene-specific \(L\) matrices, there is a five-state frequency distribution for each \({R}_{j}\) and \({Q}_{i}\). To generate an aggregate alignment, we begin traversal from \(L({|Q|},{|R|})\) and choose the most probable state \(x\in \varOmega\) for \({R}_{{|R|}}\) and \({Q}_{{|Q|}}\) as the most frequent state across all genes. Accordingly, we traverse to the next matrix cell and repeat the same process until we reach \(L(0,0)\); for any \(L(i,j)\), if \(x\) = M, the next cell is \(L(i-1,j-1)\); if \(x\) = D, the next is \(L(i,j-1)\); and so on. Finally, we obtain an aggregate five-state alignment string.
Clustering alignment patterns
We employ agglomerative hierarchical clustering under the average linkage criterion (in sklearn v.1.2.2) to identify groups of genes that show similar alignment patterns, given a pairwise distance matrix between all alignment strings. The distance threshold parameter of the linkage controls where cluster merging stops, allowing inspection of different clustering structures at different levels of the hierarchy.
Defining a distance measure for five-state alignment strings
Clustering alignments requires defining a distance measure between two alignment paths. While the polygonal-area-based distance measure62 works for three-state alignments, it cannot distinguish warps from indels. Two commonly used string distance measures are the Levenshtein distance and the Hamming distance. The Levenshtein distance is the minimum number of edits (substitutions, insertions and deletions) needed to transform one string into another. G2G computes pairwise Levenshtein distances between alignment strings (using leven v.1.0.4), normalized by the maximum length of the strings being compared. The Hamming distance is the minimum number of single-character substitutions needed to transform one of two equal-length strings into the other. G2G computes pairwise Hamming distances using scipy.spatial.distance.cdist (in SciPy v.1.10.1), with the alignment strings encoded as equal-length binary strings (of size \({|R|}+{|Q|}\)) (Supplementary Fig. 6a). Each alignment string is binary-encoded by traversing its alignment path and recording, for each \(R\) and \(Q\) time point, the match/mismatch state \(x\) of the respective pseudotime points (\(x\in\) [M, V, W] is encoded as 1; \(x\in\) [I, D] is encoded as 0). These \(R\) and \(Q\) binary strings are then concatenated. Note that both the Levenshtein and Hamming distances are normalized to the range [0,1].
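A sketch of both distance computations follows (the helper names, toy strings and binary-encoding details are ours for illustration; leven is the package named above, and python-Levenshtein or rapidfuzz would work equally well):

```python
import numpy as np
from scipy.spatial.distance import cdist
from leven import levenshtein

def normalized_levenshtein(s1, s2):
    """Levenshtein distance between two five-state strings, normalized by the
    longer string's length (range [0, 1])."""
    return levenshtein(s1, s2) / max(len(s1), len(s2))

def binary_encode(alignment, nR, nQ):
    """Encode a five-state alignment string as a length |R| + |Q| binary
    vector: 1 where a time point is matched (M/V/W), 0 where it is
    mismatched (I/D). A sketch of the encoding described above."""
    r_bits, q_bits = [0] * nR, [0] * nQ
    i = j = 0                                  # next Q and R time points
    for state in alignment:
        if state == "M":                       # one-to-one match
            r_bits[j] = q_bits[i] = 1; i += 1; j += 1
        elif state == "W":                     # extra R point warped onto current Q point
            r_bits[j] = 1; j += 1
        elif state == "V":                     # extra Q point warped onto current R point
            q_bits[i] = 1; i += 1
        elif state == "D":                     # deletion: R point mismatched
            j += 1
        else:                                  # "I" insertion: Q point mismatched
            i += 1
    return np.array(r_bits + q_bits)

strings = ["MMMM", "MMIIDD"]                   # toy alignments of 4 R and 4 Q points
lev = normalized_levenshtein(strings[0], strings[1])
encoded = np.vstack([binary_encode(s, nR=4, nQ=4) for s in strings])
ham = cdist(encoded, encoded, metric="hamming")   # "hamming" is already in [0, 1]
```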

Choosing the right string distance measure
We tested both the Levenshtein and Hamming distances. In hierarchical clustering, the number of clusters decreases as the distance threshold increases. Ideally, the bottom level of an optimal hierarchical clustering of strings should represent each unique string (that is, the maximum number of clusters at the minimum distance threshold should equal the number of unique strings). When clustering with the Levenshtein distance, we observe this across all datasets. The Hamming distance, however, does not guarantee this (Supplementary Fig. 6b). This agrees with the theoretical expectation that the Levenshtein distance can distinguish all five individual states, whereas the Hamming distance can only distinguish matches and mismatches. Therefore, we recommend the Levenshtein distance for alignment clustering.

Choosing the distance threshold for hierarchical clustering
A common strategy is to determine the distance threshold empirically based on the mean silhouette coefficient63 (MSC) over all data samples. The MSC ranges over [−1,1], where a high positive MSC indicates well-separated clusters, a low positive MSC close to 0 indicates overlapping clusters, and a negative MSC indicates incorrect assignments. We obtain a clustering for every eligible threshold in the range [0,1.0] with a 0.01 step size and compute their MSCs using sklearn.metrics.silhouette_score.
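A minimal sketch of this threshold scan with scikit-learn follows (the function name is ours; D is assumed to be a precomputed, normalized pairwise alignment-string distance matrix):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def scan_distance_thresholds(D, thresholds=np.arange(0.01, 1.0, 0.01)):
    """Average-linkage hierarchical clustering on a precomputed distance
    matrix D across a range of distance thresholds, returning (threshold,
    n_clusters, MSC) tuples wherever the silhouette score is defined."""
    results = []
    for t in thresholds:
        labels = AgglomerativeClustering(
            n_clusters=None, distance_threshold=t,
            metric="precomputed", linkage="average").fit_predict(D)
        k = len(set(labels))
        if 1 < k < len(D):              # silhouette needs 2 <= k <= n - 1
            msc = silhouette_score(D, labels, metric="precomputed")
            results.append((t, k, msc))
    return results
```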

Hierarchical clustering of alignment strings requires a tradeoff
Generally, the best clustering is considered to be the one with the highest MSC; however, for strings, we observe that the highest MSC is attained at the maximum possible number of clusters (equal to the number of unique alignment strings). In gene-level trajectory alignment, many unique alignment patterns can emerge owing to subtle differences in their optimal alignment states across pseudotime points. For instance, in our simulated dataset, there are 113 unique strings covering seven alignment patterns (Extended Data Fig. 2c). Our objective is a less noisy, biologically interpretable clustering, and we note the importance of manual inspection to decide on a tradeoff between the number of clusters and cluster resolution. We recommend choosing the distance threshold that provides a good tradeoff between the MSC and the number of clusters in capturing the main alignment patterns. In our simulated dataset, such a tradeoff is given by the threshold 0.22, corresponding to the second-highest locally optimal MSC of 0.82. This results in 15 clusters, including the seven major clusters, with only a 0.1% mis-clustering rate (the percentage of outliers across all clusters) (Fig. 3e). The remaining clusters are small clusters covering 31 (0.8%) alignments that were separated owing to noise such as warps. G2G enables the user to inspect these cluster diagnostics through the distance threshold versus MSC plots and the cluster-specific average alignment patterns.
Pathway over-representation analysis
We select the top \(k\) mismatching genes (with ≤40% alignment similarity) to analyze their biological/signaling pathway over-representation. The identified clusters of genes are also analyzed. We use the GSEApy (v.1.0.4) Enrichr62,64,65 wrapper against the MSigDB_Hallmark_2020 (ref. 66) and KEGG_2021_Human pathway gene sets66,67. For all analyses, a 0.05 significance threshold on the adjusted P value (computed using the default hypergeometric test and Benjamini–Hochberg false discovery rate (FDR) correction of the GSEApy–Enrichr interface) was applied.
Determining the best parameter setting
G2G has several key parameters: the interpolation structure, the window_size of the Gaussian kernel used for interpolation and the five-state machine parameters.
Interpolation structure
The number of equispaced interpolation time points (\(m\)) over the [0,1] range decides the resolution of a trajectory alignment (a higher \(m\) gives higher resolution). Low resolution can be less representative of the dynamic process, whereas high resolution introduces noise or redundancy. The optimal \(m\) is a tradeoff that depends on the datasets. We use optBinning68 (v.0.18.0) to heuristically decide the optimal \(m\) for reference and query separately. Using ContinuousOptimalBinning, we first infer an optimal binning of the pseudotime distribution and then use the number of bins produced as the \(m\) for our equispaced interpolation. In all datasets except the T cell datasets, optBinning returned an equal number of optimal bins for both reference and query. For the T cell datasets, we obtained 15 and 14 bins, respectively; for consistency, we selected the minimum (14). We do not use the optimal splits returned by optBinning, as this irregular binning structure is inconsistent for alignment.
Window_size
This controls the effective cell neighborhood used to estimate the weighted mean and variance of expression at each interpolation time point. CellAlign7 found a window_size of 0.1 to be the most effective for standard single-cell datasets (with a tradeoff between noise and locality); thus, we use the same value across all our experiments and analyses.
Five-state machine
The parameters [\(\Pr\)(M | M), \(\Pr\)(I | I) and \(\Pr\)(M | I)] were optimized using grid search, while fixing \(\Pr\)(M | M) = 0.99 to enforce the highest probability for continuous matches rather than single-point matches. [\(\Pr\)(I | I) = 0.1, \(\Pr\)(M | I) = 0.7] yielded the lowest false mismatch rate across all G2G alignments on our simulated dataset. It remained optimal when varying \(\Pr\)(M | M) in [0.1,1.0] (Supplementary Table 2). Therefore, we set it as the default. For the initial states, we use [\(\Pr\)(M) = 0.99 and \(\Pr\)(D) = \(\Pr\)(I) = 5 × 10−5].
Benchmarking against CellAlign and TrAGEDy alignment
DTW gene-level and cell-level high-dimensional alignments were generated using CellAlign's7 (v.0.1.0) globalAlign function, following the interpolation and scaling defined in their documentation. DTW gene-level alignments were clustered using CellAlign's pseudotimeClust function. Similarly, TrAGEDy's post hoc-processed DTW alignments were generated using the script published by Laidlaw et al.17, following their documentation. The same number of interpolation time points was used across all CellAlign7, TrAGEDy and G2G alignments.
Both CellAlign7 and TrAGEDy were run with Euclidean distance for DTW (note that TrAGEDy recommends Spearman correlation, which is mathematically undefined for single-gene observations; thus, we used Euclidean distance for both cell-level and gene-level TrAGEDy alignment for consistency).
Datasets
Datasets for simulated experiments
Simulating different alignment patterns using Gaussian processes
We modeled the log-normalized expression of gene \(x\) as a function \(f\) of time \(t\) using a Gaussian process (GP):$$f(t) \sim {GP}(\;\vec{\mu }(t),K(t,t{\prime} ))$$where \(\vec{\mu }(t)\) is the mean vector and \(K(t,t{\prime} )\) is a kernel function evaluating the covariance between every pair of the finite time points at which \(f(t)\) is evaluated, thereby controlling the characteristics of \(f(t)\) (for example, a radial basis function (RBF) kernel generates smooth, non-branching functions, whereas a change point kernel generates branching functions). A GP with a suitable kernel can thus simulate different trajectory patterns of single-cell gene expression across pseudotime. Following standard textbooks and the kernels discussed in the literature69,70, we implemented a simulator using GPyTorch (v.1.5.1) for three types of alignment patterns (matching, divergence and convergence), each trajectory comprising 300 cells spread across the pseudotime range [0,1].

Generating a matching pair of reference and query gene trajectories
We used a GP with a constant \(c\) mean vector \(\vec{{\mu }_{c}}\) (\(c\in [0.5,9.0]\), uniformly randomly sampled) and an RBF kernel \(K\) to sample \(\mu (t)\), which describes the average expression at each time point. Next, we sampled two trajectories, \({{GEX}}_{{ref}}(t)\) and \({{GEX}}_{{query}}(t)\), from a GP with mean \(\mu (t)\) and kernel \({\sigma }^{2}I\) (\(\sigma \in [0.05,1.0]\), uniformly randomly sampled, and \(I\) the identity matrix).$$\mu (t) \sim N({\vec{\mu }}_{c},K\;)$$$${{GEX}}_{{ref}}(t) \sim N(\;\mu (t),{\sigma }^{2}I\;)$$$${{GEX}}_{{query}}(t) \sim N(\;\mu (t),{\sigma }^{2}I\;)$$
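A plain NumPy sketch of this generative process follows (the paper used GPyTorch; the RBF lengthscale and numerical jitter here are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(t, lengthscale=0.2, variance=1.0):
    """Squared-exponential (RBF) covariance over a vector of time points."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 300)                       # pseudotime of 300 cells
c = rng.uniform(0.5, 9.0)                        # constant mean level
sigma = rng.uniform(0.05, 1.0)                   # observation noise s.d.

K = rbf_kernel(t) + 1e-6 * np.eye(len(t))        # jitter for numerical stability
mu_t = rng.multivariate_normal(np.full(len(t), c), K)   # shared mean trajectory
gex_ref = rng.normal(mu_t, sigma)                # reference trajectory
gex_query = rng.normal(mu_t, sigma)              # query trajectory (same process)
```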

Generating a divergence pair of reference and query gene trajectories
We used a change point (CP) kernel, which imposes a bifurcation in a trajectory as it reaches an approximate time point \({t}_{{CP}}\) (also called a change point). It activates one covariance function before \({t}_{{CP}}\) and another after \({t}_{{CP}}\). We used the following CP kernel69,70:$${K}_{{CP}}(t,t{\prime} )=a{\prime} {K}_{1}(t,t{\prime} )+a{K}_{2}(t,t{\prime} )$$where$$a=\sigma (t)\sigma (t{\prime} )$$$${a}^{{\prime} }=\left[1-\sigma \left(t\right)\right]\,[1-\sigma (t{\prime} )]$$$$\sigma (x)=\frac{1}{1+\exp (-s(x-{t}_{{CP}}))}\left({\mathrm{sigmoid}}\;{\mathrm{function}}\right)$$with \(s\) acting as the CP steepness parameter. Penfold et al.69 define a branching process by enforcing a zero kernel (\({K}_{1}\)) before \({t}_{{CP}}\) and another suitable kernel (\({K}_{2}\)) afterwards. We used an RBF kernel for \({K}_{2}\). The generative process is as follows, starting with a base mean function \(\mu (t)\) sampled from a separate GP with a constant \(c\) mean vector \(\vec{{\mu }_{c}}\) (\(c\in [0.5,9.0]\), uniformly randomly sampled) and an RBF kernel \(K\):$$\mu (t) \sim N(\;\vec{{\mu }_{c}},K\;)$$$${f}_{1}(t) \sim N(\;\mu (t),{K}_{{CP}})$$$${f}_{2}(t) \sim N(\;\mu (t),{K}_{{CP}})$$$${{GEX}}_{\rm{ref}}(t) \sim N(\;{f}_{1}(t),{\sigma }^{2}I\;)$$$${{GEX}}_{\rm{query}}(t) \sim N(\;{f}_{2}(t),{\sigma }^{2}I\;)$$
Next, the two functions \({f}_{1}(t)\) and \({f}_{2}(t)\) were sampled from a GP with mean \(\mu (t)\) and kernel \({K}_{{CP}}(t,t{\prime} )\), and were then used as mean vectors to generate \({{GEX}}_{\rm{ref}}(t)\) and \({{GEX}}_{\rm{query}}(t)\) with kernel \({\sigma }^{2}I\) (\(\sigma\) = 0.3, a moderate constant). This was run for [\({t}_{{CP}}\) = 0.25, \({t}_{{CP}}\) = 0.5 and \({t}_{{CP}}\) = 0.75], resulting in three groups of divergence with varying bifurcation points (early, mid and late divergence). We then filtered the generated pairs to retain simple/clear divergence patterns (a stable ground truth with no complex patterns) using basic heuristics, such as the difference between the mean expression before divergence and at the end terminals of the reference and query.
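A NumPy sketch of this change point construction follows (hyperparameters are illustrative assumptions; following the intent described above, the kernel has ~zero covariance before \({t}_{{CP}}\) and RBF covariance after, so the two sampled branches coincide early and deviate independently after the change point):

```python
import numpy as np

def rbf(t, lengthscale=0.2, variance=1.0):
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sigmoid(x, t_cp, s=50.0):
    return 1.0 / (1.0 + np.exp(-s * (x - t_cp)))

def change_point_kernel(t, t_cp, s=50.0):
    """Sigmoid-weighted mixture of a zero kernel (active before t_cp) and an
    RBF kernel (active after t_cp); the zero kernel contributes nothing, so
    only the RBF term weighted by sigma(t) sigma(t') remains."""
    w = sigmoid(t, t_cp, s)
    return (w[:, None] * w[None, :]) * rbf(t)

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 300)
jitter = 1e-6 * np.eye(len(t))
mu_t = rng.multivariate_normal(np.full(len(t), 4.0), rbf(t) + jitter)  # base mean
K_cp = change_point_kernel(t, t_cp=0.5) + jitter
f1 = rng.multivariate_normal(mu_t, K_cp)        # reference branch
f2 = rng.multivariate_normal(mu_t, K_cp)        # query branch, diverging after t_cp
gex_ref = rng.normal(f1, 0.3)                   # observation noise, sigma = 0.3
gex_query = rng.normal(f2, 0.3)
```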
Extended Data Fig. 1a–c shows that the branching may start slightly before the CP. Therefore, we expect the early nondivergent segment to continue at least until the time point \(i < {t}_{{CP}}\) at which we begin to see >0.01 covariance in the CP kernel. Accordingly, given our approx_bifurcation_start_point = \(i\), we expect match lengths to fall between a lower limit of n_total_pseudotime_points × \(i\) and an upper limit of n_total_pseudotime_points × change_point.
Equivalently, we expect mismatch lengths to fall between a lower limit of n_total_pseudotime_points × (1 − change_point) and an upper limit of n_total_pseudotime_points × (1 − \(i\)).
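For example, with 50 pseudotime points, change_point = 0.5 and approx_bifurcation_start_point = 0.4 (illustrative values):

```python
n_points, change_point, approx_start = 50, 0.5, 0.4
match_bounds = (n_points * approx_start, n_points * change_point)                 # (20.0, 25.0)
mismatch_bounds = (n_points * (1 - change_point), n_points * (1 - approx_start))  # (25.0, 30.0)
```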

Generating a convergence pair of reference and query gene trajectories
We simply inverted the divergence pairs generated above, as convergence and divergence are complementary to each other.
The final dataset comprises 3,500 pairs, covering seven alignment patterns: 500 matching pairs, 1,500 divergence pairs (500 early + 500 mid + 500 late) and 1,500 convergence pairs (500 early + 500 mid + 500 late).
Simulating mismatches on real scRNA-seq data
We downloaded the E15.5 mouse pancreas development dataset29 from the CellRank package71 and subsetted it to the β-cell lineage (1,845 cells) using the original author annotations ('Ngn3 low EP', 'Ngn3 high EP', 'Fev+' and 'Beta'). We selected the lineage-driver genes that are significantly associated with differentiation potential toward β-cells (769 of 2,000 highly variable genes at 1% FDR), using CellRank (v.1.5.1) following their tutorials (https://cellrank.readthedocs.io/en/stable/auto_examples/estimators/compute_lineage_drivers.html). For trajectories, we used the diffusion pseudotime estimated by the CellRank authors. To simulate trajectories for alignment, we divided the pseudotime (between 0 and 1) equally into 50 bins, assigned cells to bins based on their pseudotime and randomly split the cells of each bin into reference and query datasets. To simulate deletions of n bins, we excluded query cells from the first n bins (cells whose pseudotime ≤ the nth bin's upper margin). To simulate insertions of n bins, we shifted the query cell expression of the first n bins by the s.d. calculated across all bins for the gene of interest in the query cells. The query pseudotime axis was min–max-normalized after perturbation.
Datasets for benchmarking G2G
Dendritic cell stimulation dataset
The normalized single-cell datasets of PAM/LPS stimulation and their pseudotime estimates (downloaded from the CellAlign7 GitHub repository) were converted into AnnData objects. These contain two gene sets: the 'core antiviral module' (99 genes) and the 'peaked inflammatory module' (89 genes), preselected in the original publication20 and referred to as 'global' and 'local', respectively, by Alpert et al.7. The datasets include 179 PAM-stimulated cells and 290 LPS-stimulated cells. CellAlign7 used 200 interpolation points for their PAM/LPS analysis, whereas the optBinning68 structure suggests that 14 points are sufficient.
Simulated dataset containing trajectories with no shared process
This is a simulated negative control, generated using the published script by Laidlaw et al.17, which uses DynGen72, a single-cell data simulator for dynamic processes. This dataset contains two trajectories simulated under two different gene regulatory networks and TF activities, ensuring no shared process between them. The reference and query have 619 genes across 2,000 and 1,940 cells, respectively.
Dataset for healthy versus IPF case study
We downloaded the healthy and IPF datasets21 from the Gene Expression Omnibus (GEO) (GSE136831) and extracted the lineages of AT2 to AT1 cell differentiation for healthy and AT2 to ABC differentiation for IPF (based on the original author annotations). We identified a subset of 54 low-quality AT2 cells (with a high percentage of ribosomal gene expression) that were filtered out before further analysis. Next, we subsetted the healthy and IPF cells to independently preprocess them and estimate their pseudotime. This includes 3,157 healthy cells (2,655 AT2 and 502 AT1) from 28 individuals and 890 IPF cells (442 AT2 and 448 ABCs) from 31 individuals. We first normalized their per-cell total transcript count to 10,000, followed by log1p transformation, and selected 4,000 HVGs to estimate cell pseudotime using Diffusion Pseudotime35 implemented in SCANPY73 (v.1.9.6) with default parameter settings. The root cell for each trajectory was chosen based on the expression score of known AT2 progenitor cell markers: AXIN2, FGFR2, ID2, FZD6, LRP5 and LRP6 (refs. 74,75) (using scanpy.tl.score_genes).
ABC-specific marker genes were obtained from the original paper's supplementary file (aba1983_data_s2.txt)21.
Dataset preparation for in vivo and in vitro T cell development comparison
Cell cultures for ATOs and scRNA-seq experiment
The MS5 line transduced with human DLL4 was obtained from G. Crooks (UCLA) as a gift. The MS5-hDLL4 cells were cultured in DMEM (Gibco, 41966) with 10% FBS (Gibco, 16000044). Two iPS cell lines were used in this study. Cell lines HPSI0114i-kolf_2 (Kolf) and HPSI0514i-fiaj_1 (Fiaj) were obtained from the Human Induced Pluripotent Stem Cell initiative (HipSci: www.hipsci.org) collection. All iPS cell lines were cultured on vitronectin-coated (diluted 1:25 in PBS; Gibco, A14700) plates, in TeSR-E8 medium (STEMCELL Technologies, 05990).
We followed the PSC-ATO protocol as previously described40. iPS cells were collected as a single-cell suspension and seeded (3 × 106 cells per well) in GFR-reduced Matrigel-coated (Corning, 356231) six-well plates in X-VIVO 15 medium (Lonza, 04-418Q), supplemented with rhActivin A, rhBMP4, rhVEGF, rhFGF (R&D Systems, 338-AC-010/314-BP-010/298-VS-005/233-FB-010) and ROCK inhibitor (Y27632; LKT Labs, Y1000) on day −17 and only rhBMP4, rhVEGF and rhFGF on days −16 and −15. Cells were collected 3.5 days later (day −14) and isolated by fluorescence-activated cell sorting (FACS) for CD326−CD56+ (PE anti-human CD326 antibody, BioLegend, 324205; APC anti-human CD56 antibody, BioLegend, 318309) human embryonic mesodermal progenitors (hEMPs). Representative FACS plots are shown in Supplementary Fig. 7a.
Isolated hEMPs were combined with MS5-hDLL4 at a ratio of 1:50. Two or three cell-dense droplets (5 × 105 cells in 6 μl hematopoietic induction medium) were deposited on top of an insert in each well of a six-well plate. Hematopoietic induction medium composed of EGM2 (Lonza, CC-3162) supplemented with ROCK inhibitor and SB blocker (TGF-β receptor kinase inhibitor SB-431542; Abcam, ab120163) was added into the wells outside the inserts so that the cells sat at the air–liquid interface. The organoids were then cultured in EGM2 with SB blocker for 7 days (days −14 to −7), before the addition of the cytokines rhSCF, rhFLT3L and rhTPO (Peprotech, 300-07/300-19/300-18) between days −6 to 0. These 2 weeks formed the hematopoietic induction phase. On day 1, the medium was changed again to RB27 (RPMI (Corning, 10-040-CV) supplemented with B27 (Gibco, 17504-044), ascorbic acid (Sigma-Aldrich, A8960-5G), penicillin–streptomycin (Sigma-Aldrich, P4333) and glutaMAX (Thermo Fisher Scientific, 35050061)) with rhSCF, rhFLT3L and rhIL7 (Peprotech, 200-07). The organoids can be maintained in culture for 7 more weeks in this medium.
For dissociation of organoids on day −7, they were removed from the culture insert and incubated in digestion buffer, which consisted of collagenase type IV solution (STEMCELL Technologies, 07909) supplemented with 0.88 mg ml−1 collagenase/dispase (Roche, 10269638001) and 50 U DNase I (Sigma, 9003-98-9 D5025-15KU), for 20 min at 37 °C. Vigorous pipetting was performed in the middle of the incubation and at the end. After complete disaggregation, a single-cell suspension was prepared by passing through a 50-μm strainer.
For dissociation of organoids from day 0 onwards, a cell scraper was used to detach ATOs from the cell culture insert membranes, and detached ATOs were then submerged in cold flow buffer (PBS (Gibco, 14190144) containing 2% (v/v) FBS and 2 mM EDTA (Invitrogen, 15575020)).
Culture inserts were washed and the detached ATOs were pipetted up and down to form a single-cell suspension before passing through a 50-μm strainer.
Cells were then stained with designed panels of antibodies and analyzed by flow cytometry. FACS was performed at the same time, and live human 4,6-diamidino-2-phenylindole (DAPI)− anti-mouse CD29− (Invitrogen, D1306; APC/Cy7 anti-mouse CD29 antibody, BioLegend, 102225) cells were sorted for the day −7, day 0 and week 3 samples, and live (DAPI−) cells were sorted for the week 5 and week 7 samples, before loading onto each channel of a Chromium chip from a Chromium single-cell V(D)J kit (10X Genomics). Representative FACS plots are shown in Supplementary Fig. 7b. The metadata for all the ATO samples can be found in Supplementary Table 8. For the day −14 sample, some sorted (both hEMP and the rest of the DAPI− fraction) and unsorted cells were stained with hashtag antibodies (TotalSeq-C antibodies from BioLegend (Supplementary Table 9), following a 10x cell surface protein-labeling protocol) before being mixed together with some mouse stromal cells (MS5-hDLL4) for 10x loading. For the week 1 sample, hashtag antibodies were added at the same time as the FACS antibodies, before sorting. All antibodies were added as 2 µl per antibody in a total of 100 µl staining solution.
Single-cell complementary DNA synthesis, amplification and gene expression (GEX) and cell surface protein (CITE-seq) libraries were generated following the manufacturer's instructions. Sequencing was performed on the Illumina NovaSeq 6000 system. The gene expression libraries were sequenced at a target depth of 50,000 reads per cell using the following parameters: Read1, 26 cycles; i7, 8 cycles; i5, 0 cycles; Read2, 91 cycles, to generate 75-bp paired-end reads.
ATO data preprocessing and annotation
Raw scRNA-seq reads were mapped using Cell Ranger (v.3.0.2) with a combined human (GRCh38.93) and mouse (mm10-3.1.0) reference. Low-quality cells were filtered out (minimum number of reads of 2,000, minimum number of genes of 500, maximum number of genes of 7,000, Scrublet76 (v.0.2.3) doublet detection score <0.15 and mitochondrial read fraction <0.2). Cells where the percentage of counts from human genes was <90% were considered mouse cells and excluded from downstream analysis. Cells were assigned to different cell lines (Kolf or Fiaj) using genotype prediction with Souporcell (v.2.4.0)77. The mapping outputs of the eight samples were merged, with the sample ID prepended to the barcode IDs in both the BAM and barcodes.tsv files to prevent erroneous cross-sample barcode overlap. Souporcell was run with --skip_remap True -k 2 and the common variants file based on common (≥2% population allele frequency) single-nucleotide polymorphisms (SNPs) from 1000 Genomes Project data, as distributed in the tool's repository. We selected two clusters because the two cell lines were already known. Next, the data went through the standard pipeline of filtering out genes (cell cycle78 genes and genes detected in <3 cells) and normalizing the per-cell total count to 10,000, followed by log1p transformation and scaling to zero mean and unit variance (with max_value = 10 to clip after scaling), using SCANPY73. The final dataset had 31,483 ATO cells with 23,526 genes, which were input to CellTypist41 (v.0.1.4) for annotation prediction using the pretrained logistic regression classifier (Pan_Fetal_Human model) under majority voting.
We then obtained the Uniform Manifold Approximation and Projection (UMAP) embedding for this dataset based on its scVI42 (v.0.14.5) batch-corrected embedding (with ten latent dimensions, two hidden layers, 128 nodes per hidden layer and a 0.2 dropout rate for the neural network) and subsetted cells to non-hematopoietic lineage, T/ILC/NK lineage and other hematopoietic lineage cells (Extended Data Fig. 8) using Leiden clustering. For each lineage, scVI and UMAP embeddings were recomputed and cell types were annotated (low-level annotations in Extended Data Fig. 6b, with more refined annotations in Extended Data Fig. 7a) by inspecting both CellTypist predictions (Extended Data Fig. 7b) and marker gene expression (Extended Data Fig. 8).
Joint embedding of reference and organoid for pseudotime estimation
We downloaded the annotated human fetal atlas dataset from https://developmental.cellatlas.io/fetal-immune and extracted the cell types (79,535 cells in total) representing the T cell developmental trajectory from progenitor cells toward type 1 innate T cells (T1 dataset), including cycling MPP, HSC_MPP, LMPP_MLP, DN(early) T, DN(P) T, DN(Q) T, DP(P) T, DP(Q) T, ABT(entry) and type 1 innate T cells. We then compiled a reduced representation (20,384 cells), preserving the underlying cell-type composition by random subsampling from each cell type (with a minimum sample size of 500 cells, aiming for ~20,000 cells in total) based on their annotations. Such stratified sampling is practical for dealing with massive single-cell datasets to reduce resource demands. Separately, we extracted cell types from the ATO dataset (19,013 cells) representing the trajectory from iPS cells toward SP T cells, including iPS cells, primitive streak, mesodermal progenitor, endothelium, HSC_MPP, HSC_MPP/LMPP_MLP/DC2, DN(early) T, DN T, DP(P) T, DP(Q) T, ABT(entry) and SP T cells.
The T1 and ATO datasets were merged and preprocessed together by filtering out cells with more than 8% total mitochondrial unique molecular identifiers, cell cycle genes78 and genes expressed in <3 cells. Next, HVGs were selected after normalizing the per-cell count to 10,000 reads and log1p transformation. The T1 pan fetal reference had 33 batches (due to different 10x chemistries, 3′ versus 5′, and donors), whereas the ATOs had two batches (due to the two cell lines, Kolf and Fiaj). We constructed a batch-corrected scVI latent embedding (ten latent dimensions, two hidden layers, 128 nodes per hidden layer and a 0.2 dropout rate for the neural network) of the merged dataset. This was used to build the cell neighborhood graph and UMAP embedding using SCANPY73. The final T1 and ATO datasets comprised 20,327 cells and 17,176 cells, respectively. A total of 18,436 T1 cells and 10,089 ATO cells belong to the T cell lineage (DN T onwards).
We followed a similar preprocessing for the pan fetal reference representing the trajectory toward CD8+ T cells (CD8+ dataset), including cycling MPP, HSC_MPP, LMPP_MLP, DN(early) T, DN(P) T, DN(Q) T, DP(P) T, DP(Q) T, ABT(entry) and CD8+ T cells. The initially extracted CD8+ subset (83,177 cells) was reduced to 20,412 cells, which was then merged with the 19,013 ATO cells and subjected to the same filtering and normalization carried out for the T1 + ATO merge before scVI integration.
The final CD8+ dataset comprised 20,324 cells, of which 18,490 cells were DN T onwards.
Pseudotime estimation using GPLVM
Differentiation pseudotime was estimated separately for the T1 reference, the CD8+ reference and the ATO dataset by employing GPLVM43,79 (a probabilistic nonlinear dimensionality reduction method that models gene expression as a function \(f(X)\) of a set of latent covariates X). It enables incorporating time priors when estimating pseudotime as a latent dimension. GPLVM has previously been successful in single-cell trajectory inference with useful priors79,80,81,82,83. We used the Pyro84 (v.1.8.0) GPLVM implementation with sparse GP inference (32 inducing points), an RBF kernel and the Adam optimizer to obtain an optimal two-dimensional latent embedding, where the two dimensions correspond to pseudotime and a second level of latent effects (for example, batch), respectively. The pseudotime was assigned a Gaussian prior with cell sampling times as the mean. The second dimension was zero-initialized. The GPLVM loss curve converged reasonably within 2,000 optimization iterations.
For the ATO data, GPLVM was initialized with cell sampling (capture) days as the prior. As there were no temporal data for the pan fetal reference, we first approximated a time prior for each reference cell as the weighted average of the capture times of its k-nearest organoid neighborhood (kNN). A k = 3 organoid neighborhood for each reference cell was obtained using the cKDTree-based search method implemented in BBKNN85 (v.1.5.1) on their scVI-based UMAP embedding. The contribution of each organoid neighbor was weighted according to its distance (the kNN distance vector was softmax-transformed and the normalized reciprocal of each distance was taken as the associated weight, enforcing less contribution from distant neighbors toward the weighted average). This approximation may introduce outliers due to the spatial arrangement of different cell types in the UMAP. Thus, we leveraged the known cell-type annotations to refine the approximation by assigning each reference cell the average approximated capture time of its cell type. These approximated capture times were scaled to the [0,1] range and input as the mean of the Gaussian prior of pseudotime to the previously described GPLVM. For the T1 and CD8+ GPLVMs, the input gene space was 2,608 genes and 2,616 genes, respectively (the same as at scVI integration). To ensure no outliers, the GPLVM-estimated pseudotime was further refined by correcting outliers of each cell type using the cell-type-specific average of the estimated pseudotime. Outliers were selected based on the interquartile range (IQR) rule (1.5 × IQR below the first quartile and above the third quartile of the cell-type-specific pseudotime distribution).
G2G alignment
For the complete T1 versus ATO comparison using G2G, the total common gene space of 20,240 genes was considered after filtering genes expressed in <3 cells, 10,000 total-count per-cell normalization and log1p transformation. For the DN T onwards comparison, there were 17,718 genes for T1 versus ATO and 20,183 genes for CD8+ versus ATO. These total gene spaces were subsetted to include only the transcription factors44 (1,371 TFs) and the relevant signaling pathways focused on in this work. G2G alignment was performed using 14 equispaced pseudotime points.
High-dimensional alignment using CellAlign and TrAGEDy
To compare the G2G alignment of T1 versus ATO against the CellAlign and TrAGEDy alignments, we followed the same steps and parameter settings for those tools as described in the previous section 'Benchmarking against CellAlign7 and TrAGEDy17 alignment'. The alignment of 14 equispaced interpolation pseudotime points across the 1,371 TFs gave a three-state alignment string (MVVVVVVMWMWWWVWWWMWVVVWW) from CellAlign and a five-state alignment string (IIIIIIMMMWWWWMWWMIDIDID) from TrAGEDy.
ATO TNF preliminary validation experiment
For the TNF validation experiment, we followed the PSC-ATO protocol described previously using the Kolf iPS cell line. One plate (12 organoids) was set up for each of four TNF (R&D Systems, 210-TA) conditions: control (no TNF addition), 1 ng ml−1 (final concentration), 5 ng ml−1 and 25 ng ml−1. TNF was added into the medium between weeks 6 and 7. Organoids were collected at week 7, stained with hashtag antibodies (TotalSeq-C antibodies from BioLegend) and a designed panel of FACS antibodies, and sorted for DAPI−CD45+ cells before 10x loading. Representative FACS plots are shown in Supplementary Fig. 7c. Library construction, sequencing and data preprocessing were the same as for the wild-type ATO organoids.
Assessing T cells from ATOs before and after TNF treatment
The ATOs with TNF treatment (ATOTNF) resulted in 123 good-quality T cells (data preprocessing and filtering as above; annotated by CellTypist41 as type 1 innate T cells), of which the majority (~94% of cells) were 1 ng ml−1 TNF-treated (as opposed to wild-type, 5 ng ml−1 and 25 ng ml−1). Thus, we performed the analysis using the T cells treated with 1 ng ml−1 TNF (116 cells) that were predicted to be type 1 innate T cells by CellTypist41.
The wild-type ATOs without TNF treatment have 6,558 SP T cells and the pan fetal reference39 has 1,413 type 1 innate T cells. To check whether there is an improvement in the maturity of the resultant T cells in ATOTNF, we evaluated and compared the degree of similarity between ATOTNF and the reference against that between ATOs and the reference. This assessment was conducted at two levels: cell level and gene level. For the cell-level assessment, we constructed an scVI42 latent embedding by integrating the reference, ATOs and ATOTNF together, following the same preprocessing steps carried out for the ATOs and the reference previously. We then computed the Euclidean distance between the mean latent dimensions of in vitro SP T cells and in vivo type 1 innate T cells. For the gene-level assessment, we computed the MML distance of gene expression distributions by first identifying significantly distant genes across all the TFs, as well as across several curated pathway gene subsets from KEGG and MSigDB that are associated with TNF signaling (Supplementary Note), before and after TNF treatment. We next tested whether the difference between those distance distributions was significant using the Mann–Whitney U-test. The following section describes our approach to identifying genes with a statistically significant distance.
Estimating an empirical null distribution of MML distances to assess the significance of a distance
We expect no significant difference in gene expression between any two random subsets of cells belonging to the same cell type in the same system. Thus, we generated such random pairs of cell subsets within our pan fetal reference (type 1 innate T cell) dataset as well as within the ATO (SP T cell) dataset, across all 1,371 TFs, to construct an empirical null distribution for the MML distance.
This enables us to compute a P value for any given MML distance \(x\), indicating how likely \(x\) is to occur by chance.
For each TF, we performed uniform random sampling of a subset pair without replacement for 50 iterations, resulting in 137,100 pairs in total. Each subset contains 50 cells. Then, for each pair, we computed the MML distance. All computed MML distances together form a null distribution, which we used to evaluate the significance of a given MML distance between two expression distributions.
The resultant empirical distribution of MML distances is highly left-skewed. We constructed its empirical cumulative distribution function (CDF) using the distributions.empirical_distribution.ECDF function in statsmodels (v.0.13.5) to test the significance of distances without any assumption about the family of the distribution. The P values were adjusted for multiple testing using the Benjamini–Hochberg method before identifying significantly distant genes at a given level of significance.
Testing the statistical significance of the change in gene expression distances to the reference from wild-type ATOs versus ATOTNF
Given the MML distances of the genes that are significantly different (1) between the reference and ATOs (sample 1) and (2) between the reference and ATOTNF (sample 2), we tested whether the average gene expression distance dropped after TNF treatment. To test the significance of the change in distance, we performed a nonparametric Mann–Whitney U-test implemented in scipy.stats.mannwhitneyu.
Testing the closeness of SP T cells in the ATO to the reference type 1 innate T cells before and after TNF treatment
We compared the 6,558 SP T cells in ATO and the 116 T cells in ATOTNF against the 1,413 type 1 innate T cells from the pan fetal reference. This was carried out for all 1,371 human TFs, as well as the 196 genes in TNF signaling via the NF-κB pathway (from MSigDB).
For each gene set, we first computed the MML distances of the genes between (1) the reference and ATOs and (2) the reference and ATOTNF. We then computed a P value for each MML distance under the empirical CDF estimated by computing MML distances between randomly sampled homogeneous subsets across the reference and ATO cells. This allowed us to identify the gene set that was significantly different between the reference and ATOs (diffsetUNTREATED) and the gene set that was significantly different between the reference and ATOTNF (diffsetTREATED). Next, we compared the average distance of genes in diffsetUNTREATED to the average distance of genes in diffsetTREATED, while testing the significance of the change in the distances using the Mann–Whitney U-test.
Software and computational requirements
The G2G framework and all analyses were implemented in Python (≥v.3.8) with commonly used scientific-computing and visualization libraries and the specific libraries mentioned in the Methods. G2G was tested on two operating systems: Ubuntu 20.04.1 LTS and Debian GNU/Linux 10 (buster). The G2G runtime depends on the number of cells in the reference and query datasets, the number of interpolation time points and the number of genes to align. G2G (v.0.1.0), used in this work, is a less-efficient version that utilizes concurrency through Python multiprocessing to speed up the gene-level alignment process. It creates a number of processes equal to the number of cores in the system, and each process performs a single gene-level alignment at a time. The per-gene runtime for a case aligning 100 genes between 20,327 reference cells and 17,176 query cells (pan fetal reference and ATO datasets), with 14 interpolation points and 16 cores on a Linux (SMP Debian 4.19.304-1 (2024-01-09)) system, was approximately 5.57 s. A notably faster version, Genes2Genes v.0.2.0, is now available; running sequentially, it takes 0.60 s per gene for the same case (SMP Debian 5.10.216-1 (2024-05-03)).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
