OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing

Methods for BulkTrajBlend

BulkTrajBlend is primarily designed to address the issue of “omitted” cells in single-cell data, making the inference of developmental or differentiation trajectories continuous. To achieve this goal, we designed BulkTrajBlend to generate potential “missing” cells from bulk RNA-seq data for inferring pseudo-time cell trajectories. This process consists of the following four steps (where communities represent cell types):

Cell proportion calculation

To estimate the proportion of cells in bulk RNA-seq, we first annotated the single-cell data with their respective cell types and aggregated the gene counts of single cells by cell type, resulting in an \(N\times M\) matrix, where \(M\) is the number of cell types and \(N\) is the number of genes. We define this \(N\times M\) matrix as the simulated bulk RNA-seq cell type matrix, and sum the \(M\) columns of each row to obtain the simulated bulk RNA-seq profile \(B_{simulated}\), which is fed into the encoder of an autoencoder (AE). We define the output of the encoder as \(T\) and train the AE so that \(T\) approaches the cell proportion, i.e., the number of cells of each type divided by the total number of cells. We then define the output of the generator (decoder) as \(G\) and bring \(G\) and \(B_{simulated}\) close to each other, using the MAE as the evaluation criterion. After training the optimal AE, we change the input to the real bulk RNA-seq profile \(B_{groundtruth}\), at which point the output of the encoder, \(T\), is the cell proportion corresponding to the real bulk data, which we use as the range of the generator space for the subsequent β-VAE.
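The following is a minimal sketch of this deconvolution step, assuming PyTorch and an annotated AnnData object; the helper names (build_simulated_bulk, ProportionAE, train_proportion_ae), layer sizes and training loop are illustrative assumptions rather than the exact BulkTrajBlend implementation.

```python
import numpy as np
import torch
import torch.nn as nn

def build_simulated_bulk(adata, celltype_key="cell_type"):
    """Aggregate raw single-cell counts by cell type (N genes x M cell types),
    then sum the M columns to obtain the simulated bulk profile."""
    cts = adata.obs[celltype_key].astype("category").cat.categories
    X = adata.X.toarray() if hasattr(adata.X, "toarray") else np.asarray(adata.X)
    by_type = np.stack([X[(adata.obs[celltype_key] == ct).values].sum(axis=0) for ct in cts], axis=1)
    bulk_simulated = by_type.sum(axis=1)
    proportions = adata.obs[celltype_key].value_counts(normalize=True).reindex(cts).values
    return bulk_simulated.astype("float32"), proportions.astype("float32"), list(cts)

class ProportionAE(nn.Module):
    """Autoencoder whose bottleneck T is trained to match the cell-type proportions."""
    def __init__(self, n_genes, n_celltypes, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_celltypes), nn.Softmax(dim=-1))
        self.decoder = nn.Sequential(nn.Linear(n_celltypes, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_genes))

    def forward(self, bulk):
        T = self.encoder(bulk)   # estimated cell proportions
        G = self.decoder(T)      # reconstructed (generated) bulk profile
        return T, G

def train_proportion_ae(bulk_sim, proportions, n_epochs=200, lr=1e-3):
    bulk = torch.tensor(bulk_sim)[None, :]
    target = torch.tensor(proportions)[None, :]
    model = ProportionAE(bulk.shape[1], target.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    l1 = nn.L1Loss()
    for _ in range(n_epochs):
        opt.zero_grad()
        T, G = model(bulk)
        loss = l1(T, target) + l1(G, bulk)  # proportion loss + MAE reconstruction loss
        loss.backward()
        opt.step()
    return model

# After training, feeding the real bulk profile to model.encoder yields the
# cell-proportion estimate used to bound the generator space of the β-VAE.
```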
Generation of single-cell data

Given a dataset \(\{X,V,W\}\), the vector \(\mathbf{x}\in \mathbb{R}^{M}\) in the gene expression matrix \(X\) represents the gene expression vector of a cell, and the vector \(\mathbf{v}\in \mathbb{R}^{K}\) in the matrix \(V\) represents the cell type proportion, satisfying \(\log (p(\mathbf{v}|\mathbf{x}))={\sum }_{k}\log (p(\mathbf{v}_{k}|\mathbf{x}))\), where \(\mathbf{v}\) is restricted by a loss function:

$$\mathrm{MAE}={\sum}_{\mathbf{v}}\left|\mathbf{v}-\hat{\mathbf{v}}\right|$$

(1)
Here \(\hat{\mathbf{v}}\) is the predicted proportion of a given cell type. The vector \(\mathbf{w}\in \mathbb{R}^{K}\) in the matrix \(W\) represents the conditionally correlated generative factors. The factor \(\mathbf{w}\) is obtained from the same class of cells through the β-VAE encoder: for each class of cells, the average value after model training represents the class-specific \(\mathbf{w}\), and it is not restricted by an additional loss function. Following Higgins et al.26, we hypothesize that gene expression vectors \(\mathbf{x}\) are generated by a probability model \(p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{v},\mathbf{w})\), where \(\boldsymbol{\theta}\) represents the generative model parameters. The model learns the joint distribution of the data \(\mathbf{x}\) and a set of latent variables \(\mathbf{z}\) (\(\mathbf{z}\in \mathbb{R}^{M}\), where \(M\ge K\)) used to generate the observed data \(\mathbf{x}\), i.e., \(p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) \approx p(\mathbf{x}|\mathbf{v},\mathbf{w})\), and approximates the true posterior distribution \(p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})\) with an approximate posterior distribution \(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\) that is easier to compute. Our goal is to ensure that the inferred latent variables \(\mathbf{z}\) capture the generative factors \(\mathbf{v}\) in a disentangled manner. A disentangled representation implies that each individual latent unit is sensitive to variations in a single generative factor while being relatively invariant to variations in other factors; knowledge of one factor can therefore be generalized to new configurations of other factors. The conditionally correlated generative factors \(\mathbf{w}\) can remain entangled in a separate subset of \(\mathbf{z}\) that is not used to represent \(\mathbf{v}\). To achieve this, we minimize the KL divergence between the approximate posterior and the true posterior:

$$\mathcal{KL}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\,||\,p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))=-{\sum}_{\mathbf{z}}q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\log (p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})\,q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}))+\log (p_{\boldsymbol{\theta}}(\mathbf{x}))$$
(2)
Minimizing this divergence is equivalent to maximizing the variational lower bound \(\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi},\mathbf{x})\), which can be written as:

$$\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi},\mathbf{x})={\sum}_{\mathbf{z}}q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\log (p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}))-\mathcal{KL}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\,||\,p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))$$
(3)
We introduce a constraint to shape the inferred posterior \(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\) and match it to a prior \(p_{\boldsymbol{\theta}}(\mathbf{z})\) that controls the capacity of the latent information bottleneck. We set the prior to an isotropic unit Gaussian, \(p(\mathbf{z})\sim \mathcal{N}(0,I)\). The constrained optimization problem can be written as:

$${\max }_{\boldsymbol{\phi},\boldsymbol{\theta}}\;{\mathbb{E}}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log (p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}))\right]\quad \text{s.t.}\quad \mathcal{KL}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\,||\,p(\mathbf{z})) < \epsilon$$
(4)
Here, \(\epsilon\) specifies the strength of the applied constraint. With this maximum-likelihood-based optimization, the latent variable \(\mathbf{z}\) can reflect the character of the ground truth data with lower error. Following the β-VAE model31, we can rewrite the problem in Lagrangian form:

$$\mathcal{F}(\boldsymbol{\theta},\boldsymbol{\phi},\beta,\mathbf{x},\mathbf{z})={\mathbb{E}}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log (p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}))\right]-\beta \left(\mathcal{KL}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\,||\,p(\mathbf{z}))-\epsilon \right)$$
(5)
where \(\beta\) is the regularization coefficient of the constraint, which limits the capacity of \(\mathbf{z}\) and, owing to the isotropic nature of the Gaussian prior \(p(\mathbf{z})\), imposes an implicit pressure towards independence in the learned posterior distribution. Different values of \(\beta\) alter the level of learning pressure imposed during training, encouraging the learning of different representations. We assume a disentangled representation of the conditionally independent data generative factors \(\mathbf{v}\) and therefore set \(\beta > 1\) to apply a stronger constraint on the latent information bottleneck than in the original VAE. These constraints restrict the capacity of \(\mathbf{z}\) and, combined with the pressure to maximize the log-likelihood of the training data \(\mathbf{x}\), encourage the model to learn the most efficient representation of the data.
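A minimal sketch of the β-VAE objective described above (reconstruction term, KL term scaled by β > 1, and the cell-proportion MAE constraint of Eq. (1)), assuming PyTorch; the class name BetaVAE, the layer sizes and the proportion head are illustrative simplifications, not the exact BulkTrajBlend generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    """Minimal β-VAE: an auxiliary head predicts the cell-type proportion v from the latent code."""
    def __init__(self, n_genes, n_latent=64, n_celltypes=10, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_latent)
        self.logvar = nn.Linear(hidden, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, hidden), nn.ReLU(), nn.Linear(hidden, n_genes))
        self.prop_head = nn.Sequential(nn.Linear(n_latent, n_celltypes), nn.Softmax(dim=-1))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.dec(z), self.prop_head(z), mu, logvar

def beta_vae_loss(x, x_hat, v, v_hat, mu, logvar, beta=4.0):
    recon = F.mse_loss(x_hat, x, reduction="sum")                 # log-likelihood (reconstruction) term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    mae = torch.sum(torch.abs(v - v_hat))                         # proportion constraint (Eq. 1)
    return recon + beta * kl + mae                                # β > 1 tightens the bottleneck

# Usage: x_hat, v_hat, mu, logvar = model(x); loss = beta_vae_loss(x, x_hat, v, v_hat, mu, logvar)
```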
Computation of the single-cell neighborhood graph

We used the scanpy.pp.neighbors function from Scanpy to compute the cell neighborhood graph. For a detailed mathematical description, please refer to the relevant papers and documentation of nearest neighbor descent in Scanpy and PyNNDescent62.

Community detection and generation of overlapping cell communities

We performed community detection on the cell neighborhood graph using a graph neural network (GNN) model to find overlapping cell communities33. A GNN can learn relationships between nodes and assign them to different communities based on their similarities. Specifically, we used a GCN, one of the basic GNN models, to generate an affinity matrix \(F\) that represents the degree of association between cells:

$$F:={\mathrm{GNN}}_{\boldsymbol{\theta}}(A,X)$$

(6)
Here, \(A\) is the adjacency matrix of the cell neighborhood graph and \(X\) encodes the cell type as the node feature. To ensure non-negativity of \(F\), we applied an element-wise ReLU activation function to the output layer. The GCN architecture is defined as:

$$F:={\mathrm{GCN}}_{\boldsymbol{\theta}}(A,X)=\mathrm{ReLU}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A}XW^{(1)}\right)W^{(2)}\right)$$
(7)
Here, \(\hat{A}={\hat{D}}^{-\frac{1}{2}}\widetilde{A}{\hat{D}}^{-\frac{1}{2}}\) is the normalized adjacency matrix, \(\widetilde{A}=A+{I}_{N}\) is the adjacency matrix with self-loops, and \({\hat{D}}_{{ii}}={\sum }_{j}{\widetilde{A}}_{{ij}}\) is the diagonal degree matrix of the adjacency matrix with self-loops. We considered other GNN architectures and deeper models but did not observe significant improvements. Two main differences between our model and the standard GCN are: (1) batch normalization applied after the first graph convolutional layer, and (2) L2 regularization applied to all weight matrices. We found that both modifications significantly improved the performance.We measured the fit between the generated affinity matrix \(F\) and the neighborhood graph using the negative log-likelihood function of the Bernoulli-Poisson model:$$-\,\log p(A|F)=-{\sum}_{(u,v)\in E}\log (1-\exp (-{F}_{u}{F}_{v}^{T}))+{\sum}_{(u,v)\notin E}{F}_{u}{F}_{v}^{T}$$
(8)
Here, \(E\) represents the set of edges in the graph. Since neighborhood graphs of single-cell data are typically sparse, the second sum (over non-edges) contributes much more to the loss. To balance the two terms, we adopted a standard technique known as balanced classification18 and defined the loss function as follows:

$$\mathrm{L}(F)=-{\mathbb{E}}_{(u,v)\sim {P}_{E}}\left[\log \left(1-\exp \left(-{F}_{u}{F}_{v}^{T}\right)\right)\right]+{\mathbb{E}}_{(u,v)\sim {P}_{N}}\left[{F}_{u}{F}_{v}^{T}\right]$$
(9)
Here, \({P}_{E}\) and \({P}_{N}\) represent uniform distributions over edges and non-edges, respectively. Instead of directly optimizing the affinity matrix \(F\) as in traditional methods, we search for the optimal neural network parameters \({\boldsymbol{\theta}}^{*}\) that minimize the (balanced) negative log-likelihood:

$${\boldsymbol{\theta}}^{*}=\arg \min_{\boldsymbol{\theta}}\;\mathrm{L}\left({\mathrm{GCN}}_{\boldsymbol{\theta}}(A,X)\right)$$
(10)
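A minimal sketch of the two-layer GCN of Eq. (7) and the balanced Bernoulli-Poisson loss of Eq. (9), assuming PyTorch and a dense adjacency matrix; CommunityGCN, normalize_adj and the edge/non-edge sampling scheme are illustrative assumptions rather than the exact implementation. The L2 regularization of the weight matrices mentioned above could be added through the optimizer's weight_decay.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adj(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + torch.eye(A.shape[0])
    d_inv_sqrt = torch.diag(A_tilde.sum(1).pow(-0.5))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt

class CommunityGCN(nn.Module):
    """Two-layer GCN producing a non-negative affinity matrix F (cells x communities)."""
    def __init__(self, n_features, n_hidden, n_communities):
        super().__init__()
        self.W1 = nn.Linear(n_features, n_hidden, bias=False)
        self.W2 = nn.Linear(n_hidden, n_communities, bias=False)
        self.bn = nn.BatchNorm1d(n_hidden)  # batch norm after the first graph convolution

    def forward(self, A_hat, X):
        H = F.relu(self.bn(A_hat @ self.W1(X)))
        return F.relu(A_hat @ self.W2(H))

def balanced_bp_loss(Fmat, A, n_samples=2000):
    """Balanced negative log-likelihood of the Bernoulli-Poisson model (Eq. 9),
    estimated on equal numbers of sampled edges and non-edges."""
    edges = A.nonzero(as_tuple=False)
    non_edges = (A == 0).nonzero(as_tuple=False)
    e = edges[torch.randint(len(edges), (n_samples,))]
    ne = non_edges[torch.randint(len(non_edges), (n_samples,))]
    score_e = (Fmat[e[:, 0]] * Fmat[e[:, 1]]).sum(1)
    score_ne = (Fmat[ne[:, 0]] * Fmat[ne[:, 1]]).sum(1)
    loss_edges = -torch.log(1 - torch.exp(-score_e) + 1e-8).mean()
    loss_non_edges = score_ne.mean()
    return loss_edges + loss_non_edges
```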
Through these steps, the BulkTrajBlend model computes overlapping communities in single-cell data, which can be used to infer the “omitted” cells missing from the original single-cell data. This helps reveal cell type transitions and dynamics, and supports modeling and analysis of cell developmental trajectories.

Community trajectory inference

Here, we inserted the overlapping communities of target cells into the original single-cell data and used PyVIA to infer pseudo-temporal trajectories of cell differentiation. For detailed inference methods, please refer to the mathematical description of PyVIA. Alternatively, researchers can use CellRank for community trajectory re-inference.

CGAN and ACGAN model description

CGAN (Conditional Generative Adversarial Nets) is a GAN (Generative Adversarial Nets)-based model that generates data by training a generator and a discriminator with the data and corresponding labels. The training process can be split into two parts. In the first part, latent variables \(\mathbf{z}\in \mathbb{R}^{M}\) (\(M=100\)) are drawn from a standard normal distribution and, together with generated class labels \({l}_{g}\), are input into the generator to obtain the generated data. The generator can be summarized as a function \({g}_{\boldsymbol{\theta}}\), where \(\boldsymbol{\theta}\) are the parameters of an MLP with six layers, each of which is normalized; the hidden dimensions are 128, 256, 512 and 1024, and the activation function is LeakyReLU. After obtaining the generated data \(\mathbf{g}={g}_{\boldsymbol{\theta}}(\mathbf{z},{l}_{g})\), a discriminator \({d}_{\boldsymbol{\phi}}\), where \(\boldsymbol{\phi}\) are the parameters of an MLP with four layers, each with a hidden dimension of 512, a dropout rate of 0.4 and LeakyReLU activation, judges whether \(\mathbf{g}\) accords with its label \({l}_{g}\). In the second part, \({d}_{\boldsymbol{\phi}}\) is trained on the real data \(\mathbf{r}\) and its label \({l}_{r}\) with the Adam optimizer to improve its judgement, and the loss of \({g}_{\boldsymbol{\theta}}\) as judged by \({d}_{\boldsymbol{\phi}}\) is then used to enhance the generative ability of \({g}_{\boldsymbol{\theta}}\) with the same optimizer. The loss functions for \({g}_{\boldsymbol{\theta}}\) and \({d}_{\boldsymbol{\phi}}\) are both MSE losses, and the losses on the generated data and the real data are each weighted by 0.5. In addition, ACGAN (Auxiliary Classifier GAN), which makes the generated data more authentic, keeps the same generator structure as CGAN but adds a classifier to the output of the discriminator that predicts the label of the input data. During training, the loss function for the added classifier is cross-entropy.
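A minimal sketch of the conditional generator and discriminator described above, assuming PyTorch; the label-embedding scheme and exact layer counts are simplified, and the class names are illustrative.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Conditional generator: latent z (dim 100) plus class label -> gene expression vector."""
    def __init__(self, n_genes, n_classes, latent_dim=100):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        dims = [latent_dim + n_classes, 128, 256, 512, 1024]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.LeakyReLU(0.2)]
        layers += [nn.Linear(dims[-1], n_genes)]
        self.net = nn.Sequential(*layers)

    def forward(self, z, labels):
        return self.net(torch.cat([z, self.label_emb(labels)], dim=1))

class Discriminator(nn.Module):
    """Conditional discriminator: expression vector plus label -> realness score."""
    def __init__(self, n_genes, n_classes, hidden=512):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(n_genes + n_classes, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.4),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.4),
            nn.Linear(hidden, 1))

    def forward(self, x, labels):
        return self.net(torch.cat([x, self.label_emb(labels)], dim=1))

# Training alternates: update the discriminator on real and generated pairs with MSE loss
# (each weighted by 0.5), then update the generator to fool the discriminator (Adam optimizer).
# ACGAN additionally attaches a classification head to the discriminator trained with cross-entropy.
```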
Data pre-processing

All single-cell data used for BulkTrajBlend training underwent the same quality control steps: cells with low sequencing counts (<1000) or a high mitochondrial fraction (>0.2) were excluded from further analysis. The filtered count matrix was normalized by dividing the counts of each cell by the total molecule counts detected in that cell and log-transformed with the Python library Scanpy63. All bulk RNA-seq data were normalized using DESeq2 and log-transformed with “numpy.log1p” from Python’s NumPy64 package. It is worth noting that both bulk and single-cell data use raw counts during the AE estimation of the cell fraction state space, whereas both use normalized and log-transformed data during training of the β-VAE.

Performance evaluation

To evaluate the generation and interpolation performance of our model, we conducted a comprehensive analysis along six critical dimensions (a sketch of the marker-gene metrics follows this list):

(1) The count of interpolated cells: we counted the number of cells that were ultimately used to interpolate into the raw single-cell profile.

(2) The correlation in marker gene expression between interpolated and authentic cells: we first used Scanpy’s “scanpy.tl.rank_genes_groups” function to calculate the marker genes of each cell subpopulation in the raw single-cell profile (taking the top 200 marker genes). We then used the Pearson coefficient to calculate the expression correlation of these 200 marker genes between the generated single-cell profile and the raw single-cell profile.

(3) Marker gene similarity: we first used Scanpy’s “scanpy.tl.rank_genes_groups” function to calculate the marker genes of each cell subpopulation (taking the top 200 marker genes) in the raw single-cell profile and the generated single-cell profile, respectively. Treating marker genes as words and all the marker genes of each cell class as a sentence, we used cosine similarity to calculate the similarity of marker genes for each cell subpopulation.

(4) Transition probabilities post-interpolation: we first used “omicverse.pp.scale” and “omicverse.pp.pca” in omicverse to compute the principal components (PCA) of the single-cell profile. We took the first 50 principal components and used Scanpy’s “scanpy.pp.neighbors” to compute the neighborhood graph of the single-cell profile. We then calculated the pseudotime developmental trajectory of the single-cell profile with pyVIA and, taking pseudotime as the prior time and the neighborhood graph as input to “omicverse.utils.cal_paga”, calculated the state-transition confidence for each cell subpopulation.

(5) The number of noise clusters: we used “scanpy.tl.leiden” in Scanpy to perform unsupervised clustering of the generated single-cell profiles with the resolution set to 1.0, identified clusters with fewer than 25 cells as noisy clusters, and counted their number as an assessment of generation quality.

(6) Density assessment of pseudotime: after obtaining the pseudotime of single-cell profiles using pyVIA with default parameters (specifically, setting K to 15 in the KNN neighborhood graph and configuring use_rep to X_pca), we assessed the variance of the pseudotime of the target interpolated cells as one of the metrics for evaluating developmental trajectory reconstruction.
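A minimal sketch of evaluation dimensions (2) and (3), assuming Scanpy and two AnnData objects (raw and generated profiles) sharing a cell-type annotation; the function names and the binary-membership form of the cosine similarity are illustrative assumptions rather than the exact evaluation code.

```python
import numpy as np
import scanpy as sc
from scipy.stats import pearsonr

def marker_gene_correlation(adata_raw, adata_gen, cell_type, groupby="cell_type", n_top=200):
    """Pearson correlation of mean expression of the raw profile's top marker genes
    between raw and generated cells of one subpopulation (dimension 2)."""
    sc.tl.rank_genes_groups(adata_raw, groupby=groupby)
    markers = sc.get.rank_genes_groups_df(adata_raw, group=cell_type).head(n_top)["names"]
    markers = [g for g in markers if g in adata_gen.var_names]
    raw_mean = np.asarray(adata_raw[adata_raw.obs[groupby] == cell_type, markers].X.mean(axis=0)).ravel()
    gen_mean = np.asarray(adata_gen[adata_gen.obs[groupby] == cell_type, markers].X.mean(axis=0)).ravel()
    return pearsonr(raw_mean, gen_mean)[0]

def marker_set_cosine_similarity(adata_raw, adata_gen, cell_type, groupby="cell_type", n_top=200):
    """Cosine similarity between the marker-gene 'sentences' of raw vs. generated cells
    (dimension 3), computed on binary membership vectors over the union of both gene sets."""
    for ad in (adata_raw, adata_gen):
        sc.tl.rank_genes_groups(ad, groupby=groupby)
    raw_set = set(sc.get.rank_genes_groups_df(adata_raw, group=cell_type).head(n_top)["names"])
    gen_set = set(sc.get.rank_genes_groups_df(adata_gen, group=cell_type).head(n_top)["names"])
    vocab = sorted(raw_set | gen_set)
    a = np.array([g in raw_set for g in vocab], dtype=float)
    b = np.array([g in gen_set for g in vocab], dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```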
Datasets

Dentate Gyrus

Single-cell RNA-seq: data from Hochgerner et al.65. The dentate gyrus (DG) is part of the hippocampus involved in learning, episodic memory formation and spatial coding. The experiment on the developing DG comprises two time points (P12 and P35) measured using droplet-based scRNA-seq (10x Genomics Chromium). The dominant structure is the granule cell lineage, in which neuroblasts develop into granule cells. Simultaneously, the remaining populations form distinct cell types that are either fully differentiated (e.g., Cajal-Retzius cells) or form a sub-lineage (e.g., GABA cells) (Accession ID GSE95753).

Bulk RNA-seq: data from Cembrowski et al.66. The dentate gyrus (DG) was measured by RNA sequencing (RNA-seq) to produce a quantitative, whole-genome atlas of gene expression for every excitatory neuronal class in the hippocampus, namely granule cells and mossy cells of the dentate gyrus, and pyramidal cells of areas CA3, CA2 and CA1 (Accession ID GSE74985).

Pancreatic endocrinogenesis

Single-cell RNA-seq: data from Bastidas-Ponce et al.67. Pancreatic epithelial and Ngn3-Venus fusion (NVF) cells during the secondary transition, with transcriptome profiles sampled at embryonic day 15.5. Endocrine cells are derived from endocrine progenitors located in the pancreatic epithelium. Endocrine commitment terminates in four major fates: glucagon-producing α-cells, insulin-producing β-cells, somatostatin-producing δ-cells and ghrelin-producing ε-cells (Accession ID GSE132188).

Bulk RNA-seq: data from Bosch et al.68. RNA sequencing was performed on pancreatic islets (islets of Langerhans) from mice on a PLX5622 or control diet for 5.5 or 8.5 months (Accession ID GSE189434).

Human bone marrow

Single-cell RNA-seq: data from Setty et al.69. The bone marrow is the primary site of new blood cell production, or haematopoiesis; it is composed of hematopoietic cells, marrow adipose tissue and supportive stromal cells. This dataset served to detect important landmarks of hematopoietic differentiation, to identify key transcription factors that drive lineage fate choice and to closely track when cells lose plasticity (https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45).

Bulk RNA-seq: data from Myers et al. (2018). RNA-seq of CD34+ bone marrow progenitors from healthy donors (Accession ID GSE118944).

Maturation of murine liver

Single-cell RNA-seq: data from Liang et al.70. A total of 52,834 single-cell transcriptomes, collected from newborn to adult livers, were analyzed. The authors observed dramatic changes in cellular composition during postnatal liver development and characterized the establishment of hepatocyte and sinusoidal endothelial cell zonation at single-cell resolution. We selected Pro-B, Large Pre-B, Small Pre-B, B, HPC, GMP, iNP, imNP, mNP, Basophil, Monocyte, cDC1, cDC2, pDC, aDC, Kupffer, Proerythroblast, Erythroblast and erythrocyte cells (annotations can be found in the metadata of Liang et al.) to perform HPC differentiation analysis (Accession ID GSE171993).

Bulk RNA-seq: data from Renaud et al.71. Gene expression patterns were analyzed in the developing mouse liver over 12 distinct time points, from the late embryonic stage (2 days before birth) to maturity (60 days after birth), with three replicates per time point (Accession ID GSE58827).
Construction of simulated “omission” single-cell profiles

To simulate cell “omission” in single-cell sequencing, we conducted cell dropout experiments across diverse datasets. In the Pancreas dataset, we applied Leiden clustering and manually excluded the clusters corresponding to Ngn3 high EP, reducing the confidence of the transition from Ngn3 high EP to Pre-endocrine to 0. In the Dentategyrus dataset, we applied Leiden clustering and manually removed the clusters corresponding to Granule Immature, reducing the confidence of the transition from Granule Immature to Granule Mature to 0. In the BoneMarrow dataset, we randomly eliminated 80% of the HSC-2 cells, reducing the confidence of the transition from HSC-2 to Monocyte-2 to 0.

To use BulkTrajBlend to generate the “omitted” cells in these datasets, we generated single-cell data from the bulk RNA-seq data with BulkTrajBlend and filtered out noisy cells using the size of the Leiden clusters as a constraint. When configuring the model for the different datasets, we set the hyperparameter “cell_target_num” to 1.5 times, 1 time and 6 times the number of cells of the dropped-out cell type for Pancreas, Dentategyrus and BoneMarrow, respectively. BulkTrajBlend then calculated the overlapping cell types in the generated single-cell data, and we annotated the overlapping cell communities. Specifically, we selected the single-cell data in which the dropped-out cell types were associated with adjacent cell types.
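A minimal sketch of the dropout simulation, assuming Scanpy, an annotated AnnData object and a pre-computed PCA embedding (X_pca); simulate_omission and its parameters are illustrative assumptions.

```python
import scanpy as sc

def simulate_omission(adata, drop_label, groupby="cell_type", frac_keep=0.0, random_state=0):
    """Simulate cell 'omission' by removing all (or most) cells of one annotated population,
    e.g. drop_label='Ngn3 high EP' in Pancreas, or keep only 20% of 'HSC-2' in BoneMarrow."""
    mask = adata.obs[groupby] == drop_label
    if frac_keep > 0:
        keep_idx = adata.obs_names[mask].to_series().sample(frac=frac_keep, random_state=random_state)
        mask = mask & ~adata.obs_names.isin(keep_idx)
    adata_dropped = adata[~mask].copy()
    # Re-compute the neighborhood graph and PAGA transition confidences on the depleted profile
    sc.pp.neighbors(adata_dropped, n_neighbors=15, use_rep="X_pca")
    sc.tl.paga(adata_dropped, groups=groupby)
    return adata_dropped
```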

Methods of OmicVerse integration

We unified the downstream analyses of bulk RNA-seq and single-cell RNA-seq in OmicVerse. Since the downstream analyses are independent of the parameter evaluation of BulkTrajBlend, and the analysis modules of each part are independent of each other, the datasets and methods used in each part are provided in the Supplementary Information; an index is given here.

(1) Bulk RNA-seq: all selected datasets, parameter settings and methods can be found in Supplementary Note 4.

(2) scRNA-seq: all selected datasets, parameter settings and methods can be found in Supplementary Note 5.

(3) Multi-omics: all selected datasets, parameter settings and methods can be found in Supplementary Note 6.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
