Building a learnable universal coordinate system for single-cell atlas with a joint-VAE model

Overview of UniCoordWe developed UniCoord to learn key features that represent cellular heterogeneities from scRNA-seq data. We devised a specially tuned joint-VAE model to represent the transcriptomic profile of a single cell in a low-dimensional latent space. A conventional VAE model31 includes an encoder and a decoder (Fig. 1, see “Methods” for details). The encoder transforms the transcriptomic profile of a single cell into a set of means to represent the features about cellular heterogeneities, and a set of variances to deal with uncertainty. The decoder samples from the latent space according to the distribution defined by the means and variances learned by the encoder, and then transformed the sample into a generated transcriptomic profile.Fig. 1: The schematic diagram of UniCoord.The encoder transforms the transcriptomic profile of a single cell into a low-dimensional latent space. The decoder samples from the latent space and transformed the sample into a generated transcriptomic profile. The latent dimensions can either be interpretable (supervised) or unsupervised, and can either be discrete or continuous. The latent dimensions can be reconfigured to generate pseudo transcriptomic profiles with desired properties.The goal of UniCoord is to represent the transcriptomic profiles of cells in the latent space with interpretability. We thus designed each dimension in the latent space to be either supervised by prior knowledge or unsupervised. We trained the supervised dimensions to capture information corresponding to well-defined features of the cell, such as the activity of a biological pathway, the differentiation stage, or the clinical diagnosis of the cell’s donor. We trained the unsupervised dimensions to capture complementary, yet unknown information of the cell, such as an unsupervised classification or score of cells. Through this approach, the latent dimensions can be regarded as the universal coordinates of cells.Both discrete and continuous features can be used to represent cellular heterogeneities. For instance, the cell type and sequencing platform are commonly considered discrete features in RNA-seq studies. The spatial, temporal, and functional gradients are continuous features of vital importance to the analysis of biological processes. We proposed a model based on joint-VAE32, a disentangled representation framework that can deal with both discrete and continuous features in a single model, to handle these multifaceted heterogeneities. We considered two main aspects of loss in model training. We considered the reconstruction loss between the original and constructed data, which guaranteed the accuracy of the VAE model. Besides, we designed different loss functions for different forms of supervised features to make these features consistent with prior knowledge (see “Methods” for details).Generating pseudo-single-cell data by in silico reconfigurationAbundant transcriptomic profiles of cells are essential in the analysis of physiological and pathological processes. However, the cell numbers in existing data are often not sufficient, and it is challenging to experimentally observe certain cell types or states, especially for intermediate states. We thus introduced a feature of UniCoord named in silico reconfiguration to generate cell pseudo-single-cell transcriptomic profile data with desired features. After training, we can modify the value of any latent dimension of any cell to the value we expect (such as altering a sequencing platform to another), and then use UniCoord to generate a new transcriptomic profile. By this way, we can “reconfigure” the cell into a new one with desired properties.We first evaluated in silico reconfiguration of discrete features, such as the sequencing platforms and cell types. We trained a UniCoord model with the lung data in the human Ensemble Cell Atlas (hECA)4, and used cell types, sequencing platforms, and unsupervised continuous latent dimensions as the latent dimensions. The original scRNA-seq data derived from different sequencing platforms were separated in the Uniform Manifold Approximation and Projection (UMAP) plot (Fig. 2a), which is mainly due to the distinct distribution of data such as the median number of expressed genes in each cell (Supplementary Fig. 1a). We used the latent representation of original cells as the seed and reconfigured the mean value of the latent dimension corresponding to the sequencing platform into “10X”. We then sampled random variables from the modified distribution and used the decoder to generate pseudotranscriptomic profiles according to these sampled random variables. We found that the generated cells were joined together and clustered well by cell types (Fig. 2b). These generated cells showed similar distributions of the number of expressed genes, regardless the platform that their corresponding original cells were derived from (Supplementary Fig. 1b). We then reconfigured the cell types of all cells into a specified one. When we reconfigured the cell type as T cells and the sequencing platform as “10X”, the generated data highly expressed markers of T cells, including PTPRC, CD3E, and TRAC (Fig. 2c, d). When we reconfigured the cell type as fibrocytes and the sequencing platform as “10X”, these immune-related markers were turned off, and extracellular matrix markers such as DCN and COL1A1 were highly expressed (Fig. 2e, f). We also reconfigured the cell type as T cells/CD8 T cells and the sequencing platform as “Microwell-seq” of all cells. The generated cells also showed to be similar to the corresponding original cells (Supplementary Fig. 1c, d).Fig. 2: In silico reconfiguration with UniCoord generates cells with designated features.a UMAP plots showing the original hECA lung cells colored by cell types (left) or sequencing platforms (right). b UMAP plots showing the generated hECA lung cells with the sequencing platform reconfigured into “10X”, colored by cell types (left) and original sequencing platforms (right). c The UMAP plot of original cells (blue) and cells with the cell type reconfigured into T cells (yellow). d Expression levels of T-cell and fibrocyte markers in cells with the cell type reconfigured into T cells. e The UMAP plot of original cells (blue) and cells with the cell type reconfigured into fibrocytes (yellow). f Expression levels of T-cell and fibrocyte markers in cells with the cell type reconfigured into fibrocytes. g Vascular endothelial cells from four mouse organs, colored by tissues (left), vessel types (middle), or normalized artery-capillary-vein zonation scores (right). The zonation scores were normalized to be between 0 and 1. h UMAP plots of original cells (blue) and cells with zonation scores reconfigured into different values (yellow). For both models, all genes in the dataset were adopted to perform the experiments, and the number of unsupervised latent dimensions was set as 50. a, b All genes were used for visualization. c, e, g, h Highly variable genes (HVGs) were used for visualization. c, e, h For each generated cell, we calculated its nearest neighbor in the original dataset for visualization.In silico reconfiguration can also be applied to continuous features. To demonstrate this, we trained a UniCoord model with the data of vascular endothelial cells from four tissues (brain, heart, liver, and testis) in a mouse endothelial atlas33, with tissue origin, zonation scores, and unsupervised continuous latent dimensions as the latent dimensions. The original study discovered an artery-capillary-vein trajectory in vascular endothelial cells and raised a zonation score to describe the location of a cell on the trajectory. Thus, we used the information of tissues, zonation scores (normalized to be between 0 and 1), and unsupervised continuous latent dimensions as the latent dimensions. The result showed that UniCoord successfully reconstructed the information of tissues as well as the zonation trajectory (Fig. 2g). We then reconfigured the zonation scores of all cells and generated pseudo vascular endothelial cells. The generated cells were accurately placed at the designated locations on the trajectory (Fig. 2h and Supplementary Movie 1). These results demonstrated the advantage of in silico reconfiguration with UniCoord in generating pseudo cells with desired properties, which could help analyze and integrate datasets from different sources.Interpolating timestamps or spatial coordinates to fill data gapsCell state transitions are always continuous, but it is impossible to obtain continuous observation of cellular transcriptomic profiles through high-throughput sequencing technologies. One common approach to obtain time-series single-cell transcriptomic data is to measure samples at different time points. However, there still exist inevitable gaps between time points. We thus used the in silico reconfiguration approach to obtain a more continuous and comprehensive view of cell state transitions by timestamp interpolation. We trained a UniCoord model with a mouse iPSC reprogramming dataset34. In this dataset, cells were treated with doxycycline to induce mouse embryonic fibroblasts (MEFs) de-differentiating into iPSCs, and were then transferred to either serum-free 2i medium or serum medium on day 8 (Fig. 3a). Cells in the serum medium are more likely to re-differentiate into stromal cells and neurons (Fig. 3b, c). The dataset covered a total of 18 days at half-day intervals, forming a discrete trajectory (Fig. 3a). We used timestamps and unsupervised continuous latent dimensions as the latent dimensions. After training, we sampled cells from each time point and used their latent representations as seeds. We reconfigured the latent dimension denoting the timestamp of each sampled cell by adding the original value with a uniformly random variable between –0.5 to 0.5 day. Through this approach, we interpolated the missing time points and created a continuous trajectory (Fig. 3d). We found that the trend in the evolution of cell types became more evident (Fig. 3f). Furthermore, though we did not encode any information about the experimental design, the interpolated results showed clear subgroups of cells with different treatments (Fig. 3e).Fig. 3: UniCoord interpolated discrete timestamps or spatial coordinates into continuous trajectories.a–c Mouse iPSC reprogramming data visualized by the low-dimensional visualization provided by the original study, cells colored by sampling days (a), treatments (b), and cell types (c). d–f UniCoord-interpolated mouse iPSC reprogramming data, cells colored by continuous sampling time (d), treatments (e), or cell types (f). Dox doxycycline, iPS induced pluripotent stem cells, MET cells undergoing a mesenchymal-to-epithelial transition, OPC oligodendrocyte precursor cells, RG radial glial cells. g PCCs between the restored data of day 1 and the mean gene expression of the original data. h Mean PCCs between restored data of each timestamp and the mean gene expression of the original data. g, h Data of a timestamp were excluded, and then the unclassified cells of this timestamp were restored by reconfiguring the timestamp of all other unclassified cells as this timestamp. Cells in days 0–8 were restored and compared in this experiment. i, j UMAP plots showing real (i) and interpolated (j) CMs from the left ventricle. The corresponding layer number was shown in the legend of (i). For both models, all genes in the dataset were adopted to perform the experiments, and the number of unsupervised latent dimensions was set as 50. HVGs were used for visualization all UMAP plots. k PCCs between restored data and the mean gene expression of the original data. Data of each sample layer are excluded, respectively, and then restored by reconfiguring cells of all other sample layers. The reconfigured target timestamp/sample layer is shaded in gray in (g, k).We then investigated whether UniCoord can restore the transcriptomic profiles of cells unseen in training. We excluded the data with a certain timestamp during UniCoord training and tried to restore the excluded data by timestamp reconfiguration (see “Methods” for details). We first excluded the data of day 1 and restored the cells with the “unclassified” label by reconfiguring the timestamp of all other unclassified cells as day 1. We calculated the mean gene expression of the unclassified cells of each day and compared the restored data with them by the Pearson’s correlation coefficient (PCC). As shown in Fig. 3g, the restored unclassified cells showed to be most similar to the original unclassified cells of day 1.5. We performed similar experiments to restore unclassified cells from day 0 to day 8. We found that though the restored data showed a high PCC with the original data with the corresponding timestamp, they showed to be most similar to the original data with adjacent timestamps (Fig. 3h). This may be due to the absence of the corresponding timestamp data in the training dataset preventing the model from precisely capture the trend of the data and preferring to infer it from data with adjacent timestamps.UniCoord can also be applied to reconstruct spatial trajectories by interpolating spatial coordinates. In our recent work on human heart cell atlas28, we sampled cardiomyocytes (CMs) from four layers of the left ventricle with different depths, and found these CMs from different layers exhibited distinct characteristics (Fig. 3i). We used UniCoord to interpolate the continuous change between these layers. We trained a UniCoord model with the data using the information of sample layers and unsupervised continuous latent dimensions as the latent dimensions, where the four sample layers are treated as numeric layer 1–4, respectively. After training, we sampled cells from each layer, and reconfigured the latent dimension denoting the layer information by adding the original value with a uniformly random variable between –1 to 1. We generated pseudo cells with this approach and found that the generated data formed a continuous spatial trajectory (Fig. 3j). We then excluded data from layers 1, 2, 3, and 4, respectively, and used UniCoord to restore these unseen data by reconfiguring cells of all other sample layers (“Methods”). As shown in Fig. 3k, the restored data of layers 1, 2, and 3 showed most similar to the original data for the corresponding layer. The restored data of layer 4 showed comparable similarity to the original data of layers 3 and 4, which may be due to the fact that the extrapolation task is more difficult than the interpolation task. All the results demonstrated the versatility and potential of UniCoord in reconstructing spatial and temporal trajectories by interpolating timestamps or spatial coordinates, allowing for more comprehensive and accurate analyses of complex biological processes.Pre-training UniCoord model with cell atlas data for analyzing disease-related cellsWe applied UniCoord on the hECA data4 which has a total of 1.09 million cells to represent the cellular heterogeneity in the atlas. We randomly sampled 50,000 cells from the dataset and used these cells to train a UniCoord model. We used three aspects as the latent dimensions: cell types, sequencing platforms, and biological process (BP) features that represent the information of specific biological processes. To calculate BP features, we first applied AUCell35 on the transcriptomic profiles of the training cells to convert gene expression levels into the activity strengths of Gene Ontology Biological Process (GOBP) terms36,37. We then trained a random forest model that used these strengths to classify cell types, and identified the top 100 GOBP terms with the highest important scores. After, we clustered these terms into 30 groups and selected the term that showed the activity strength in each group. These selected GOBP terms were regarded as the key GOBPs (Supplementary Fig. 2 and Supplementary Table S1), and the AUCell scores of these key GOBPs were regarded as values of BP features (see “Methods” for details).We took the UniCoord model trained by hECA data as a pretrained model to analyze data that were not included in the training dataset. We used the model to represent cells in a hepatocellular carcinoma (HCC) dataset38. The dataset contains 56,721 cells from 46 distinct liver tumor samples (Fig. 4a, b). As shown in Fig. 4c, d, cells represented by the UniCoord model were less sensitive to the batch of samples. UMAP visualizations of the GOBP latent dimensions of these cells showed similar results (Supplementary Fig. 3b), suggesting that the batch effects were eliminated during the calculation of GOBP activity strength. However, when visualizing the representation results of malignant cells, we can still find some differences among different patients (Supplementary Fig. 3c, d), suggesting that the heterogeneities of malignant cells were not totally overlooked by the UniCoord model.Fig. 4: Analyze HCC data using the UniCoord model pretrained by hECA data.a, b The UMAP plot showing the landscape of the HCC dataset, represented by PCA. Cells are colored by cell types (a) or sample IDs (b). c–e The UMAP plot showing the landscape of the HCC dataset, represented by the pretrained UniCoord model. Cells are colored by cell types (c), sample IDs (d), and UniCoord-predicted cell types (e). f The relations between original labels (top) and cell types predicted by UniCoord (bottom). Protein coding genes in the dataset were adopted to perform the experiments. All genes were used for visualization all UMAP plots.We then used UniCoord to annotate cell types of the HCC data (Fig. 4e). We found that the cell types unseen in hECA data were predicted as the related cell types (Fig. 4f). For example, cancer-associated fibroblasts (CAFs) were mainly annotated as smooth muscle cells as they share several important markers such as α-smooth muscle actin39. Besides, tumor-associated macrophages (TAMs) and tumor endothelial cells (TECs) were mainly annotated as their corresponding progenitor cell type, respectively. Interestingly, malignant cells were predicted as “others” which is a mixture of all cell types with small quantities in hECA. This suggested that the pretrained UniCoord model successfully discovered that malignant cells were different from any cell type encoded in the model. Besides, B cells in the HCC data were also mainly predicted as “others”, suggesting that the state of these B cells was deviated from healthy B cells.We investigated the BP features in the representation of the HCC data (Supplementary Fig. 3a, see “Methods” for details). We found that BP features such as B cell receptor signaling pathway, phagocytosis & recognition, and peptide cross-linking were highly activated in B cells. Connective tissue development, muscle cell development, and extracellular matrix organization were highly activated in CAFs. These BP features are highly related to the functions of the corresponding cell type. Malignant cells exhibited the lowest score of positive regulation of cell killing, which is consistent with their uncontrolled proliferation. The results showed the feasibility of UniCoord as an interpretable pretrained model for representing and decoding complex cell heterogeneities.We then investigated whether UniCoord can capture patient-specific information. We trained a UniCoord model with patient IDs and cell types as latent dimensions. We analyzed the five patients with the highest number of malignant cells. We generated new cells by reconfiguring the patient ID of malignant cells from patient H72 as H70. We calculated the correlation between each generated cell and the mean gene expression (Supplementary Fig. 3e). It can be seen that although the generated cells were reconfigured from H72, they showed much higher correlation with H70. This result indicated that UniCoord can be used to identify patient-specific features.

Building a learnable universal coordinate system for single-cell atlas with a joint-VAE model

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Multi-output prediction of dose–response curves enables drug repositioning and biomarker discovery

Hot Topics

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Related Articles

Balancing Act: Pregnancy and Bipolar Disorder

Cohesion at the cellular level: flexible yet stable

Gut bacteria influence responses to immunotherapy in patients with asbestos related cancer

Quick Links

Must Read

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis

Chemistry wordoku #062 | Puzzle

Popular Articles

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Poisoning medical knowledge using large language models

Semi-supervised recognition for artificial intelligence assisted pathology image diagnosis