Modal-nexus auto-encoder for multi-modality cellular data integration and imputation

Model architecture and experimentsMonae is an unsupervised learning framework composed of multiple autoencoders, as shown in Fig. 2. Monae builds a coarse-grained graph (guidance) based on the regulatory relationships between different modalities. The nodes in the graph correspond to an independent feature in a certain modality. The graph VAE in Monae learns the representation of each node, depicted in Fig. 2a. Within Monae, distinct autoencoders represent individual modalities, and their asymmetric networks generate the positive and negative samples essential for contrastive learning (Fig. 2b). Contrastive learning within Monae is enhanced by the incorporation of adaptive clustering, augmenting the distinction among cell clusters (Fig. 2c). Monae fuses cell representation from the joint space with modal-nexus node features to compute imputation counts. Through iterative epochs, Monae acquires graph-linked and contrastive cell embeddings, with the latter serving as the integrative cell representation (Fig. 2d). Furthermore, we add multi-modality cells (and optional item: motifs entities) as nodes to the graph-linked guidance in Fig. 2a. We call this variant Monae-E (Monae-Extension). We can obtain the embeddings of all nodes from Graph VAE, including cell embeddings and feature embeddings. We use cell embeddings to replace the count’s input of the asymmetric network, and the subsequent process is consistent with Monae, as shown in Fig. 2e. Details of Monae and Monae-E are given in Methods. Fig. 2: The overall architecture of Monae.a Graph-linked embedding learning. The coarse graph-linked guidance is constructed from multi-modality relationships, and a graph encoder learns the feature embedding of multi-modality nodes. A graph decoder then generates the fine graph-linked guidance. b Construction of positive and negative samples. Asymmetric networks for each modality (e.g., scRNA-seq, scATAC-seq) transform cells into embeddings. Positive sample pairs are formed from teacher and student branches. c Contrastive cell embedding learning. Positive sample pairs are pulled together while negative samples are pushed apart through contrastive learning, yielding discriminative cell embeddings. d The learned contrastive cell embeddings serve as integration representations for unsupervised clustering. They are also combined with the multi-modal feature (node) embeddings to reconstruct imputation counts. e Further extending (a) to build Monae-E (Monae-Extension). Cells are added as nodes to the graph-linked guidance, and Graph VAE outputs all node embeddings. The following process is the same as (b), (c), and (d). Monae-E uses the cell embedding as the input of the asymmetric network and the feature embedding to reconstruct the input.For integration, we use the following datasets: 10x-Multiome30: the 10x-Multiome technology was employed to jointly (simultaneously) measure a total of 9631 human peripheral blood mononuclear cells. Chen-201931: the SNARE-seq technology was employed to jointly measure a total of 9190 cells in the mouse cortex (but we shuffled the pairing of cells). Ma-202032: the SHARE-seq technology was employed to jointly measure 32,231 cells in the mouse skin. In jointly measured datasets, every individual cell is matched uniquely with one other cell from a different modality. For example, in the 10x-Multiome dataset: the RNA modality of 9631 cells ATAC modality 9631 cells. Muto-202133: the snRNA-seq and the snATAC-seq technologies were employed to measure a total of 44,190 cells from human kidneys. There are 19,985 cells in the RNA modality and 24,205 cells in the ATAC modality. In particular, for Ma-2020 and Muto-2021, there are batch effects within each modality. The single-modality of Ma-2020 and Muto-2021 consists of 4 batches and 5 batches, respectively. Mouse cortex dataset34,35,36: the dataset comprises three distinct modalities derived from neuronal cells within the adult mouse cortex. These modalities include the ATAC modality, RNA modality, and methylation modality, which correspond to chromatin accessibility, gene expression, and DNA methylation, respectively. These modalities were measured with the Drop-seq technology, the 10x ATAC technology, and the snmC-seq technology, respectively. Human 15 organs37,38: a large-scale dataset with 4,062,980 cells for RNA modality and 720,613 cells for ATAC modality. For imputation, we use the previously introduced 10x-Multiome, Chen-2019, Ma-2020, and Muto-2021. We run PCA based on imputation counts of 10x-Multiome, Chen-2019, Ma-2020, and Muto-2021 and use PCA embedding to compare various imputation methods. The statistics of datasets are shown in Supplementary Table 1. See Supplementary Text 1 and Supplementary Fig. 1 for data preprocessing.We use the following baseline methods for both integration and imputation: scButterfly29, MultiVI11, and JAMIE12. In addition, we compare baseline methods dedicated to integration, including the recent GLUE39 and CoVEL13, and other classical methods: Seurat8, Harmony10, LIGER9, bindSC40, iNMF41, and unioncom42. We compare baseline methods dedicated to imputation, including intra-modal imputation: DCA26 and MAGIC19, and cross-modal imputation: UnitedNet17 and BABEL28. The summary of integration and imputation methods is shown in Supplementary Table 2. An overview of related work is given in related work. Following the integration benchmark39, we use the overall score to evaluate the integration performance, which is composed of “batch removal” and “biology conservation.” The two metrics are composed of (SAS: Seurat alignment score, omics layer ASW, GC: graph connectivity) and (MAP: mean average precision, cell type ASW, NC: neighbor consistency) respectively. Following the imputation benchmark27,29, we use AMI (adjusted mutual information), ARI (adjusted rand index), HOM (homogeneity score), and NMI (normalized mutual information) as metrics. See Supplementary Text 2 for details of evaluation metrics. For all benchmark experiments, we follow setup39 and randomly split 10% of the cells as a validation set for stopping training. If the validation loss does not decrease for 5 consecutive epochs, the learning rate is multiplied by 0.1. If the validation loss still does not decrease after 10 consecutive epochs, the training will be terminated. All methods are evaluated on 9 different splits.Multi-modality integrationWe evaluate proposed methods and baselines on Muto-2021, a totally unpaired and more challenging dataset. Effective integration methods should cluster the cells of the same type across modalities. we utilize UMAP43 to visualize the combined embedding across the two modalities (Fig. 3a). CoVEL, Monae, and Monae-E achieve better performance. In addition, Monae and Monae-E exhibit excellent biological heterogeneity because different cell clusters are separated from each other. We quantitatively evaluated the multi-modality integration baselines and our methods using the “batch remove” metric and “biology conservation” metric. Compared with integration baselines, our methods achieve better performance on these two metrics and are always robust to different data splittings (see Fig. 3b). To evaluate the impact of dataset scale (size), we first randomly sample and obtain multiple dataset subsets of different scales. All methods are compared on the subsets (Fig. 3c, d). We find that GLUE and scButterfly can still adapt to small-scale subsets, but Monae and Monae-E can achieve better performance. In addition, as the scale of the subset increases, Monae and Monae-E will have more advantages. More comparisons of results are presented in Fig. 3e (metric is an overall score). In addition, among these baselines, the simultaneous integration and imputation methods are not as good as the methods dedicated to integration, while Monae and Monae-E are close to or even better than these methods dedicated to integration.Fig. 3: Multi-modality integration results.a UMAP visualization of different methods on Muto-2021. Colored by modality and cell type annotations. b Integration performance comparison on Muto-2021 (the balance between batch removal and biology conservation). There are n = 9 repeats with different random seeds for data splitting. The error bars indicate mean ± s.d. c, d Robustness validation on subsampled Muto-2021 of various scales. Integration performance trends as sample size increases. There are n = 9 repeats with different random seeds for data splitting. The error bars indicate mean ± s.d. e Comparison of integration results on four datasets, where the metric is the overall score. There are n = 9 repeats with different random seeds for data splitting. The error bars indicate mean ± s.d. Source data are provided as a Source Data file.UMAP visualization comparisons on more datasets are shown in Supplementary Fig. 2. In addition, we measure 6 individual metrics (NC, MAP, cell type ASW, SAS, omics layer ASW, GC) that make up the overall score on the four datasets, as shown in Supplementary Fig. 3. We compare Monae and Monae-E with classic integration methods, as shown in Supplementary Fig. 4. Our methods still show advantages on four datasets.Batch integration in multi-modalityFor a single modality in multi-modality data, multiple batches of measurements can lead to batch effects within a single modality. Monae eliminates challenges such as batch effects in the process of contrastive learning. Taking the cell embedding of the Teacher branch as the cell representation, each cell is assigned a positive sample. Positive samples come from cell embeddings of the Student branch. In the real world, although no identical cells exist, representations of cells of the same type are closer than representations of cells of different types. Therefore, Monae clusters Teacher embeddings and Student embeddings of the same cell together, excluding the representation of any other cells. Eventually, similar cells clump together because they are of the same type. Monae has batch integration as well as multi-modality integration. We evaluate the batch integration performance of Monae, Monae-E, and three representative methods on Ma-2020 and Muto-2021 (Supplementary Fig. 5). There are 4 and 5 batches, respectively, in a single modality of Ma-2020 and Muto-2021. It is known that the adversarial learning strategy in GLUE can achieve excellent batch integration. Monae and Monae-E achieve batch integration performance similar to GLUE. In addition, Muto-2021 has obvious batch effects, and the SAS (Seurat alignment score) of Monae exceeds that of GLUE. The visualization of Monae batch integration results is shown in Supplementary Fig. 6.Interpretable integration resultsMonae facilitates the integration of multiple modalities, including but not limited to three distinct modalities. In our study, we employed Monae to integrate the Mouse Cortex dataset, which encompasses three modalities: RNA, ATAC, and Methylation. Each modality is associated with different criteria for cell type annotation, specifically, the RNA modality identifies 8 cell types, the ATAC modality identifies 10 cell types, and the Methylation modality identifies 16 cell types. It is important to note that these three modalities provide different phenotypic views of the mouse cortex. Consequently, we can select a single modality classification criterion to annotate single-cell samples. In Supplementary Fig. 7a, we provide the UMAP visualizations derived from the classification criteria specific to the RNA modality. The results indicate that Monae successfully performs integration. However, as illustrated in the left section of Supplementary Fig. 7a, clusters such as the one corresponding to the cell type CGE exhibit insufficient density. We can obtain fine-grained cell subtypes based on these clusters.Previous studies have pointed out34,35 that under the RNA modality classification criterion, MGE and CGE can be divided into two subtypes; under the ATAC modality classification criterion, L6-IT and Vip can be divided into two subtypes. In this study, we can pay additional attention to the rare cell type mDL-3. Since the Methylation modality classification criterion has fine-grained categories, we take Methylation data as the reference dataset. Let the cell embeddings of the RNA modality and ATAC modality be used as query datasets, and we use kNN to implement automatic annotation. For convenience, we merge mSst-1 and mSst-2 into mSst and merge mNdnf-1 and mNdnf-2 into mNdnf in the Methylation modality. The complete cell classification results are shown in Supplementary Fig. 7b.The flow of categories is quantified from Supplementary Fig. 7a, b, in order to derive Supplementary Fig. 7c. Cells associated with mPv and mSst are designated with a light blue flow, while those related to mNdnf and mVip are marked with a light pink flow, and cells pertaining to mDL-3 are indicated with a light green flow. The findings presented in Supplementary Fig. 7c demonstrate that the joint embedding generated by Monae aligns with established knowledge, thereby confirming the interpretability of integration results.Integration on large-scale dataLarge-scale datasets contain a rich variety of cell types and, thus, greater biological heterogeneity. Therefore, large-scale data scenarios are a challenge for multi-modality integration. We integrated Human 15 organs (RNA modality: 4,062,980 cells, ATAC modality: 720,613 cells) using GLUE, CoVEL, Monae, and Monae-E. We first evaluate the efficiency of these methods on datasets of different scales, including the time spent on unsupervised training and memory usage (RES and MEM), as shown in Fig. 4a. Due to the additional biological information, Monae-E can converge significantly faster, and it takes the least training time. It also means that Monae-E needs more memory to process the additional information, but Monae-E’s memory requirements are much lower than CoVEL. Considering the efficiency of time and memory, Monae-E can be used as practical software.Fig. 4: Large-scale multi-modality data integration comparison.a The efficiency of GLUE, CoVEL, Monae, and Monae-E on datasets of different scales, including the time spent on unsupervised training and memory usage (RES and MEM). b UMAP visualizations of GLUE, CoVEL, Monae, and Monae-E. The first row of visualizations is colored by modality (omics layer) categories, highlighting the integration results of scRNA-seq and scATAC-seq. The second row displays the same integration results colored by cell types to demonstrate the distinct clustering. Dashed circles in the subfigures indicate areas of interest where specific cell types serve as a qualitative assessment of the integration performance. c Modality labels and cell type labels in the Human 15 organs dataset. Source data are provided as a Source Data file.The UMAP visualization results of GLUE, CoVEL, Monae, and Monae-E are shown in Fig. 4b. In Fig. 4b, we have selected two cell types, Hepatoblasts and Cardiomyocytes, with blue and black dashed circles, respectively. For Hepatoblasts, Monae and Monae-E aligned the two modalities, and the two modalities corresponding to GLUE and CoVEL are still separated, as seen in the first row of Fig. 4b. For Cardiomyocytes, compared with GLUE and CoVEL, the cell clusters corresponding to Monae and Monae-E are better separated from other cell clusters, as seen in the second row of Fig. 4b. In addition, we compare the classical methods, the UMAP visualizations of Seurat and iNMF are shown in Supplementary Fig. 9. Seurat and iNMF obviously did not achieve good integration performance. We visualized the expression of marker genes on the integration results of Monae. The marker genes showed high expression on specific cell clusters, thereby verifying that the integration results were biologically meaningful (see Supplementary Fig. 10).Imputation of intra-modalityRNA modality imputation is often referred to as denoising, corresponding to intra-modal imputation. Our intuition for using the imputation of Monae and Monae-E comes from: for the RNA modality, the measurement techniques will lead to many zero values in the counts, and we use the cell representation information from other modalities (such as ATAC modality) to fill in the cell representation from the RNA modality. Therefore, multi-modality integration provides us with modal complementary cell representations that can be used to generate realistic counting signals.We use MultiVI, as well as methods dedicated to intra-modal imputation: MAGIC and DCA as baselines. We evaluate baselines and our methods on Muto-2021. Following the benchmark’s standard27, the quality of the imputation results can be reflected by the clustering (PCA dimensionality reduction on raw counts or imputation counts), because there are too many zero-value signals (dropout) in the raw counts, which can reduce the biological heterogeneity of the cellular PCA embedding. As a result, the differences between clusters are not significant enough, as shown in ’Raw’ of Fig. 5a. In Fig. 5a, the clusters of cells after Monae and Monae-E imputation are more separated from each other, and thus our methods effectively fill in the dropout in the raw data. The quantitative results are shown in Fig. 5b, where the overall performance of Monae and Monae-E outperforms baselines. In addition, Supplementary Figs. 11, 12 show the RNA modality imputation results (intra-modal imputation) on 10x-Multiome, Chen-2019, and Ma-2020.Fig. 5: Intra-modal imputation and cross-modal imputation.a UMAP visualization of intra-modal imputation counts on Muto-2021. PCA is performed on the raw counts and imputation counts, followed by UMAP with cell-type labels. b Intra-modal imputation results on Muto-2021. There are n = 9 repeats with different random seeds for data splitting. The error bars indicate mean ± s.d. c Cross-modal imputation results on Muto-2021 (RNA to ATAC). There are n = 9 repeats with different random seeds for data splitting. The error bars indicate mean ± s.d. d Cross-modal imputation results on Muto-2021 (ATAC to RNA). There are n = 9 repeats with different random seeds for data splitting. The error bars indicate mean ± s.d. e Cross-modal imputation gene expression heatmap corresponding to Monae (top) and scButterfly (bottom), with rows representing cells and columns representing genes. f Cell-level (left) and gene-level (right) PCC between the imputation and raw counts. g Cell-level (left) and gene-level (right) PCC between the imputation and MultiVI counts. Source data are provided as a Source Data file.Imputation of cross-modalityCertain properties of ATAC modality data make analysis more difficult compared to RNA modality. The read counts of ATAC modality are very low, resulting in very sparse data. Since the ATAC modality is sparser than the RNA modality, the difference in cell clusters of the former is significantly lower than that of the latter. Therefore, the clustering metrics of the ATAC modality are all lower than those of the RNA modality. To improve the biological heterogeneity of ATAC modality data, one approach is to impute data within the ATAC modality, which is the same approach as using Monae or Monae-E to impute data within the RNA modality. The other is to directly generate data in the ATAC modality based on the RNA modality. For the latter, see Generation of imputation counts. The cross-modality imputation can also be called modality prediction or modality translation. In practical applications, when facing multi-modality data with unmatched cells, Monae’s and Monae-E’s cross-modality generation can complete unpaired multi-modality samples, thereby saving the high cost of joint measurements.We use scButterfly, MultiVI, JAMIE, UnitedNet, and BABEL as baselines. We translate RNA modality data on Muto-2021 to ATAC modality. The evaluation results of cross-modal imputation are shown in Fig. 5c, and the results on 10x-Multiome, Chen-2019, and Ma-2020 are shown in Supplementary Fig. 14. Excellent imputation performance is achieved for both Monae and Monae-E. For these four datasets, the UMAP visualization of translation from RNA modality to ATAC modality is shown in Supplementary Fig. 13, and the proposed methods recover the biological heterogeneity under ATAC modality. Overall, Monae and Monae-E translated atlases proficiently characterize differences between cells and have the potential to reduce noise in the original modality and facilitate cell type identification. Similarly, we evaluate the imputation results from the ATAC modality to the RNA modality. The evaluation results of cross-modal imputation on Muto-2021 are shown in Fig. 5d. While scButterfly performs better, it is worth noting that Monae and Monae-E achieve performance close to that of scButterfly. The UMAP visualization of translation from ATAC modality to RNA modality is shown in Supplementary Fig. 15. The results on 10x-Multiome, Chen-2019, and Ma-2020 are shown in Supplementary Fig. 16. Monae achieves outstanding performance on these three datasets.Furthermore, we explore the gene expressions translated from ATAC modality to RNA modality, and we use the paired dataset 10x-Multiome. Figure 5e shows the predicted gene expression heatmap: Monae (top) and scButterfly (bottom). In the heatmap, rows represent cells, and columns represent genes. We used scanpy44 to sort the gene order, the highlighted diagonal line shows the biological heterogeneity of the imputation counts matrix. In Fig. 5e, compared with scButterfly, Monae can restore the discrimination of rare cell types (such as Treg, gdT). Referring to UnpairReg45, we evaluate the accuracy of each cell gene expression prediction by using raw counts (normalized) as ground truth. As shown in Fig. 5f, scButterfly has higher PCC at the cell level and gene level, but note that Monae’s PCC is not much behind. Considering that raw counts are always affected by dropout, we use imputation counts from MultiVI as ground truth and compare Monae and scButterfly again. As shown in Fig. 5g, the results are further improved, and Monae’s PCC exceeds scButterfly. In the same way, we compare Monae-E with scButterfly, as shown in Supplementary Fig. 17.Biological discovery application of Monae-EWe validate Monae-E’s ability to discover biological insights on Muto-2021. The following facts are known: HNF4A encodes a transcription factor (TF) that drives the differentiation of proximal tubules (PT)46; the AP-2 transcription factor family includes proteins encoded by TFAP2A and TFAP2B, which respectively regulate the development of the distal convoluted tubule (DCT) and the thick ascending limb (TAL)47; PT_VCAM1 cells are a subset of proximal renal tubular epithelial cells that express VCAM133.During graph-linked embedding learning, cells are added as nodes, allowing us to obtain joint embeddings of cells and genes, as visualized in Fig. 6a. All four marker genes appear close to the corresponding cell types. This result demonstrates that Monae-E correctly preserves cellular heterogeneity and retains relevant genes for specific cell types. In addition, we visualize the predicted gene expression patterns of these four marker genes on Monae-E’s imputation results. As shown in Fig. 6b, Monae-E’s imputation results also retain the differential expression patterns of marker genes (highlighted in specific cell clusters).Fig. 6: Biological discovery application of Monae-E.a UMAP visualization of cells and genes joint embeddings. b Marker gene expressions on imputation counts. c Regulatory relationships between VCAM1 and similar peaks. The color of the link indicates the strength of the relationship. d Pseudo-time of cell clusters. e Gene expression and peak expression in the path: PT (orange) to PT_VCAM1 (green). f TF-target genes network of HNF4A (orange node). Target genes supported by research are represented by pink nodes. Target genes for which no research support was found are represented as blue nodes. g For the blue nodes in (f), we choose the top-5 genes. The expression patterns of these candidate genes in the two modalities are consistent with HNF4A.Furthermore, we verify the cross-modal feature regulation relationships based on the feature embedding distances of Monae-E. Taking the marker gene VCAM1 of the cell cluster PT_VCAM1 as an example, we retrieve the three closest peaks to the marker gene and find that these three peaks are also physically close (genomic coordinates) in physical distance (Fig. 6c). For the most similar peaks: chr1:100719411-100719996, which is a region located about 60 kb upstream of the VCAM1 gene and has been confirmed to interact with the VCAM1 promoter33.In addition, it is known that PT cells may differentiate into PT_VCAM1 cells under specific conditions, participating in the repair process of kidney damage. We select cell clusters PT and PT_VCAM1 and calculate pseudo-time based on Monae-E imputation counts, with results consistent with the known differentiation path (Fig. 6d). Moreover, we find that chr1:100719411-100719996 and VCAM1 are co-upregulated in the path from PT to PT_VCAM1 (Fig. 6e). This result once again verifies the regulatory relationship between peaks and genes (known research: RELA binding sites are located in the chr1:100719411-100719996 region. During the transformation of PT cells into PT_VCAM1 cells, RELA acts as a transcription factor in the activation of the VCAM1 gene48).Based on the distance of feature embeddings, we can use Monae-E to infer TF-target genes (see Usage of Monae and Monae-E for details). Selecting HNF4A as the TF, Monae-E outputs the top 40 target genes, and we find existing research46,49,50,51,52,53,54 to support 17 pink nodes (Fig. 6f). For the blue nodes (no research support found), we choose the top 5 genes. The expression patterns of these five candidate genes in the two modalities are consistent with HNF4A (Fig. 6g). The above results verify Monae-E’s ability in biological discovery.

Hot Topics

Related Articles