scMaui: a widely applicable deep learning framework for single-cell multiomics integration in the presence of batch effects and missing data | BMC Bioinformatics

scMaui model descriptionVariational product-of-experts autoencodersscMaui assigns a pair of encoder and decoder to each modality as illustrated in Fig. 1, so that assay-specific features can be extracted. In order to create a joint embedding space for all modalities, scMaui deploys additional layers calculating the joint mean and standard deviation of encoded features from all assays based on the PoE approach. We assume that the embedding representation of each assay follows a Gaussian distribution, thus the joint representation calculated as the production of marginal distributions will also be a Gaussian distribution. Building on the approximation of true joint posterior introduced by Wu et al. [50], the joint standard deviation (\({\sigma }_{joint}\)) and joint mean (\({\mu }_{joint}\)) for \(N\) assays are calculated as below:$${\sigma }_{joint} = ({{{\sigma }_{0}}^{-1}+}{\sum }_{i=1}^{N}{{\sigma }_{i}}^{-1}{)}^{-1}$$$${\mu }_{joint}= ({\mu }_{0}{{{\sigma }_{0}}^{-1}+}{\sum }_{i=1}^{N}{{{\mu }_{i}{\sigma }_{i}}}^{-1})({{{\sigma }_{0}}^{-1}+}{\sum }_{i=1}^{N}{{\sigma }_{i}}^{-1}{)}^{-1}$$\({\mu }_{0}\) and \({\sigma }_{0}\) are two parameters of the prior Gaussian distribution explaining the joint embedding space, and we set it to a standard Gaussian distribution, \(N(0, {\rm I})\).Flexible batch effects removal using adversarial learning and covariatesscMaui excludes sources of variations introduced by batch effects using two different techniques. Firstly, vectors storing batch effect factors are fed into the encoders and decoders individually as covariates. Secondly, scMaui uses adversarial learning via additional feed-forward networks and gradient reversal layers [51]. The decoders reconstruct the input assays only from the joint latent factors and batch effect factor information conveyed by adversaries and covariates respectively.Flexibility in batch effect correction is one of the strengths of scMaui. Batch effect factors can be both categorical and continuous values. Although varied statistical methods for batch effect correction have been proposed [52, 53], these commonly allow only categorical batch effect factor values or even require further preprocessing such as separation of input dataset by batches. scMaui supports automatic and simultaneous correction of multiple batch effects [24].Deep neural networks are optimised in a direction towards decreasing the given loss function value over multiple steps. scMaui has a loss function which consists of three terms:$${L}_{total} = {{L}_{kl}({\mu }_{joint}, {\sigma }_{joint} ) +\Sigma }_{i =1…I }({L}_{recon}({Y}^{i}, \widehat{{Y}^{i}})) – {{\Sigma }_{j=1…J}{ {L}_{adv}({b}^{j}), }}$$
(1)
where each assay \({Y}^{i}\in {\mathbb{R}}^{batch\_size \times N}\) has \(N\) features. In scMaui, varied types of reconstruction loss (\({L}_{recon}\)) functions are supported depending on the assay type and the distribution of expression levels. For instance, a negative binomial loss function is reasonable to read count assays whereas methylation data can be better reconstructed by a binary loss function owing to the bimodal distribution of methylation beta values. A full list of supported reconstruction losses is available in Supplementary Methods.For each VAE assigned to an assay \(i\), KL-divergence loss (\({L}_{kl}\)) is calculated as generally done in VAE training [54]:$${L}_{kl}\left({\mu }_{joint}, {\sigma }_{joint}\right)= -\frac{{w}_{kl}}{2}\left({\Sigma }_{l=1…K}\left(1+log\left({{{\sigma }^{l}}_{joint}}^{2}\right)-{{{\mu }^{l}}_{joint}}^{2} -{{{\sigma }^{l}}_{joint}}^{2}\right)\right),$$where \(K\) is the dimension of latent space and \({w}_{kl}\) is the weight of KL-divergence loss given as a hyperparameter.In addition, an adversarial loss term (\({L}_{adv}\)) is incorporated into the total loss function to correct batch effects inherent in the input data. Adversarial learning [55] is a technique to force model training towards increasing certain loss term values on purpose. Since the latent space should ideally contain cellular heterogeneity information correcting all batch effects, the ground-truth batch effect vectors should not be predictable from the embedded latent factors. This idea is applied to scMaui loss function in a way of increasing cross-entropy loss between ground-truth and predicted batch effect values. According to Eq. (1), the model tries to decrease \({L}_{total}\) by increasing \({L}_{adv}\). When total \(J\) different batch effect factors are assigned, a cross-entropy loss \({L}_{adv}\) is calculated for each of those and summed up together at the end. This approach was proven to improve batch effect correction by VAE-based models in our previous work [24].Furthermore, different types of batch effect factors are handled in parallel as separate input vectors of covariates and adversaries. Continuous variables are also supported for batch effect handling in the scMaui algorithm. These provide high flexibility in scMaui downstream analysis for single-cell multiomics data, while users have to preprocess the batch effect information fitting to the required format (e.g. one category of all integrated batch effect factors) when using the majority of other single-cell multiomics integration tools. If a one-hot encoded categorical batch effect factor \({b}^{j}\) containing \(K\) labels is given, the adversarial learning uses a categorical cross-entropy loss function:$${L}_{adv}({b}^{j}) ={-\Sigma }_{k=1…K}{{b}^{j}}_{k}log({{softmax(a}^{j}}_{k}(\mu_{joint}))),$$where \({a}^{j}(\cdot )\) is the adversarial network assigned to the batch effect factor. We note that outputs of the adversarial network are softmax-normalised for the categorical cross-entropy loss. If the batch effect factor \({b}^{j}\) has continuous values, we use the squared error loss:$${L}_{adv}({b}^{j}) ={\Sigma }_{k=1…K}({{b}^{j}}_{k}-{{a}^{j}}_{k}(\mu_{joint}){)}^{2},$$where all notations are the same as the categorical cross-entropy loss function above.Seamless handling of missing datascMaui includes the unique benefit of handling missing data as well. Mosaic data where not all samples have complete modalities can be still modelled by scMaui using the “missingness” indicator. scMaui supports a “mask” layer where users can indicate which modalities or features are absent for each sample, so that loss can be calculated only with measured assays and the robust embedding space can be created despite of incomplete data, using the PoE-based calculation.Data preprocessingFour different single-cell multiomics datasets in various biological scenarios were used to assess scMaui performance. From GEO accession number GSE194122 [25], we downloaded two single-cell bone marrow datasets: one was comprised of RNA-seq and ADTs data, whereas the other included RNA-seq and ATAC-seq. Mouse skin cell SHARE-seq data was downloaded from GEO accession number GSE140203 [26]. The skin cell data set contains RNA-seq and ATAC-seq profiles. Mouse embryo single-cell multiomics data, which consists of RNA-seq and BS-seq, was also collected with GEO accession number GSE121708 [45].Since each assay has its own properties, different preprocessing pipelines were applied accordingly. For RNA-seq and ATAC-seq assays, sequencing read counts were normalised and log1p-transformed using scran [56] and scanpy [57]. Then, only for ATAC-seq assay, which is more sparse than RNA-seq, we excluded features covering less than 5% of the entire cells. ADTs were also normalised with the centred log-ratio transformation.However, for the mouse skin SHARE-seq data, we mainly followed the preprocessing described by scMM paper [18]: choosing the top 5,000 genes and 25% peaks for RNA-seq and ATAC-seq, respectively. Features expressed in less than 10 cells were excluded before choosing the top features.Only for ATAC-seq in the human bone marrow dataset (GSE194122) and BS-seq assays, we filtered out features included neither in promoters nor in enhancer regions. hg38 and mm10 genomes were used for finding promoter and enhancer regions, as a human and mouse reference genome each. Promoters were collected from UCSC dataset, while enhancers were downloaded from FANTOM5 (https://slidebase.binf.ku.dk/human_enhancers/, [58]). After the filtering, we selected only 5,000 most variable features from the assay.The human bone marrow dataset (GSE194122) provides very detailed cell-type labels (45 labels for RNA-seq and ADTs data, 22 labels for RNA-seq and ATAC-seq data). Thus, we further coarsely annotated each cell by grouping given cell-type labels. In this study, the provided cell-type labels and the new group annotations are referred to as subpopulation and population, respectively. Supplementary Tables 1 and 2 show the annotation matches of population and subpopulation in individual datasets. For the benchmarking, we used 84,677 cells in total including 11 populations which comprise 45 subpopulations. These were collected from 8 different donors and processed in 4 sites, creating a multiple batch effect landscape representative of many real-world scenarios.Experimental setupAssignment of batch effect factorsDepending on the single-cell multiomics data collection pipeline, different factors can introduce additional sources of variation within the dataset. For the human bone marrow dataset, two different factors were considered to cause batch effects: donor and site. On the other hand, in the mouse embryo development analysis, only different embryos were regarded as a batch effect factor. All the single-cell multiomics integration methods with the exception of scMaui and sciPENN took one merged vector of batches including both donor and site batch factors, because those accept only a single type of factor for batch effect correction.scMauiAlthough scMaui requires many different hyperparameters regarding VAE setup, we kept a consistent setup for most of those in all analyses. We gave 20 layers with 512 units to the encoder and 1 layer with 20 units to the decoder each. The adversary network was established with 2 layers consisting of 128 units. Only in the input layer, dropout was applied with the rate 0.1. We used 512 as batch size in all model training.The number of latent factors and training epochs were chosen depending on the number of features within the dataset in each analysis. While the human bone marrow gene expression and protein expression analyses used only 28 latent factors due to a much lower number of protein expression features, the other 2 analyses were done with scMaui trained for 50 latent factors. All models were trained over 1500 epochs with the exception of the human bone marrow RNA-seq and ATAC-seq analysis whose training was 2000 epochs owing to the larger number of features and less sparsity in ATAC-seq data compared to the methylation data. Regarding the reconstrufction loss, we gave negative binomial loss to respective assays except for the BS-seq assay in the mouse embryo development analysis. Since methylation data generally presents distinct binary signals, methylated or unmethylated, in single-cell data, we utilised binary reconstruction error.PCAPrincipal component analysis (PCA) is a linear dimensionality reduction algorithm which extracts principal components that explain the most variance. We conducted PCA on each assay independently and extracted 20 components.MOFAMOFA [8] is an R package which can dissect multiomics datasets into integrated factors and provide a low dimensional reconstruction. In this study, MOFA2 version 1.6.0 was used and hyperparameters were chosen based on get_default_model_options, get_default_data_options and get_defalut_training_options functions provided by the package. MOFA also has the option to explain variations across different groups of samples, thus we did 2 separate analyses, with and without the group option, to compare the results in terms of batch effect correction.SeuratSeurat v3 [10] was mainly developed for integrating single-cell datasets generated via different experiments, but we used it in this study focusing more on its capacity to find associations between different assays. We applied sctransform only to the RNA-seq assay in all analyses. In order to handle the batch effect, the Seurat object was split into different batches. Then, we selected integration features with 3000 features, and found integration anchors between different assays. For finding integration anchors, we assigned 100 to the minimum number of neighbours as filtering threshold, ‘rann’ (it refers to as fast nearest neighbour search) to the neighbouring method. Ultimately, we ran PCA on the integrated data and extracted 20 components in each analysis.In the mouse skin SHARE-seq benchmarking, more recently published versions of Seurat were applied. Seurat v4 introduced weighted-nearest neighbour (WNN) to find a joint distribution of multiple modalities of omics data to define cellular state [31], whereas a new pipeline to integrate scRNA-seq and scATAC-seq data was revealed in Seurat v5 tutorials [32]. In the Seurat v5 tutorial, Signac [59] is used to estimate the transcriptional activity based on ATAC-seq counts and the integration is done using canonical correlation analysis (CCA). For the benchmarking, we integrated the RNA-seq and ATAC-seq data from the mouse skin data set following these two pipelines in their tutorials (https://satijalab.org/seurat/archive/v4.3/weighted_nearest_neighbor_analysis and https://satijalab.org/seurat/articles/seurat5_atacseq_integration_vignette), separately. The results are annotated as Seurat_wnn and Seurat_cca, respectively.totalVI and MultiVItotalVI [9] is a VAE-based method for integrating specifically CITE-seq data, which is comprised of transcriptomes and epitopes, so it was only tested with the human bone marrow RNA-seq and ADTs data. We used totalVI included in python package scvi version 0.17.1. Batch label was given as the combination of donor and site. We set up batch size as 256, the ratio of training samples as 0.8 and the learning rate as \(4\times {10}^{-3}\). The training was done over 300 epochs, and afterwards, we ran the posterior modelling with batch size 32. Otherwise, we followed up on all default hyperparameter setups described in the scvi-tools package tutorial (https://docs.scvi-tools.org/en/0.6.6/index.html).MultiVI [30] is another VAE-based single-cell multiomcs integration method provided in the scvi-tools package. Since the targeted data types for the mouse SHARE-seq benchmarking (RNA-seq and ATAC-seq) are not supported by totalVI, we replaced totalVI with MultiVI. When it comes to hyperparameters and pipelines, we followed the tutorial provided by the authors (docs.scvi-tools.org/en/stable/tutorials/notebooks/multimodal/MultiVI_tutorial.html).scMMscMM [18] adopts the mixture-of-experts VAEs to integrate either RNA-seq and ATAC-seq or two modalities of CITE-seq. We downloaded the source code from the GitHub repository (https://github.com/kodaim1115/scMM) and used the main.py file for our benchmarking experiments. 30 and 40 latent factors were used for human bone marrow CITE-seq data and mouse skin SHARE-seq data each.sciPENNsciPENN [60] integrates CITE-seq data and imputes the missing values by primarily using feed-forward neural networks and recurrent neural networks. Since the method is specifically designed for the CITE-seq technology, we added sciPENN only in the human bone marrow CITE-seq data benchmarking. The baseline tutorial provided by the authors (https://github.com/jlakkis/sciPENN) was followed to run sciPENN for the benchmarking.MowgliMowgli [61] is a non-negative matrix factorisation (NMF) model combined with optimal transport for single-cell multiomics integration. Since Mowgli only performed reasonably with a lower number of features during the experiments, we selected the top 500 highly variable genes for RNA-seq in the human bone marrow CITE-seq data. For the mouse skin SHARE-seq data, we chose the top 800 features for each modality. The number of latent factors was 25 and 40 for the human bone marrow data and mouse skin SHARE-seq data integrations each.Performance measuresLouvain clusteringThe Louvain community detection algorithm identifies clusters based on the relative density of connections between the inside and the outside of each cluster [62]. This method has been applied to a broad range of single-cell analyses [63]. We clustered latent factors with the Louvain clustering algorithm and assessed the performance of respective methods.Adjusted mutual informationMutual information (MI) measures the similarity between ground-truth clusters and predicted clusters. Given \(n\) ground-truth clusters \(U = \{{U}_{1}, {U}_{2}, …, {U}_{n}\}\) and \(m\) predicted clusters \(V = \{{V}_{1}, {V}_{2}, …, {V}_{m}\}\), MI between two clusters is given as:$$MI\left(U, V\right)= {\sum }_{i=1}^{n}{\sum }_{j=1}^{m}\frac{\left|{U}_{i} {\cap }{V}_{j}\right|}{n + m} log\frac{\left(n+m\right)\left|{U}_{i} {\cap }{V}_{j}\right|}{\left|{U}_{i}\right|\left|{V}_{j}\right]}; \left|{X}_{t}\right|:= Number\, of \,elements \,in \,cluster {X}_{t}.$$However, MI value tends to be higher when the number of clusters is larger. Therefore, we used adjusted mutual information (AMI) in our evaluation, which takes account of chance as follows:$$AMI(U,V) =\frac{MI(U, V) – E(MI(U, V))}{max(MI(U, V)) – E(MI(U, V){)}}; E(x) = Expectation\, of\, x.$$Adjusted rand indexRand index (RI) is a statistic measuring the similarity between two different clusters. In our analysis, it was used for comparing predicted clusters and ground-truth clusters. RI is calculated as follows:$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$where \(TP , TN, FP, FN\) refer to true positive, true negative, false positive, and false negative, respectively. Adjusted rand index (ARI) is the corrected version of RI for chance and makes the measurement independent of the number of clusters. ARI equation is defined as:$$ARI =\frac{RI – E(RI)}{max(RI) – E(RI{)}}; E(x) = Expectation\ of\ x.$$Clustering purityWe also calculated the purity of the most dominant class in each cluster. When \(C\) different ground-truth classes are grouped into one cluster, the clustering purity is calculated as follows:$$Clustering\, Purity = \frac{max({n}_{i})}{{\prod }_{i=1}^{C}{n}_{i}}; i = 1, 2, …, C.$$\({n}_{i}\) means the number of elements from the class \(i\).Silhouette scoreFor silhouette score values, we used the silhouette coefficient which measures the similarity of each element to the cluster where it belongs, compared to the closest neighbouring cluster. In single-cell analysis, it often represents the separation of clusters meaning that higher value indicates better separation. For each cell \({c}_{i}\), the silhouette score is calculated as below:$$Silhouette Score = \frac{b({c}_{i}) – a({c}_{i})}{max(b({c}_{i}) ,a({c}_{i}))}$$where \(b({c}_{i})\) refers to the mean distance between the given cell \({c}_{i}\) and the nearest neighbouring cluster, and, \(a({c}_{i})\) is the mean distance between the given cell \({c}_{i}\) and all other cells within the same cluster.

Hot Topics

Related Articles