Evaluating batch correction methods for image-based cell profiling

Selection of batch correction methods and evaluation strategies

A major part of our work was to comprehensively survey methods for batch correction, as well as strategies for their evaluation. Given the rapid advancements in the field of scRNA-seq, particularly in the development of methods to address batch correction, we focused our attention on this area. We decided to test a subset of the better-performing methods identified in a recent analysis of scRNA-seq batch correction methods17,19. These methods were available in Python or R and required no additional metadata. Additionally, the chosen methods were representative of different approaches and included linear methods (Combat27 and Sphering28), neural-network based methods (scVI29 and DESC30), a mixture-model based method (Harmony24), and nearest neighbor-based methods (MNN31, fastMNN32, Scanorama33, Seurat-CCA25, and Seurat-RPCA26). We excluded methods like BBKNN34 that do not correct the underlying profiles (Supplementary Table 3).

We will briefly summarize the main characteristics of these methods, to enable the reader to place our results in the appropriate context. Combat27 models batch effects as multiplicative and additive noise to the biological signal and uses a Bayesian framework to fit linear models that factor such noise out of the readouts. Sphering28 computes a whitening transformation matrix35 based on negative controls and applies this transformation to the entire dataset; it requires every batch to include negative control samples for which variations are expected to be solely technical. scVI29 is a variational autoencoder model for scRNA-seq data that learns a low-dimensional latent representation of each input, reducing batch effects. DESC30 trains an autoencoder along with an iterative clustering algorithm to remove batch effects and preserve biological variation; it requires the biological variable of interest as input, which may be unknown at the batch correction stage. Harmony24 is an iterative algorithm based on expectation-maximization that alternates between finding clusters with a high diversity of batches and computing mixture-based corrections within such clusters. The MNN, fastMNN, Scanorama, and two Seurat methods all rely on identifying pairs of mutual nearest neighbor profiles across batches and correcting for batch effects based on differences between these pairs. MNN31 was the first implementation of this concept, with each of the other methods presenting adaptations optimized for specific applications. fastMNN32 is a computationally efficient implementation for batch correction of large datasets by some of the original MNN authors, where the main speed gains are achieved by performing PCA prior to finding the nearest neighbors. Scanorama33 is optimized for large, heterogeneous datasets by finding approximate nearest neighbors within a low-dimensional space, and by finding nearest neighbors across all datasets instead of between each pair of datasets; this relaxes the assumption that each pair of datasets must contain at least one common sub-population and is intended to prevent overcorrection. The Seurat methods each search for neighbors within a joint low-dimensional space (defined by canonical correlation analysis for Seurat-CCA25 and by reciprocal PCA for Seurat-RPCA26). CCA makes stronger assumptions about shared sub-populations and performs well when the cell state/type composition is similar between datasets, whereas RPCA allows for more heterogeneity between datasets and is faster36 for large datasets (Supplementary Fig. 7). The nearest neighbor methods differ in many specific implementation details and in whether they return batch-corrected data in the original feature space or in a low-dimensional latent space. We refer interested readers to the original publications for more details.
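To make the Sphering step more concrete, the following is a minimal NumPy sketch of a ZCA-style whitening transform fitted on negative-control wells and applied to all profiles. The array names and the regularization constant `eps` are hypothetical illustrations, not the exact implementation used in this study.

```python
import numpy as np


def fit_sphering(negcon_profiles: np.ndarray, eps: float = 1e-6):
    """Fit a ZCA whitening transform using negative-control profiles only."""
    mu = negcon_profiles.mean(axis=0)
    centered = negcon_profiles - mu
    # Eigendecomposition of the negative-control covariance matrix.
    cov = np.cov(centered, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    # Whitening matrix W = V diag(1 / sqrt(lambda + eps)) V^T (ZCA form).
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return mu, W


def apply_sphering(X: np.ndarray, mu: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply the transform learned on negative controls to all profiles."""
    return (X - mu) @ W


# Hypothetical usage, with X a wells-by-features matrix and is_negcon a
# boolean mask marking negative-control wells:
# mu, W = fit_sphering(X[is_negcon])
# X_sphered = apply_sphering(X, mu, W)
```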
All tested batch correction methods except Sphering require batch labels; Sphering alone requires negative control samples; and only DESC additionally requires biological labels. FastMNN, MNN, Scanorama, and Harmony necessitate recomputing batch correction across the entire dataset whenever new profiles are incorporated. While Sphering, Combat, scVI, and DESC do not require recomputation, they do not guarantee performant corrections for new profiles from an unseen source.

Developed initially for scRNA-seq data, these methods also apply to morphological profiles, despite inherent biological and statistical differences. Most foundational assumptions behind the methods, including the use of vector space metrics to reveal similarities, remain valid in the image-based profiling domain (see Methods: Distributional assumptions). However, we note a crucial difference in the manner in which we applied these methods. Given the sheer volume of data in large image-based profiling datasets, which may contain billions of single cells (Supplementary Table 1) compared to the millions typically found in scRNA-seq, it is computationally impractical to apply these methods at the single-cell level. Importantly, three out of the ten methods require computing batch correction across the entire dataset, which means that subsampling, a strategy often used to manage large datasets, is not a viable option for them. Thus, we evaluated these methods on their ability to correct batch effects in population-averaged well-level (or pseudo-bulk) profiles, rather than single-cell profiles. This shift does alter the distribution of the features, but we believe it is an acceptable trade-off given the computational constraints and the overall goal of correcting batch effects at a broader level.
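To illustrate how well-level (pseudo-bulk) profiles can be handed to scRNA-seq tooling, below is a minimal sketch that wraps a profile table in an AnnData object and runs two of the tested methods through scanpy's interfaces (`sc.pp.combat` and `sc.external.pp.harmony_integrate`; the latter requires the harmonypy package). The synthetic data and the `Metadata_Batch` column name are hypothetical; the actual pipeline and parameters used in this study are described in the Methods.

```python
import anndata as ad
import numpy as np
import pandas as pd
import scanpy as sc

# Hypothetical stand-in for well-level (pseudo-bulk) profiles:
# one row per well, morphology feature columns plus a batch label.
rng = np.random.default_rng(0)
wells = pd.DataFrame(rng.normal(size=(200, 30)),
                     columns=[f"feature_{i}" for i in range(30)])
wells["Metadata_Batch"] = rng.choice(["batch_1", "batch_2", "batch_3"], size=200)
wells.index = wells.index.astype(str)

features = [c for c in wells.columns if not c.startswith("Metadata_")]
adata = ad.AnnData(X=wells[features].to_numpy(),
                   obs=wells[["Metadata_Batch"]])

# ComBat corrects the feature matrix in place, given the batch label.
sc.pp.combat(adata, key="Metadata_Batch")

# Harmony instead operates on a PCA embedding and writes the corrected
# representation to adata.obsm["X_pca_harmony"] (requires harmonypy).
# The two calls are shown back-to-back only for illustration; in practice
# each method would be applied to its own copy of the data.
sc.pp.pca(adata, n_comps=20)
sc.external.pp.harmony_integrate(adata, key="Metadata_Batch")
```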
Beyond batch correction methods for scRNA-seq, we also reviewed those specifically designed for image-based profiling data analysis. A small handful of past research in this domain has incorporated a step for batch correction. Such approaches can be split broadly into two categories: those based on (a) pre-computed feature transformation and (b) representation learning. Pre-computed feature transformation approaches learn a transformation of features that have been extracted from images. In such workflows, common normalization steps have been described37 to deal with interplate and intraplate normalization; these aim to reduce local variances but are limited when technical variations are strong. Sphering28 is the most widely used batch correction method for feature-transformation-based profiles, applied routinely in Cell Painting pipelines9,38, and was included in our testing.

Next, we considered which dataset to use as a benchmark in our evaluation. The RxRx139 and RxRx340 datasets are resources for image-based profiling, with millions of images associated with thousands of compounds, but they derive from a single, highly quality-controlled laboratory and therefore cannot be used to assess methods aiming to correct more dramatic batch effects. We chose the JUMP Cell Painting Consortium data specifically because it originated from a diverse range of laboratories using different instruments and protocols, thus capturing the heterogeneity typical of large public datasets. Unlike other public resources that contain data from a single source, the diversity of data sources in the JUMP dataset provides a robust testbed for developing methods that can generalize to other heterogeneous datasets. It also allows mimicking a situation where an individual laboratory might attempt to align their data with public data collected in multiple laboratories. We assess five batch correction scenarios with increasing technical heterogeneity, where each scenario represents a more challenging batch correction task.

We computed four metrics that report the effectiveness in removing batch effects and six metrics that measure how well the correction preserves biological information (Fig. 1), previously reported in scRNA-seq benchmarks16,17,19. The ten quantitative metrics used in this study are described in the Metrics section. We also use UMAP41 visualizations as a qualitative tool for assessing batch correction effectiveness.

Fig. 1: Evaluation pipeline. We evaluated five image-based profiling scenarios differing in image acquisition equipment (high-throughput microscopes), laboratory, number of compounds, and number of replicates. We used a state-of-the-art pipeline for image analysis. We compared ten batch correction methods using qualitative and quantitative metrics.

Scenario 1: Single microscope type, single laboratory, multiple batches, few compounds, multiple replicates

In this scenario, we analyzed 302 landmark compounds present on the Target2 plates (see Methods: Dataset description), where each compound is present in each of the 13 experimental runs/batches produced by a single laboratory (source_6). There were a median of 21 replicates per compound. Given that profiles were generated in the same laboratory, with many replicates and relatively low technical variance, this simplified scenario helped us establish a baseline for the best possible results while also guiding the pipeline that could be applied across all more complex scenarios. We considered the experimental run as the batch variable because it is the most dominant confounding source within a single laboratory's data.

Image-based profiling requires carefully designing feature extraction and preprocessing pipelines to accurately represent the biological content of images. Preprocessing refers to the transformation and filtering steps that prepare raw data for in-depth analysis42. Our preprocessing pipeline comprised four steps: (1) low-variance feature removal; (2) median absolute deviation normalization43, which rescales individual features plate-wise; (3) rank-based inverse normal transformation44, which transforms a variable into one that more closely resembles a standard normal distribution; and (4) feature selection, which removes redundant features by correlation thresholding. See Methods: Preprocessing pipeline for details. We refer to this processed representation as the Baseline.
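A minimal sketch of these four preprocessing steps on a well-level profile table, using plain pandas/NumPy/SciPy; the thresholds, the `Metadata_Plate` column name, and other details are hypothetical choices rather than the exact implementation documented in Methods: Preprocessing pipeline.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm, rankdata


def preprocess(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Four-step, Baseline-style preprocessing of a well-level profile table."""
    X = df[feature_cols].copy()

    # (1) Remove near-constant (low-variance) features.
    X = X.loc[:, X.std() > 1e-6]

    # (2) Plate-wise median absolute deviation (MAD) normalization.
    def mad_normalize(plate: pd.DataFrame) -> pd.DataFrame:
        med = plate.median()
        mad = (plate - med).abs().median() + 1e-6
        return (plate - med) / (1.4826 * mad)

    X = X.groupby(df["Metadata_Plate"], group_keys=False).apply(mad_normalize).sort_index()

    # (3) Rank-based inverse normal transformation, applied per feature.
    def inverse_normal(col: pd.Series) -> pd.Series:
        ranks = rankdata(col)
        return pd.Series(norm.ppf((ranks - 0.5) / len(col)), index=col.index)

    X = X.apply(inverse_normal)

    # (4) Feature selection: drop one feature from each highly correlated pair.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
    return X.drop(columns=to_drop)
```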
To quantify batch correction methods, we examined two kinds of metrics16,17,19: (1) batch removal metrics, which capture how well a method removes confounding variation from the data; and (2) conservation of biological variance metrics, termed bio-metrics here for short, which capture the ability to preserve the variation related to a known biological variable (e.g., chemical compound in our case). There is a trade-off between bio-metrics and batch removal metrics: a method could remove all of the confounding batch variation but simultaneously destroy the biological signal in the data. Note also that some of these metrics are sensitive to the number of samples per concept of interest (e.g., compound), others focus on a notion of local neighborhood, and others consider more global arrangement. Because different metrics capture different aspects of the correction procedure, and no individual metric captures both effects, we took them all into consideration when comparing batch correction methods. The average of such metric scores has been shown to be a reasonable ranking criterion17,19. To ease interpretation, every metric was normalized between 0 and 1, with 0 being the worst performance and 1 being the best. These quantitative metrics (Supplementary Fig. 1) guided the selection of the preprocessing steps to be used as our Baseline approach, in agreement with established protocols in the field43. All remaining results in this paper apply this four-step preprocessing.

Following the preprocessing, we applied each of the ten batch correction methods described above, which have previously been identified as top-performing methods when applied to scRNA-seq data. We reiterate that we used pseudo-bulk profiles and did not attempt batch correction at the single-cell level, due to the computational time required to process the billions of cells that can be included in a typical image-based profiling experiment. Qualitatively, the 2D projection of the profiles after applying the ten different methods, in addition to our Baseline, shows no clusters associated with any particular batch (Supplementary Fig. 2), suggesting that all the methods successfully mixed the profiles from different batches. However, we noticed that Harmony, fastMNN, and the Seurat methods grouped the data points associated with the same compound better than the others, suggesting that these methods are more effective at preserving biological information. Here, positive control compounds exhibited better clustering (see Methods: Dataset description), which was expected as these compounds were chosen based on their strong phenotypes in previous Cell Painting experiments45.

When evaluating the quantitative metrics (Fig. 2), all ten methods showed similar performance overall. In comparison to the Baseline, Combat and MNN achieved slightly better batch correction. In summary, Scenario 1 helped us optimize our evaluation pipeline, showing that the Baseline preprocessing produced good-quality data for subsequent batch effect correction.

Fig. 2: Evaluation scenario 1. Quantitative comparison of ten batch correction methods measuring batch effect removal (four batch correction metrics) and conservation of biological variance (six bio-metrics). Metrics are mean aggregated by category. Overall score is the weighted sum of aggregated batch correction and bio-metrics with 0.4 and 0.6 weights respectively.
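As a concrete illustration of how the overall scores reported in Figs. 2-6 are assembled from the individual metrics, here is a minimal sketch assuming each metric has already been normalized to the [0, 1] range; the placeholder metric names and values are hypothetical (the actual four batch metrics and six bio-metrics are listed in the Metrics section).

```python
import numpy as np


def overall_score(batch_metrics: dict, bio_metrics: dict,
                  w_batch: float = 0.4, w_bio: float = 0.6) -> float:
    """Mean-aggregate each metric category, then combine them with 0.4/0.6 weights."""
    batch_score = float(np.mean(list(batch_metrics.values())))
    bio_score = float(np.mean(list(bio_metrics.values())))
    return w_batch * batch_score + w_bio * bio_score


# Placeholder, already-normalized values for one hypothetical method.
batch_metrics = {f"batch_metric_{i}": v for i, v in enumerate([0.81, 0.55, 0.40, 0.72])}
bio_metrics = {f"bio_metric_{i}": v for i, v in enumerate([0.62, 0.58, 0.65, 0.70, 0.74, 0.90])}
print(round(overall_score(batch_metrics, bio_metrics), 3))
```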
Scenario 2: Single microscope type, multiple laboratories, few compounds, multiple replicates

In this scenario we analyzed the 302 Target2 compounds from 43 experimental runs/batches produced by three laboratories using the same model of microscope. Scenario 2 has the same compounds as Scenario 1 but includes two additional laboratories and therefore has more replicates (Fig. 1). We considered laboratory ID (identifier) as the batch variable because it is the most dominant confounding source (Supplementary Fig. 3). The Baseline approach was not able to integrate data from multiple laboratories, and batch effects were notable in the embedding, with the clusters being driven by the batch variable.

Unlike Scenario 1, Scenario 2 revealed variations in the efficacy of the methods in removing the confounding variable's effect. Scanorama, the Seurat methods, Harmony, fastMNN, and scVI clustered samples from multiple laboratories more effectively than other methods. In contrast, the Sphering correction did not improve the performance with respect to the Baseline, and MNN, Combat, and DESC did not differ substantially from the Baseline. When observing the embeddings labeled by compound (Supplementary Fig. 3), Scanorama, the Seurat methods, Harmony, fastMNN, and scVI were able to group samples from the same compound. In contrast, the Baseline and Sphering showed clusters with laboratory ID as the major separation criterion, creating one cluster per laboratory–compound pair.

Consistent with these qualitative observations, Scanorama, the Seurat methods, and Harmony were also the top performers in the quantitative metrics (Fig. 3) for both batch removal and conservation of biological variance. Taken together, we observed that introducing stronger technical variations, i.e., analyzing data from multiple laboratories, decreased the performance of batch effect correction methods compared to Scenario 1; however, the methods' rankings remained relatively consistent, with Harmony and all nearest neighbor-based methods except MNN showing superior performance.

Fig. 3: Evaluation scenario 2. Quantitative comparison of ten batch correction methods measuring batch effect removal (four batch correction metrics) and conservation of biological variance (six bio-metrics). Metrics are mean aggregated by category. Overall score is the weighted sum of aggregated batch correction and bio-metrics with 0.4 and 0.6 weights respectively.

Scenario 3: Single microscope type, multiple laboratories, multiple compounds, few replicates

In this scenario we analyzed 82,278 compounds from 43 experimental batches produced by three laboratories using the same model of microscope. We again used laboratory ID as the batch variable because it is the most dominant confounding source (Supplementary Fig. 4). This scenario posed an additional challenge due to the reduced number of replicates for most of the compounds; around 15,000 compounds had only one replicate, and ~79,000 had 3 or fewer replicates (Supplementary Fig. 7). Importantly, the eight positive controls had around 2500 replicates each.

Quantitatively (Fig. 4), the Seurat methods, Scanorama, fastMNN, Harmony, and scVI again obtained better results than the Baseline; Sphering, Combat, and MNN achieved performance comparable to the Baseline; and DESC underperformed in both bio-metrics and batch metrics.
Overall, the increased complexity of the dataset resulted in all methods struggling to remove the batch effects, which remained notable after correction attempts. Compared to Scenario 2, the methods were generally less effective when dealing with more compounds and fewer replicates.

Fig. 4: Evaluation scenario 3. Quantitative comparison of ten batch correction methods measuring batch effect removal (four batch correction metrics) and conservation of biological variance (six bio-metrics). Metrics are mean aggregated by category. Overall score is the weighted sum of aggregated batch correction and bio-metrics with 0.4 and 0.6 weights respectively.

Scenario 4: Multiple microscope types, multiple laboratories, few compounds, multiple replicates

In this scenario we analyzed the 302 Target2 compounds from 46 experimental runs/batches produced by five laboratories using three different high-throughput imaging systems. Three sources used the CellVoyager CV8000 system, one source used the ImageXpress Micro Confocal system, and one source used the Opera Phenix system. This scenario was similar to Scenario 2 in having the same number of unique compounds; however, in Scenario 4 the batch effects are mainly driven by differences in imaging technologies (Fig. 5D). We again considered laboratory ID as the batch variable to stay consistent with Scenarios 2 and 3.

Fig. 5: Evaluation scenario 4. A Quantitative comparison of ten batch correction methods measuring batch effect removal (four batch correction metrics) and conservation of biological variance (six bio-metrics). Metrics are mean aggregated by category. Overall score is the weighted sum of aggregated batch correction and bio-metrics with 0.4 and 0.6 weights respectively. Visualization of integrated data colored by B Compound, C Laboratory, and D Microscope. The left-to-right layout reflects the methods' descending order of performance. We selected 18 out of 302 compounds with replicates in different well positions to account for position effects that may cause profiles to look similar; the embeddings are the same across B-D, but samples treated with compounds other than the selected 18 are not shown in B. Alphanumeric IDs denote positive controls. Source data are provided as a Source Data file.

The top six methods ranked by the quantitative metrics were also qualitatively better (Fig. 5B, C, D), with Seurat CCA generating the best quantitative results (Fig. 5A). Similar to Scenarios 2 and 3, MNN and Combat did not differ substantially from the Baseline, scVI outperformed the linear models, and DESC and Sphering underperformed with respect to the Baseline. Compared to Scenario 2, performance consistently decreased across methods and most metrics, with the batch correction metrics exhibiting the largest drop, indicating that the introduction of multiple microscope types had a strong impact on each method's ability to align the data.
Notably, the bio-metrics did not change substantially compared to Scenario 2, suggesting that relevant patterns can still be uncovered in the presence of substantial batch effects, so long as the batch is not confounded with the experimental factors of interest.

Scenario 5: Multiple microscope types, multiple laboratories, multiple compounds, few replicates

In the final, most complex scenario, we analyzed 82,412 compounds from 60 experimental runs/batches produced by five laboratories using different microscope systems, as described in Scenario 4. We used laboratory ID as the batch variable. Again, we found that the differences between microscopes were the strongest confounding factor (Supplementary Fig. 5C). The 2D embeddings for Combat, MNN, and Sphering form several clusters; however, those are still dominated by the source (Supplementary Fig. 5B). Qualitative comparison helped us differentiate between methods that failed to integrate data from different sources and/or laboratories, but it was less useful for checking whether they preserved the biological information, given the number of compounds being plotted in only two dimensions.

The quantitative results (Fig. 6) revealed that Seurat CCA outperformed other methods in batch correction metrics, and the Seurat methods and Harmony achieved the highest bio-metric scores. Compared to the Baseline batch correction scores, scVI showed marginal improvement, the linear methods performed similarly, and DESC underperformed. Notably, this scenario yielded the weakest overall performance, with a substantial decline in batch correction scores compared to Scenario 3. This decline can be attributed to the integration of data from diverse microscope systems along with a large number of unique compounds.

Fig. 6: Evaluation scenario 5. Quantitative comparison of ten batch correction methods measuring batch effect removal (four batch correction metrics) and conservation of biological variance (six bio-metrics). Metrics are mean aggregated by category. Overall score is the weighted sum of aggregated batch correction and bio-metrics with 0.4 and 0.6 weights respectively.
