Image Analysis: Understanding and Mitigating Batch Effects in Histopathology
DOI: https://doi.org/10.47184/tp.2025.01.03
Batch effects are systematic variations, such as differences in staining or scanning protocols, that obscure true biological differences in samples. Recently trained foundation models for pathology have proven to capture morphological information effectively. However, we observe that these models may also retain irrelevant information associated with technical batch effects. While this might not harm the overall goal of improving the downstream task performance, model representations should still be robust to external distribution shifts. We advocate for systematic batch effect analysis in histopathology workflows to ensure reliable and generalizable AI models for clinical applications.
Sources of Batch Effects in Histopathology
Histopathology image analysis frequently encounters batch effects, which are systematic variations arising from differences in experimental conditions rather than genuine biological changes. These variations can originate from both technical and biological sources [1, 2]. Technical batch effects typically stem from inconsistencies during sample preparation (e. g., fixation and staining protocols), imaging processes (scanner types, resolution, and postprocessing), and artifacts such as tissue folds or coverslip misplacements. Biological batch effects, on the other hand, result from disease or patient-specific covariates like disease progression stage, age, sex, or race.
Batch effects pose significant problems in histopathological image analysis, as they can mask actual biological differences between samples, introduce false correlations, and impair model accuracy and generalization [3–7]. Therefore, batch correction methods aim at addressing technical variations while keeping biological signals intact. However, distinguishing between technical and biological sources remains challenging. Similarly, eliminating technical batch effects completely is rarely feasible, especially in multi-site studies involving heterogeneous conditions and populations [8, 9]. Extensive studies and methods addressing batch effect correction have been developed in domains such as single-cell RNA sequencing (e. g., ComBat [10], BBKNN [11], Harmony [12], Scanorama [13]), but these techniques are tailored for tabular data, limiting their direct application to histopathology.
The Age of Foundation Models for Pathology: Are They Robust to Clinical Domains?
Foundation models in pathology have demonstrated large performance gains on downstream tasks through self-supervised learning on large-scale datasets [14, 15]. However, batch effects are not analyzed systematically despite their frequent occurrence [16]. Recently, studies have shown that models are potentially not robust to clinical site-specific effects [8, 9, 17], especially on difficult tasks like mutation prediction or cancer-staging from pathology images. Here, we advocate for including a systematic batch effect analysis in histopathology workflows by visualizing and quantifying batch effects associated with known covariates.
In particular, low-dimensional feature representations should be analyzed in connection with the metadata of each image: technical covariates such as the clinical site, experiment number, staining protocol, or scanner, together with the biological labels (Fig. 2a–d).

Figure 2: Workflow to assess batch effects present in data in a quantitative and qualitative manner. a) Data pooling with all the relevant metadata; b) Whole Slide Image (WSI) preprocessing and patch generation; c) Feature extraction from patches based on either hand-crafted features or pretrained networks; d) Preparation of the features and metadata table; e) Low-dimensional feature representation using UMAP colored by different covariates (columns) for different feature extractors (rows) for two patch datasets from patients with colorectal cancer (n = 9,650, randomly sampled patches from the datasets TCGA [64], CPTAC [65], and PAIP [66]) and with lung cancer (n = 300, randomly sampled patches from TCGA, CPTAC and Pennycuick [67]); f) Quantitative metrics (mean local diversity (mLD) and Silhouette score) to assess the batch effect in the extracted features with respect to data sources and labels. Higher mixing is better for data sources, lower mixing is better for labels.
Low-dimensional embeddings obtained with PCA [18] or with manifold learning methods such as t-SNE [19, 20] or UMAP [21] offer a qualitative view of local and global embedding similarity within a dataset [22–25]. Coloring the embedding by each covariate shows whether the batches separate into distinct clusters [10] (Fig. 2e). While UMAP and t-SNE can give an understanding of batch effects, it is important not to over-interpret them, as they are sensitive to their input parameters and noise [20, 26]. Clustering metrics like the Silhouette score and the mean local diversity complement the visualization by quantifying how strongly samples cluster by a given covariate. These metrics can indicate whether batches are evenly represented within local neighborhoods or markedly imbalanced, and thus help in choosing a suitable pathology foundation model (Path-FM) for a given task (Fig. 1a, Fig. 2f).
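The quantitative check described above can be sketched on synthetic data. The example below is a minimal illustration, assuming NumPy and scikit-learn are available. The Silhouette score comes from scikit-learn. The function `neighbourhood_mixing` is a hypothetical proxy for local batch mixing (the fraction of each sample's nearest neighbors that belong to another batch) and is not necessarily the mean local diversity definition used for Fig. 2f. The feature matrices are simulated and do not come from any pathology model.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors


def neighbourhood_mixing(features, batch_labels, k=15):
    """Mean fraction of each sample's k nearest neighbours from a different batch.

    Hypothetical proxy for local batch mixing, not the published mLD definition.
    """
    features = np.asarray(features)
    batch_labels = np.asarray(batch_labels)
    # Query k + 1 neighbours: each point is normally returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    _, idx = nn.kneighbors(features)
    neighbour_batches = batch_labels[idx[:, 1:]]  # drop the self-match in column 0
    differs = neighbour_batches != batch_labels[:, None]
    return float(differs.mean())


# Simulated features: two batches (e.g. two hypothetical sites) with 200 samples each.
rng = np.random.default_rng(0)
n_per_batch, dim = 200, 16
batch = np.repeat([0, 1], n_per_batch)

mixed = rng.normal(size=(2 * n_per_batch, dim))  # no batch effect
shifted = mixed.copy()
shifted[batch == 1] += 3.0                       # simulated site-specific offset

for name, feats in [("mixed", mixed), ("shifted", shifted)]:
    sil = silhouette_score(feats, batch)         # computed w.r.t. the batch covariate
    mix = neighbourhood_mixing(feats, batch)
    print(f"{name}: silhouette={sil:.2f}, neighbourhood mixing={mix:.2f}")
```

For a technical covariate such as the data source, a Silhouette score near zero and high neighborhood mixing suggest that batches overlap. The same two numbers computed against biological labels should point in the opposite direction, in line with the reading of Fig. 2f.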

Figure 1: Overview schematic showing different batch effects that exist in histopathology data and the current methods to mitigate them. a) Evaluating batch effects in the data pooled from multiple sources; b) Batch correction in Image Space: This involves quality control and color normalization and/or augmentation for further downstream analysis; c) Batch correction in Feature Space: Representation learning based methods on patch and slide level for various downstream analysis.
One can also train a simple classifier (e. g., a random forest) on Path-FM representations, for example to predict the data source, and use performance metrics like accuracy or F1-score to gauge how strongly batch information influences the features.
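A classifier-based probe of this kind might look as follows. This is a sketch under stated assumptions: it reads the suggestion above as predicting the batch covariate from the features, it uses scikit-learn's `RandomForestClassifier` with cross-validated accuracy, and the helper name `batch_probe_accuracy` and the simulated features are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def batch_probe_accuracy(features, batch_labels, n_splits=5, seed=0):
    """Cross-validated accuracy of a random forest predicting the batch label.

    Accuracy near chance suggests little batch information in the features;
    accuracy near 1.0 suggests the batch is easy to read out.
    """
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    scores = cross_val_score(clf, features, batch_labels, cv=n_splits, scoring="accuracy")
    return float(scores.mean())


# Simulated features for two balanced batches (chance level is 0.5).
rng = np.random.default_rng(0)
n_per_batch, dim = 150, 8
batch = np.repeat([0, 1], n_per_batch)

mixed = rng.normal(size=(2 * n_per_batch, dim))  # no batch effect
shifted = mixed.copy()
shifted[batch == 1] += 2.0                       # simulated batch offset

print("mixed:  ", round(batch_probe_accuracy(mixed, batch), 2))
print("shifted:", round(batch_probe_accuracy(shifted, batch), 2))
```

Swapping `scoring="accuracy"` for `scoring="f1_macro"` gives an F1-based variant, which is more informative when batches are imbalanced.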
Correcting Batch Effects Present in Histopathology Images
Although complete standardization during data acquisition is challenging – owing to variables like stain degradation, variations in tissue sectioning, and unavoidable artifacts such as tissue folding – batch correction methods can still reduce these effects. Current practices for automated batch correction can be broadly grouped into image-space methods and feature-space methods.
Correction in Image Space
Quality control (Fig. 1b) in histopathology is crucial to ensure that digital images do not contain artifacts such as uneven illumination, pen markings, folded tissues, or out-of-focus regions. Additionally, tissue segmentation, crucial for whole-slide image processing, can be challenging, for example, due to low-intensity staining areas or immunohistochemistry staining. Pipelines that address these issues include HistoQC [27], PyHist [28], HistomicsTK [29], GrandQC [30], and Trident [31].
Stain normalization (Fig. 1b) is widely used to reduce batch effects in histopathology by adapting the test data to a fixed staining pattern from the training domain [32–37]. However, altering the staining pattern while preserving the morphological structure, which is itself largely defined by stain colors, is a major challenge. Traditional methods aim at matching the color distributions [38]; modern methods improve on them by training generative adversarial networks (GANs) to generate synthetic images [35, 39, 40]. In general, stain normalization decouples the harmonization from the downstream task. Data augmentation methods, by contrast, increase the data heterogeneity during training so that the model learns representations that can handle more diverse inputs, either by optimizing the choice of augmentations [33, 41] or by applying histology-specific GAN-based synthetic data augmentation [42, 43].
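To make the idea of color distribution matching concrete, the following sketch shifts and scales each color channel of a patch so that its mean and standard deviation match those of a reference patch. It is a deliberately simplified illustration, assuming only NumPy: the matching is done directly in RGB for brevity, whereas published methods typically work in a perceptual or stain-specific color space, and the function name `match_channel_statistics` and the random test patches are hypothetical.

```python
import numpy as np


def match_channel_statistics(source, target, eps=1e-6):
    """Match per-channel mean and std of `source` to those of `target`.

    Both inputs are uint8 RGB arrays of shape (H, W, 3). Simplified sketch:
    statistics are matched in RGB, not in a stain-specific colour space.
    """
    src = source.astype(np.float64)
    tgt = target.astype(np.float64)
    src_mean, src_std = src.mean(axis=(0, 1)), src.std(axis=(0, 1))
    tgt_mean, tgt_std = tgt.mean(axis=(0, 1)), tgt.std(axis=(0, 1))
    # Standardise each source channel, then rescale to the target statistics.
    matched = (src - src_mean) / (src_std + eps) * tgt_std + tgt_mean
    return np.clip(np.rint(matched), 0, 255).astype(np.uint8)


# Two random "patches" with different colour statistics stand in for two stainings.
rng = np.random.default_rng(0)
source = np.clip(rng.normal(loc=[200, 120, 180], scale=15, size=(64, 64, 3)), 0, 255).astype(np.uint8)
target = np.clip(rng.normal(loc=[170, 100, 160], scale=25, size=(64, 64, 3)), 0, 255).astype(np.uint8)

matched = match_channel_statistics(source, target)
print("target mean :", target.mean(axis=(0, 1)).round(1))
print("matched mean:", matched.mean(axis=(0, 1)).round(1))
```

The same function can serve as a crude augmentation if the target statistics are randomly perturbed during training instead of being fixed, which mirrors the normalization versus augmentation distinction drawn above.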
Correction in Feature Space
Representation learning (Fig. 1c) aims at compressing information contained in data into a low-dimensional feature vector. Particularly in histopathology, this representation should reflect the most informative features representing all biological or morphological information, ideally emphasizing differences among phenotypical subpopulations rather than batch-specific artifacts. In this regard, Self-Supervised Learning (SSL) has gained popularity in digital pathology, with benchmarks demonstrating strong model performance when trained on diverse, domain-specific datasets [44–46]. Recent pathology foundation models (Path-FMs) benefit from large-scale datasets and SSL training, dominating pathology downstream tasks with their exceptional pattern recognition capabilities [14, 15, 47]. However, choosing the best model depends on the specific dataset and downstream task. Although Path-FMs represent state-of-the-art feature extraction tools, they vary considerably in model generalization and robustness [9]. Efforts such as distilling large foundation models into smaller models like H0-mini [17] have shown promise, achieving competitive performance and enhanced robustness to batch effects.
A potential reason Path-FMs exhibit batch effects is shortcut learning [8, 48], where models prioritize easily accessible but less predictive features rather than true biological signals. In natural images, such features include object texture or background; in pathology images, they include predominant features like staining or tissue thickness. These site-specific signatures are embedded deeply in feature vectors, impacting performance on downstream tasks. Traditional stain normalization methods only partially mitigate these effects, leaving supervised learning methods vulnerable to batch-related artifacts.
Conversely, multi-modal foundation models integrate diverse modalities, indirectly reducing batch effects by associating visually distinct samples with shared semantics in the training objective. Vision-language pretrained models, such as CLIP and several histopathology equivalents (e. g., CONCH [15], TITAN [49], PRISM [50], HistoGPT [51], Chief [52], GigaPath [53], PathAlign [54], MUSK [55], PLIP [56], BiomedCLIP [57]), demonstrate enhanced robustness under distribution shifts. Modalities like bulk RNA-seq and scRNA-seq, known for robust batch correction methods, further aid pathology model pretraining and have improved model performance across various datasets (e. g., TANGLE [58], THREADS [59], PORPOISE [60], GraphWSI [61]). In a rapidly moving field, public multi-modal datasets (HEST-1k [62], Quilt-1m [63], PMC-15M – BiomedCLIP [57]), now scaling to millions of pairs, offer valuable platforms for developing robust self-supervised foundation models. While such multi-modal training holds promise, caution is essential to avoid inadvertently incorporating dominant site-specific information. For example, Howard et al. [16] demonstrate that a multi-modal survival analysis model effectively integrates strong site information: using only site information as input can achieve nearly the same accuracy as the original model. Therefore, implementing site-preserved k-fold cross-validation, in which all samples from a given site are kept within the same fold, is recommended.
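A site-preserving split can be approximated with grouped cross-validation. The sketch below assumes scikit-learn's `GroupKFold`, which guarantees that no group appears in both the training and the test part of a fold. It is a simplified stand-in and not the exact procedure of Howard et al. [16]; the site IDs, features, and labels are simulated.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Simulated cohort: 120 slides from 6 hypothetical sites, with binary labels.
rng = np.random.default_rng(0)
n_slides, n_sites = 120, 6
sites = np.arange(n_slides) % n_sites          # hypothetical site ID per slide
X = rng.normal(size=(n_slides, 32))            # stand-in for slide-level features
y = rng.integers(0, 2, size=n_slides)          # stand-in for labels

cv = GroupKFold(n_splits=3)
folds = list(cv.split(X, y, groups=sites))

for fold, (train_idx, test_idx) in enumerate(folds):
    train_sites = set(sites[train_idx].tolist())
    test_sites = set(sites[test_idx].tolist())
    # No site contributes slides to both the training and the held-out part.
    print(f"fold {fold}: train sites={sorted(train_sites)}, test sites={sorted(test_sites)}")
```

When class balance across folds also matters, scikit-learn's `StratifiedGroupKFold` is a closer fit, since it tries to preserve label proportions while still keeping groups intact.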
Conclusion
As tissue image analysis research rapidly adapts deep learning to downstream tasks, including knowledge discovery and technological diagnostic improvements, data should reflect the broad spectrum of cases so that methods can generalize well to the real-world setting. Since current batch correction methods cannot resolve batch effects entirely, we advocate for a systematic analysis of batch effects in the data throughout the computational pipeline. The rapidly expanding landscape of computational tools in histopathology not only presents challenges in evaluating batch effects, but also provides a valuable opportunity to develop rigorous testing frameworks that ensure reliable and robust disease diagnosis.
Funding
C. M. acknowledges funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant Agreement No. 866411 & 101113551) and support from the Hightech Agenda Bayern. R. G., S. J. W. and S. S. B. were supported by the Helmholtz Association under the joint research school Munich School for Data Science (MUDS). S. J. W. is supported by the Add-on Fellowship of the Joachim Herz Foundation.