Image Analysis: Understanding and Mitigating Batch Effects in Histopathology

DOI: https://doi.org/10.47184/tp.2025.01.03

Batch effects are systematic variations, such as differences in staining or scanning protocols, that obscure true biological differences between samples. Recently, foundation models for pathology have proven effective at capturing morphological information. However, we observe that these models may also retain irrelevant information associated with technical batch effects. While this may not directly harm downstream task performance, model representations should nevertheless be robust to external distribution shifts. We advocate for systematic batch effect analysis in histopathology workflows to ensure reliable and generalizable AI models for clinical applications.

Sources of Batch Effects in Histopathology

Histopathology image analysis frequently encounters batch effects, i.e., systematic variations arising from differences in experimental conditions rather than genuine biological changes. These variations can originate from both technical and biological sources [1, 2]. Technical batch effects typically stem from inconsistencies during sample preparation (e.g., fixation and staining protocols), imaging (scanner type, resolution, and postprocessing), and artifacts such as tissue folds or coverslip misplacement. Biological batch effects, on the other hand, result from disease- or patient-specific covariates such as disease progression stage, age, sex, or race.

Batch effects pose significant problems in histopathological image analysis, as they can mask actual biological differences between samples, introduce false correlations, and impair model accuracy and generalization [3–7]. Batch correction methods therefore aim to remove technical variation while keeping biological signals intact. However, distinguishing between technical and biological sources remains challenging, and eliminating technical batch effects completely is rarely feasible, especially in multi-site studies involving heterogeneous conditions and populations [8, 9]. Extensive batch effect correction methods have been developed in domains such as single-cell RNA sequencing (e.g., ComBat [10], BBKNN [11], Harmony [12], Scanorama [13]), but these techniques are tailored to tabular data, limiting their direct application to histopathology.

The Age of Foundation Models for Pathology: Are They Robust to Clinical Domains?

Foundation models in pathology have demonstrated large performance gains on downstream tasks through self-supervised learning on large-scale datasets [14, 15]. However, batch effects are not analyzed systematically despite their frequent occurrence [16]. Recent studies have shown that these models are potentially not robust to clinical site-specific effects [8, 9, 17], especially on difficult tasks like mutation prediction or cancer staging from pathology images. Here, we advocate for including a systematic batch effect analysis in histopathology workflows by visualizing and quantifying batch effects associated with known covariates.

In particular, low-dimensional feature representations should be analyzed in connection with metadata, including the technical covariates recorded for each image, such as the clinical site, experiment number, staining protocol, or scanner, as well as the biological labels (Fig. 2a–d).

Low-dimensional embeddings obtained with PCA [18] or manifold learning-based methods like t-SNE [19, 20] and UMAP [21] allow a qualitative visualization of local and global embedding similarity within a dataset [22–25]. Color-coding these embeddings by the covariates reveals whether the batches separate visually [10] (Fig. 2e). While UMAP and t-SNE can give an intuition of batch effects, it is important not to over-interpret them, as they are sensitive to their input parameters and to noise [20, 26]. Clustering metrics like the Silhouette score and the mean local diversity complement visual inspection with quantitative values. These metrics can indicate whether batches are evenly mixed or significantly separated and thus help to select a suitable Path-FM for a given task (Fig. 1a, Fig. 2f).
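As a minimal sketch of such a quantitative check, the Silhouette score with respect to the batch covariate can be computed with scikit-learn. The arrays below are synthetic stand-ins for Path-FM tile embeddings from two batches, with the batch effect simulated as a fixed offset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in for Path-FM tile embeddings; batch 1 is shifted along a
# fixed direction to mimic a technical batch effect.
features = rng.normal(size=(200, 64))
batch = np.repeat([0, 1], 100)
features[batch == 1] += 1.5

# 2-D projection for visual inspection (t-SNE or UMAP could be used instead)
coords = PCA(n_components=2).fit_transform(features)

# Silhouette score w.r.t. the batch covariate: values near 0 suggest
# well-mixed batches, values near 1 indicate strong batch separation.
score = silhouette_score(features, batch)
print(f"batch silhouette: {score:.2f}")
```

In practice, `features` would be the extracted Path-FM embeddings and `batch` the recorded covariate (site, scanner, or staining protocol).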

One can also train a simple classifier (e.g., a random forest) to predict the batch covariate from Path-FM representations; accuracy or F1-scores far above chance indicate how strongly batch information is encoded in the features.
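A minimal sketch of this batch-prediction probe, again with simulated features standing in for Path-FM embeddings and a hypothetical two-site covariate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Toy stand-in for Path-FM features from two clinical sites; the offset
# models site-specific signal leaking into the representation.
X = rng.normal(size=(200, 64))
site = np.repeat([0, 1], 100)
X[site == 1] += 1.0

# If a simple classifier predicts the site far above chance (0.5 here),
# the representation encodes batch information.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(clf, X, site, cv=5, scoring="accuracy").mean()
print(f"site-prediction accuracy: {acc:.2f}")
```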

Correcting Batch Effects Present in Histopathology Images

Although complete standardization during data acquisition is challenging – owing to variables like stain degradation, variations in tissue sectioning, and unavoidable artifacts such as tissue folding – batch correction methods can still reduce these effects. Current practices for automated batch correction can be broadly grouped into image-space methods and feature-space methods.

Correction in Image Space

Quality control (Fig. 1b) in histopathology is crucial to ensure that digital images do not contain artifacts such as uneven illumination, pen markings, folded tissue, or out-of-focus regions. Additionally, tissue segmentation, a crucial step in whole-slide image processing, can be challenging, for example, due to low-intensity staining areas or immunohistochemistry staining. Pipelines offering solutions include HistoQC [27], PyHist [28], HistomicsTK [29], GrandQC [30], and Trident [31].
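As an illustrative sketch (not taken from any of the cited pipelines), a common quality-control heuristic filters tiles by their fraction of saturated pixels, since white background has near-zero HSV saturation while stained tissue does not:

```python
import numpy as np

def tissue_fraction(tile: np.ndarray, sat_thresh: float = 0.08) -> float:
    """Fraction of pixels whose HSV saturation exceeds a threshold.

    White background has near-zero saturation; stained tissue is
    saturated. Tiles below a minimum tissue fraction can be discarded
    before feature extraction.
    """
    rgb = tile.astype(float) / 255.0
    c_max = rgb.max(axis=-1)
    c_min = rgb.min(axis=-1)
    sat = np.where(c_max > 0, (c_max - c_min) / np.maximum(c_max, 1e-8), 0.0)
    return float((sat > sat_thresh).mean())

# Synthetic example: left half white background, right half pink-ish tissue
tile = np.full((64, 64, 3), 255, dtype=np.uint8)
tile[:, 32:] = (200, 120, 160)  # crude H&E-like hue
print(f"tissue fraction: {tissue_fraction(tile):.2f}")
```

The threshold value is an assumption and would need tuning per staining protocol; production pipelines such as HistoQC add many further checks (blur, pen markings, folds).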

Stain normalization (Fig. 1b) has been widely used to harmonize batch effects in histopathology by adapting the test data to a fixed staining pattern from the training domain [32–37]. However, altering the staining pattern while preserving the morphological structure, which is largely defined by stain colors, is a major challenge. Modern methods improve over traditional color-distribution matching [38] by training generative adversarial networks (GANs) to generate synthetic images [35, 39, 40]. In general, stain normalization decouples the harmonization from the downstream task. Data augmentation methods, in contrast, aim to increase data heterogeneity during training so that the model learns representations that can handle more diverse inputs, either by optimizing the choice of augmentations [33, 41] or by applying histology-specific GAN-based synthetic data augmentation [42, 43].
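To illustrate the idea behind traditional color-distribution matching, the sketch below shifts and scales each channel of a source tile to match the statistics of a target tile. Note this is a simplification: Reinhard-style normalization operates in LAB color space, whereas this toy version works directly on RGB:

```python
import numpy as np

def match_stats(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Match per-channel mean and std of `source` to those of `target`.

    Simplified color-distribution matching in RGB; the classic Reinhard
    method applies the same transform in LAB space instead.
    """
    src = source.astype(float)
    tgt = target.astype(float)
    out = np.empty_like(src)
    for c in range(3):
        mu_s, sd_s = src[..., c].mean(), src[..., c].std() + 1e-8
        mu_t, sd_t = tgt[..., c].mean(), tgt[..., c].std() + 1e-8
        out[..., c] = (src[..., c] - mu_s) / sd_s * sd_t + mu_t
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
src = rng.integers(60, 160, size=(32, 32, 3)).astype(np.uint8)   # "dark" tile
tgt = rng.integers(120, 220, size=(32, 32, 3)).astype(np.uint8)  # reference tile
norm = match_stats(src, tgt)
```

After normalization, the channel statistics of `norm` closely match those of `tgt`, which is precisely the harmonization that GAN-based methods extend to full stain appearance.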

Correction in Feature Space

Representation learning (Fig. 1c) aims to compress the information contained in the data into a low-dimensional feature vector. Particularly in histopathology, this representation should reflect the most informative biological or morphological features, ideally emphasizing differences among phenotypical subpopulations rather than batch-specific artifacts. In this regard, self-supervised learning (SSL) has gained popularity in digital pathology, with benchmarks demonstrating strong model performance when trained on diverse, domain-specific datasets [44–46]. Recent pathology foundation models (Path-FMs) benefit from large-scale datasets and SSL training, dominating pathology downstream tasks with their exceptional pattern recognition capabilities [14, 15, 47]. However, choosing the best model depends on the specific dataset and downstream task. Although Path-FMs represent state-of-the-art feature extraction tools, they vary considerably in generalization and robustness [9]. Efforts such as distilling large foundation models into smaller ones like H0-mini [17] have shown promise, achieving competitive performance and enhanced robustness to batch effects.

A potential reason Path-FMs exhibit batch effects is shortcut learning [8, 48], where models prioritize easily accessible but less predictive features, such as object texture or background in natural images, or predominant features like staining or tissue thickness in pathology images, rather than true biological signals. These site-specific signatures are embedded deeply in the feature vectors, impacting performance on downstream tasks. Traditional stain normalization methods only partially mitigate these effects, leaving supervised learning methods vulnerable to batch-related artifacts.

Conversely, multi-modal foundation models integrate diverse modalities, indirectly reducing batch effects by associating visually distinct samples with shared semantics in the training objective. Vision-language pretrained models, such as CLIP and several histopathology equivalents (e.g., CONCH [15], TITAN [49], PRISM [50], HistoGPT [51], Chief [52], GigaPath [53], PathAlign [54], MUSK [55], PLIP [56], BioMed-CLIP [57]), demonstrate enhanced robustness under distribution shifts. Modalities like bulk RNA-seq and scRNA-seq, for which robust batch correction methods exist, further aid pathology model pretraining and have improved model performance across various datasets (e.g., TANGLE [58], THREADS [59], PORPOISE [60], GraphWSI [61]). In a rapidly moving field, public multi-modal datasets (HEST-1k [62], Quilt-1m [63], PMC-15M – BiomedCLIP [57]), now scaling to millions of pairs, offer valuable platforms for developing robust self-supervised foundation models. While such multi-modal training holds promise, caution is essential to avoid inadvertently incorporating dominant site-specific information. For example, Howard et al. [16] demonstrate that a multi-modal survival analysis model effectively integrates strong site information: using only the site as input achieves nearly the same accuracy as the original model. Therefore, implementing site-preserved k-fold cross-validation is recommended.
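Site-preserved cross-validation can be implemented with scikit-learn's `GroupKFold`, which keeps all samples from one site in the same fold. The arrays below are placeholders for slide-level features, labels, and four hypothetical clinical sites:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))       # stand-in for slide-level features
y = rng.integers(0, 2, size=120)    # stand-in for biological labels
site = np.repeat(np.arange(4), 30)  # four hypothetical clinical sites

# GroupKFold keeps all samples from one site in the same fold, so each
# model is always evaluated on sites it never saw during training.
folds = list(GroupKFold(n_splits=4).split(X, y, groups=site))
for train_idx, test_idx in folds:
    print("held-out sites:", sorted(set(site[test_idx])))
```

Evaluating on entirely unseen sites reveals whether a model relies on site-specific shortcuts rather than biological signal.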

Conclusion

As tissue image analysis research rapidly adopts deep learning for downstream tasks, including knowledge discovery and diagnostic improvements, the data should reflect the broad spectrum of cases so that methods generalize well to real-world settings. Since current batch correction methods cannot resolve batch effects entirely, we advocate for a systematic analysis of batch effects throughout the computational pipeline. The rapidly expanding landscape of computational tools in histopathology not only presents challenges in evaluating batch effects, but also provides a valuable opportunity to develop rigorous testing frameworks that ensure reliable and robust disease diagnosis.

Funding

C. M. acknowledges funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant Agreement No. 866411 & 101113551) and support from the Hightech Agenda Bayern. R. G., S. J. W., and S. S. B. were supported by the Helmholtz Association under the joint research school Munich School for Data Science (MUDS). S. J. W. is supported by the Add-on Fellowship of the Joachim Herz Foundation.

Authors
Rushin H. Gindra, M. Sc. (Corresponding author)
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
Department of Medicine and Health, Technical University Munich, Munich, Bavaria, Germany
Sayedali Shetab Boushehri, M. Sc.
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
Data & Analytics (D&A), Roche Innovation Center Munich, Roche Pharma Research and Early Development (pRED), Penzberg, Bavaria, Germany
Sophia J. Wagner, M. Sc.
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
School of Computation, Information and Technology, Technical University Munich, Munich, Bavaria, Germany
Manuel Tran, M. Sc.
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
School of Computation, Information and Technology, Technical University Munich, Munich, Bavaria, Germany
Dr. Dominik Jens Elias Winter
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
Prof. Dr. Julia A. Schnabel
Institute for Machine Learning in Biomedical Imaging, Helmholtz Munich, Munich, Bavaria, Germany
Prof. Dr. Dieter Saur
Department of Medicine and Health, Technical University Munich, Munich, Bavaria, Germany
Dr. Carsten Marr
AI for Health, Helmholtz Munich, Munich, Bavaria, Germany
Dr. Tingying Peng
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany