Image Analysis: Understanding and Mitigating Batch Effects in Histopathology

DOI: https://doi.org/10.47184/tp.2025.01.03

Batch effects are systematic variations, such as differences in staining or scanning protocols, that obscure true biological differences between samples. Recently, foundation models for pathology have proven effective at capturing morphological information. However, we observe that these models may also retain irrelevant information associated with technical batch effects. While this may not directly harm downstream task performance, model representations should nevertheless be robust to external distribution shifts. We advocate for systematic batch effect analysis in histopathology workflows to ensure reliable and generalizable AI models for clinical applications.

Sources of Batch Effects in Histopathology

Histopathology image analysis frequently encounters batch effects, i.e., systematic variations arising from differences in experimental conditions rather than genuine biological changes. These variations can originate from both technical and biological sources [1, 2]. Technical batch effects typically stem from inconsistencies during sample preparation (e.g., fixation and staining protocols), imaging (scanner type, resolution, and postprocessing), and artifacts such as tissue folds or coverslip misplacement. Biological batch effects, on the other hand, result from disease- or patient-specific covariates such as disease progression stage, age, sex, or race.

Batch effects pose significant problems in histopathological image analysis, as they can mask actual biological differences between samples, introduce false correlations, and impair model accuracy and generalization [3–7]. Batch correction methods therefore aim to remove technical variation while keeping biological signals intact. However, distinguishing between technical and biological sources remains challenging, and eliminating technical batch effects completely is rarely feasible, especially in multi-site studies involving heterogeneous conditions and populations [8, 9]. Extensive batch effect correction methods have been developed in domains such as single-cell RNA sequencing (e.g., ComBat [10], BBKNN [11], Harmony [12], Scanorama [13]), but these techniques are tailored to tabular data, limiting their direct application to histopathology.

The Age of Foundation Models for Pathology: Are They Robust to Clinical Domains?

Foundation models in pathology have demonstrated large performance gains on downstream tasks through self-supervised learning on large-scale datasets [14, 15]. However, batch effects are not analyzed systematically despite their frequent occurrence [16]. Recent studies have shown that these models are potentially not robust to clinical site-specific effects [8, 9, 17], especially on difficult tasks like mutation prediction or cancer staging from pathology images. Here, we advocate for including a systematic batch effect analysis in histopathology workflows by visualizing and quantifying batch effects associated with known covariates.

In particular, low-dimensional feature representations should be analyzed in connection with metadata, including the technical covariates recorded for each image, such as the clinical site, experiment number, staining protocol, or scanner, as well as the biological labels (Fig. 2a–d).

Low-dimensional embeddings obtained with PCA [18] or manifold learning-based methods like t-SNE [19, 20] and UMAP [21] allow a qualitative visualization of local and global embedding similarity within a dataset [22–25]. Color-coding these embeddings by the covariates reveals whether the batches separate visually [10] (Fig. 2e). While UMAP and t-SNE can give an intuition of batch effects, it is important not to over-interpret them, as they are sensitive to their input parameters and to noise [20, 26]. Clustering metrics like the Silhouette score and the mean local diversity complement visual inspection with quantitative values. These metrics can indicate whether batches are evenly mixed or significantly separated and thus help to select a suitable Path-FM for a given task (Fig. 1a, Fig. 2f).
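As a minimal sketch of such a quantitative check, the Silhouette score with respect to the batch covariate can be computed with scikit-learn. The arrays below are synthetic stand-ins for Path-FM tile embeddings from two batches, with the batch effect simulated as a fixed offset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in for Path-FM tile embeddings; batch 1 is shifted along a
# fixed direction to mimic a technical batch effect.
features = rng.normal(size=(200, 64))
batch = np.repeat([0, 1], 100)
features[batch == 1] += 1.5

# 2-D projection for visual inspection (t-SNE or UMAP could be used instead)
coords = PCA(n_components=2).fit_transform(features)

# Silhouette score w.r.t. the batch covariate: values near 0 suggest
# well-mixed batches, values near 1 indicate strong batch separation.
score = silhouette_score(features, batch)
print(f"batch silhouette: {score:.2f}")
```

In practice, `features` would be the extracted Path-FM embeddings and `batch` the recorded covariate (site, scanner, or staining protocol).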

One can also train a simple classifier (e.g., a random forest) to predict the batch covariate from Path-FM representations; accuracy or F1-scores far above chance indicate how strongly batch information is encoded in the features.
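A minimal sketch of this batch-prediction probe, again with simulated features standing in for Path-FM embeddings and a hypothetical two-site covariate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Toy stand-in for Path-FM features from two clinical sites; the offset
# models site-specific signal leaking into the representation.
X = rng.normal(size=(200, 64))
site = np.repeat([0, 1], 100)
X[site == 1] += 1.0

# If a simple classifier predicts the site far above chance (0.5 here),
# the representation encodes batch information.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(clf, X, site, cv=5, scoring="accuracy").mean()
print(f"site-prediction accuracy: {acc:.2f}")
```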

Correcting Batch Effects Present in Histopathology Images

Although complete standardization during data acquisition is challenging – owing to variables like stain degradation, variations in tissue sectioning, and unavoidable artifacts such as tissue folding – batch correction methods can still reduce these effects. Current practices for automated batch correction can be broadly grouped into image-space methods and feature-space methods.

Correction in Image Space

Quality control (Fig. 1b) in histopathology is crucial to ensure that digital images do not contain artifacts such as uneven illumination, pen markings, folded tissue, or out-of-focus regions. Additionally, tissue segmentation, a crucial step in whole-slide image processing, can be challenging, for example, due to low-intensity staining areas or immunohistochemistry staining. Pipelines offering solutions include HistoQC [27], PyHist [28], HistomicsTK [29], GrandQC [30], and Trident [31].
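As an illustrative sketch (not taken from any of the cited pipelines), a common quality-control heuristic filters tiles by their fraction of saturated pixels, since white background has near-zero HSV saturation while stained tissue does not:

```python
import numpy as np

def tissue_fraction(tile: np.ndarray, sat_thresh: float = 0.08) -> float:
    """Fraction of pixels whose HSV saturation exceeds a threshold.

    White background has near-zero saturation; stained tissue is
    saturated. Tiles below a minimum tissue fraction can be discarded
    before feature extraction.
    """
    rgb = tile.astype(float) / 255.0
    c_max = rgb.max(axis=-1)
    c_min = rgb.min(axis=-1)
    sat = np.where(c_max > 0, (c_max - c_min) / np.maximum(c_max, 1e-8), 0.0)
    return float((sat > sat_thresh).mean())

# Synthetic example: left half white background, right half pink-ish tissue
tile = np.full((64, 64, 3), 255, dtype=np.uint8)
tile[:, 32:] = (200, 120, 160)  # crude H&E-like hue
print(f"tissue fraction: {tissue_fraction(tile):.2f}")
```

The threshold value is an assumption and would need tuning per staining protocol; production pipelines such as HistoQC add many further checks (blur, pen markings, folds).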

Stain normalization (Fig. 1b) has been widely used to harmonize batch effects in histopathology by adapting the test data to a fixed staining pattern from the training domain [32–37]. However, altering the staining pattern while preserving the morphological structure, which is largely defined by stain colors, is a major challenge. Modern methods improve over traditional color-distribution matching [38] by training generative adversarial networks (GANs) to generate synthetic images [35, 39, 40]. In general, stain normalization decouples the harmonization from the downstream task. Data augmentation methods, in contrast, aim to increase data heterogeneity during training so that the model learns representations that can handle more diverse inputs, either by optimizing the choice of augmentations [33, 41] or by applying histology-specific GAN-based synthetic data augmentation [42, 43].
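To illustrate the idea behind traditional color-distribution matching, the sketch below shifts and scales each channel of a source tile to match the statistics of a target tile. Note this is a simplification: Reinhard-style normalization operates in LAB color space, whereas this toy version works directly on RGB:

```python
import numpy as np

def match_stats(source: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Match per-channel mean and std of `source` to those of `target`.

    Simplified color-distribution matching in RGB; the classic Reinhard
    method applies the same transform in LAB space instead.
    """
    src = source.astype(float)
    tgt = target.astype(float)
    out = np.empty_like(src)
    for c in range(3):
        mu_s, sd_s = src[..., c].mean(), src[..., c].std() + 1e-8
        mu_t, sd_t = tgt[..., c].mean(), tgt[..., c].std() + 1e-8
        out[..., c] = (src[..., c] - mu_s) / sd_s * sd_t + mu_t
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
src = rng.integers(60, 160, size=(32, 32, 3)).astype(np.uint8)   # "dark" tile
tgt = rng.integers(120, 220, size=(32, 32, 3)).astype(np.uint8)  # reference tile
norm = match_stats(src, tgt)
```

After normalization, the channel statistics of `norm` closely match those of `tgt`, which is precisely the harmonization that GAN-based methods extend to full stain appearance.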

Correction in Feature Space

Representation learning (Fig. 1c) aims to compress the information contained in the data into a low-dimensional feature vector. Particularly in histopathology, this representation should reflect the most informative biological or morphological features, ideally emphasizing differences among phenotypical subpopulations rather than batch-specific artifacts. In this regard, self-supervised learning (SSL) has gained popularity in digital pathology, with benchmarks demonstrating strong model performance when trained on diverse, domain-specific datasets [44–46]. Recent pathology foundation models (Path-FMs) benefit from large-scale datasets and SSL training, dominating pathology downstream tasks with their exceptional pattern recognition capabilities [14, 15, 47]. However, choosing the best model depends on the specific dataset and downstream task. Although Path-FMs represent state-of-the-art feature extraction tools, they vary considerably in generalization and robustness [9]. Efforts such as distilling large foundation models into smaller ones like H0-mini [17] have shown promise, achieving competitive performance and enhanced robustness to batch effects.

A potential reason Path-FMs exhibit batch effects is shortcut learning [8, 48], where models prioritize easily accessible but less predictive features, such as object texture or background in natural images, or predominant features like staining or tissue thickness in pathology images, rather than true biological signals. These site-specific signatures are embedded deeply in the feature vectors, impacting performance on downstream tasks. Traditional stain normalization methods only partially mitigate these effects, leaving supervised learning methods vulnerable to batch-related artifacts.

Conversely, multi-modal foundation models integrate diverse modalities, indirectly reducing batch effects by associating visually distinct samples with shared semantics in the training objective. Vision-language pretrained models, such as CLIP and several histopathology equivalents (e.g., CONCH [15], TITAN [49], PRISM [50], HistoGPT [51], Chief [52], GigaPath [53], PathAlign [54], MUSK [55], PLIP [56], BioMed-CLIP [57]), demonstrate enhanced robustness under distribution shifts. Modalities like bulk RNA-seq and scRNA-seq, for which robust batch correction methods exist, further aid pathology model pretraining and have improved model performance across various datasets (e.g., TANGLE [58], THREADS [59], PORPOISE [60], GraphWSI [61]). In a rapidly moving field, public multi-modal datasets (HEST-1k [62], Quilt-1m [63], PMC-15M – BiomedCLIP [57]), now scaling to millions of pairs, offer valuable platforms for developing robust self-supervised foundation models. While such multi-modal training holds promise, caution is essential to avoid inadvertently incorporating dominant site-specific information. For example, Howard et al. [16] demonstrate that a multi-modal survival analysis model effectively integrates strong site information: using only the site as input achieves nearly the same accuracy as the original model. Therefore, implementing site-preserved k-fold cross-validation is recommended.
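Site-preserved cross-validation can be implemented with scikit-learn's `GroupKFold`, which keeps all samples from one site in the same fold. The arrays below are placeholders for slide-level features, labels, and four hypothetical clinical sites:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))       # stand-in for slide-level features
y = rng.integers(0, 2, size=120)    # stand-in for biological labels
site = np.repeat(np.arange(4), 30)  # four hypothetical clinical sites

# GroupKFold keeps all samples from one site in the same fold, so each
# model is always evaluated on sites it never saw during training.
folds = list(GroupKFold(n_splits=4).split(X, y, groups=site))
for train_idx, test_idx in folds:
    print("held-out sites:", sorted(set(site[test_idx])))
```

Evaluating on entirely unseen sites reveals whether a model relies on site-specific shortcuts rather than biological signal.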

Conclusion

As tissue image analysis research rapidly adopts deep learning for downstream tasks, including knowledge discovery and diagnostic improvements, the data should reflect the broad spectrum of cases so that methods generalize well to real-world settings. Since current batch correction methods cannot resolve batch effects entirely, we advocate for a systematic analysis of batch effects throughout the computational pipeline. The rapidly expanding landscape of computational tools in histopathology not only presents challenges in evaluating batch effects, but also provides a valuable opportunity to develop rigorous testing frameworks that ensure reliable and robust disease diagnosis.

Funding

C. M. acknowledges funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant Agreement No. 866411 & 101113551) and support from the Hightech Agenda Bayern. R. G., S. J. W., and S. S. B. were supported by the Helmholtz Association under the joint research school Munich School for Data Science (MUDS). S. J. W. is supported by the Add-on Fellowship of the Joachim Herz Foundation.

Authors
Rushin H. Gindra, M. Sc. (Corresponding author)
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
Department of Medicine and Health, Technical University Munich, Munich, Bavaria, Germany
Sayedali Shetab Boushehri, M. Sc.
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
Data & Analytics (D&A), Roche Innovation Center Munich, Roche Pharma Research and Early Development (pRED), Penzberg, Bavaria, Germany
Sophia J. Wagner, M. Sc.
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
School of Computation, Information and Technology, Technical University Munich, Munich, Bavaria, Germany
Manuel Tran, M. Sc.
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
School of Computation, Information and Technology, Technical University Munich, Munich, Bavaria, Germany
Dr. Dominik Jens Elias Winter
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany
Prof. Dr. Julia A. Schnabel
Institute for Machine Learning in Biomedical Imaging, Helmholtz Munich, Munich, Bavaria, Germany
Prof. Dr. Dieter Saur
Department of Medicine and Health, Technical University Munich, Munich, Bavaria, Germany
Dr. Carsten Marr
AI for Health, Helmholtz Munich, Munich, Bavaria, Germany
Dr. Tingying Peng
Helmholtz AI, Helmholtz Munich, Munich, Bavaria, Germany