Empirical investigation of multi-source cross-validation in clinical ECG classification (2403.15012v2)
Abstract: Traditionally, machine learning-based clinical prediction models have been trained and evaluated on patient data from a single source, such as a hospital. Cross-validation methods can be used to estimate the accuracy of such models on new patients originating from the same source, by repeated random splitting of the data. However, such estimates tend to be highly overoptimistic when compared to the accuracy obtained when models are deployed to sources not represented in the dataset, such as a new hospital. The increasing availability of multi-source medical datasets provides new opportunities for obtaining more comprehensive and realistic evaluations of expected accuracy through source-level cross-validation designs. In this study, we present a systematic empirical evaluation of standard K-fold cross-validation and leave-source-out cross-validation methods in a multi-source setting. We consider the task of electrocardiogram-based cardiovascular disease classification, combining and harmonizing the openly available PhysioNet CinC Challenge 2021 and the Shandong Provincial Hospital datasets for our study. Our results show that K-fold cross-validation, both on single-source and multi-source data, systematically overestimates prediction performance when the end goal is to generalize to new sources. Leave-source-out cross-validation provides more reliable performance estimates, having close to zero bias though larger variability. The evaluation highlights the dangers of obtaining misleading cross-validation results on medical data and demonstrates how these issues can be mitigated when multi-source data are available.
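To make the distinction between the two designs concrete, the following is a minimal sketch of how the splits differ. It is not the authors' code: the function names `k_fold_splits` and `leave_source_out_splits`, the toy source labels, and the fold count are all assumptions chosen for illustration. K-fold partitions recordings at random regardless of origin, while leave-source-out holds out every recording from one source at a time.

```python
import random
from collections import defaultdict


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs from a random K-fold partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, test


def leave_source_out_splits(sources):
    """Yield (held_out_source, train_idx, test_idx), one source held out per split."""
    by_source = defaultdict(list)
    for idx, src in enumerate(sources):
        by_source[src].append(idx)
    for held_out in sorted(by_source):
        test = by_source[held_out]
        train = sorted(
            idx
            for src, idxs in by_source.items()
            if src != held_out
            for idx in idxs
        )
        yield held_out, train, test


if __name__ == "__main__":
    # Hypothetical toy data: 12 recordings from three made-up sources.
    sources = ["A"] * 5 + ["B"] * 4 + ["C"] * 3

    for train, test in k_fold_splits(len(sources), k=3):
        # Under random splitting, test sources usually also occur in training.
        shared = {sources[i] for i in test} & {sources[i] for i in train}
        print("K-fold: test sources also seen in training:", sorted(shared))

    for held_out, train, test in leave_source_out_splits(sources):
        # The held-out source contributes no recordings to the training set.
        print(f"held out {held_out}: train={len(train)}, test={len(test)}")
```

In the K-fold case each test fold generally shares sources with its training folds, so the resulting estimate reflects performance on already-seen sources. In the leave-source-out case each test set comes from a source the model never saw, which matches the deployment scenario described in the abstract, at the cost of having only as many folds as there are sources.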