Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data (2402.00205v2)
Abstract: Machine Learning (ML) has demonstrated great potential in medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. Sharing data across healthcare institutions is challenging because of complex and varying privacy and regulatory requirements. Hence, it is hard but crucial to allow multiple parties to collaboratively train an ML model that leverages the private datasets available at each party, without directly sharing those datasets or compromising their privacy through the collaboration. In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). It offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets; (2) it safeguards patient privacy by limiting the potential privacy leakage arising from any content shared across the parties during training; and (3) it facilitates ML model training without relying on a centralized server. We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. We demonstrate that ML models trained with the DeCaPH framework have an improved utility-privacy trade-off: they achieve good performance while preserving the privacy of the training data points. In addition, ML models trained with the DeCaPH framework generally outperform those trained solely on the private dataset of an individual party, showing that DeCaPH enhances model generalizability.