Emergent Mind

Abstract

Machine learning (ML) has demonstrated great potential in medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. However, sharing data across healthcare institutions is challenging because of complex and varying privacy and regulatory requirements. It is therefore difficult but crucial to allow multiple parties to collaboratively train an ML model that leverages the private dataset available at each party, without directly sharing those datasets or compromising their privacy through the collaboration. In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). It offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets; (2) it safeguards patient privacy by limiting the potential privacy leakage arising from any content shared across the parties during training; and (3) it facilitates ML model training without relying on a centralized server. We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. We show that models trained with the DeCaPH framework achieve an improved utility-privacy trade-off, performing well while preserving the privacy of the training data points. In addition, models trained with DeCaPH generally outperform those trained solely on the private dataset of an individual party, showing that DeCaPH enhances model generalizability.

DeCaPH also enhances model robustness against membership inference attacks compared to non-private models trained with federated learning (FL).

Overview

  • DeCaPH presents a framework for training machine learning models across multiple hospitals without direct data sharing, prioritizing data privacy.

  • It leverages principles like decentralization and differential privacy to ensure patient data remains confidential while enhancing model generalizability.

  • Incorporates innovations such as randomized leader selection and secure aggregation to maintain privacy and scalability in model training.

  • Empirical evaluations demonstrate DeCaPH's capability to produce more accurate and generalizable models, with only minor performance trade-offs in exchange for privacy guarantees, across diverse healthcare datasets.

Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH): Enhancing Model Generalizability without Compromising Data Privacy

Introduction to DeCaPH

The emergence of collaborative ML models in healthcare research signifies a pivotal shift towards leveraging diverse and voluminous datasets to enhance model accuracy and generalizability. However, the fundamental challenge lies in harmonizing the benefits of collaborative learning with the stringent demands of data privacy and regulatory compliance across different healthcare institutions. Addressing this challenge, the paper presents the Decentralized, Collaborative, and Privacy-preserving Machine Learning framework for Multi-Hospital Data (DeCaPH), designed to enable collaborative ML training across multiple institutions without necessitating direct data sharing or infringing upon the privacy of the datasets involved.

Framework Overview

DeCaPH is underpinned by a set of key principles:

  • Decentralization and Privacy Preservation: By circumventing the need for a centralized data repository and integrating differential privacy, DeCaPH ensures that patient data remains confidential and secure against potential privacy breaches.
  • Collaborative Learning with Data Diversification: The framework facilitates ML model training across disparate datasets hosted by different hospitals, enhancing the model's generalizability and performance.
  • Differential Privacy Standardization: DeCaPH adheres to differential privacy (DP), a rigorous standard for privacy protection that bounds how much information about any individual data point can leak during the training process.
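For context, the (ε, δ)-DP guarantee that frameworks like DeCaPH target can be stated with the standard definition (not specific to this paper): for any two datasets D and D′ differing in a single patient record, and any set of outcomes S,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
```

where 𝓜 is the randomized training mechanism. Smaller ε and δ mean any single record has less influence on what the mechanism can reveal.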

Methodological Innovations

DeCaPH incorporates several methodological innovations to achieve its objectives:

  1. Randomized Leader Selection: Ensures a flexible and dynamic coordination mechanism for model updates without a central server, enhancing the framework's robustness and scalability.
  2. Secure Aggregation: Employs cryptographic secure aggregation to merge model updates from different hospitals, so that no individual hospital's update is revealed during aggregation.
  3. Gradient Clipping and Noise Addition: Integrates DP mechanisms into the gradient updates that drive model training, providing theoretical guarantees on the privacy of individual data points.
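A minimal sketch of how steps 2 and 3 can fit together, assuming a DP-SGD-style clipping/noising step and pairwise-mask secure aggregation. All function names, seeds, and constants here are illustrative choices, not the paper's actual protocol:

```python
import numpy as np

def clip_and_noise(per_example_grads, clip_norm=1.0, noise_mult=1.1, rng=None):
    """Step 3: clip each per-example gradient to clip_norm, sum,
    add Gaussian noise scaled to the clipping bound (DP-SGD style)."""
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

def pairwise_mask(party_id, peers, dim, round_seed=42):
    """Step 2: additive masks derived from pairwise shared seeds.
    Each pair's mask enters once with + and once with -, so the masks
    cancel exactly when every party's masked update is summed and the
    aggregator only ever sees the sum, never an individual update."""
    mask = np.zeros(dim)
    for peer in peers:
        lo, hi = min(party_id, peer), max(party_id, peer)
        shared = np.random.default_rng(round_seed + 1000 * lo + hi).normal(size=dim)
        mask += shared if party_id < peer else -shared
    return mask
```

For example, with three hospitals, each sends `update + pairwise_mask(i, others, dim)` to the randomly selected leader (step 1); because the masks sum to zero, the leader recovers only the aggregate of the already-noised updates.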

Empirical Evaluation

The efficacy of DeCaPH was rigorously evaluated on three real-world healthcare tasks spanning electronic health records, single-cell genome classification, and chest radiology image analysis. Comparisons were drawn against models trained locally on each hospital's own dataset and against other collaborative frameworks. DeCaPH-trained models showed superior accuracy and generalizability, with only nominal performance trade-offs for ensuring differential privacy: they exhibited less than a 3.2% drop in performance metrics compared to non-private collaborative models, while achieving up to a 16% decrease in vulnerability to privacy attacks.

Implications and Future Perspectives

The findings underscore the potential of DeCaPH to facilitate large-scale collaborative ML projects within the healthcare domain, offering a pragmatic pathway to leveraging the collective utility of multi-institutional healthcare datasets while simultaneously upholding stringent privacy protections.

The implications extend beyond academic discourse, promising substantial benefits to clinical research and patient care alike. By enabling the development of more accurate and generalizable ML models, DeCaPH can significantly enhance the predictive capabilities in various clinical applications, ranging from disease diagnosis to patient outcome prediction.

Looking ahead, the framework opens avenues for further advancements in decentralized learning protocols, exploring scalable solutions to integrate heterogeneous data sources while navigating the complex tapestry of privacy regulations and ethical considerations in healthcare data utilization. The exploration of vertical data integration, adaptation to various learning paradigms, and enhancement in privacy-preserving mechanisms stand out as promising future research directions.

In conclusion, DeCaPH emerges as a compelling framework that balances the scales between the collaborative utility of diverse healthcare datasets and the imperatives of data privacy, marking a step forward in the pursuit of advanced AI-driven healthcare solutions.
