PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels

Published 31 Mar 2023 in cs.LG, cs.CR, cs.IT, and math.IT | (2304.00047v1)

Abstract: Allowing organizations to share their data for training of ML models without unintended information leakage is an open problem in practice. A promising technique for this still-open problem is to train models on the encoded data. Our approach, called Privately Encoded Open Datasets with Public Labels (PEOPL), uses a certain class of randomly constructed transforms to encode sensitive data. Organizations publish their randomly encoded data and associated raw labels for ML training, where training is done without knowledge of the encoding realization. We investigate several important aspects of this problem: We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user (e.g., adversary) and a faithful user (e.g., model developer) that have access to the published encoded data. We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks. Empirically, we compare the performance of our randomized encoding scheme and a linear scheme to a suite of computational attacks, and we also show that our scheme achieves competitive prediction accuracy to raw-sample baselines. Moreover, we demonstrate that multiple institutions, using independent random encoders, can collaborate to train improved ML models.


Authors (11)

Summary

The paper introduces a randomized encoding scheme that uses deep neural networks to transform sensitive datasets for privacy-preserving machine learning.
It establishes novel information-theoretic privacy and utility scores to rigorously compare the encoding’s performance against traditional linear methods.
Empirical results show that PEOPL maintains competitive predictive performance while enhancing security in multi-institutional collaborative environments.

Overview of "PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels"

The paper "PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels" introduces an approach to sharing data for machine learning model training while limiting unintended information leakage. The work responds to growing demand for data collaboration across organizations, which is often hampered by privacy concerns and by regulatory constraints such as HIPAA and GDPR. The proposed solution, PEOPL, is a key-based randomized encoding framework: each organization transforms its sensitive dataset with a privately chosen random encoder, then publishes the encoded samples together with their raw labels so that others can train models on them.

Key Contributions

Randomized Encoding Scheme: PEOPL employs a class of randomly constructed transforms to encode sensitive datasets. The core idea is to use random deep neural networks as encoding functions, chosen from a distribution, ensuring that the exact transformation is not known during model training.
Privacy and Utility Scores: The paper introduces information-theoretic scores for privacy and utility. They quantify the average performance of an unfaithful user (e.g., an adversary) and of a faithful user (e.g., a model developer), both of whom have access to the published encoded data, which allows encoding schemes to be compared on a common footing.
Empirical Comparisons and Performance: The paper provides empirical evidence showing that the randomized encoding scheme outperforms linear encoding approaches on privacy metrics while maintaining competitive predictive performance relative to models trained on non-encoded data.
Collaborative Learning: PEOPL allows multiple institutions to encode their datasets independently, each with its own random encoder, and still train effective models collaboratively. This is particularly useful in multi-institutional scenarios, where encoded datasets can be pooled to improve model accuracy without any institution publishing its raw data.
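
The encoding and pooling steps described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper `make_random_encoder`, the use of an integer seed as the private key, the ReLU multilayer perceptron, the layer sizes, and the synthetic data are all hypothetical choices made for the example. NumPy's default generator is also not a cryptographic source of randomness, so a real deployment would need a stronger key mechanism.

```python
import numpy as np


def make_random_encoder(seed, in_dim, hidden_dim=64, out_dim=32, depth=3):
    """Build an encoder from a randomly initialized, never-trained MLP.

    The seed stands in for the private key (an assumption of this sketch):
    the same seed reproduces the same weights, a different seed gives an
    unrelated encoder.
    """
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden_dim] * (depth - 1) + [out_dim]
    weights = [
        rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

    def encode(x):
        h = np.asarray(x, dtype=float)
        for w in weights[:-1]:
            h = np.maximum(h @ w, 0.0)  # linear layer followed by ReLU
        return h @ weights[-1]  # final linear layer, no activation

    return encode


# Synthetic stand-ins for two institutions' sensitive data and raw labels.
data_rng = np.random.default_rng(0)
x_a, y_a = data_rng.standard_normal((100, 16)), data_rng.integers(0, 2, size=100)
x_b, y_b = data_rng.standard_normal((80, 16)), data_rng.integers(0, 2, size=80)

# Each institution draws its own encoder and keeps the seed private.
encode_a = make_random_encoder(seed=1234, in_dim=16)
encode_b = make_random_encoder(seed=5678, in_dim=16)

# Only encoded samples and raw labels are published and pooled.
published_x = np.concatenate([encode_a(x_a), encode_b(x_b)])
published_y = np.concatenate([y_a, y_b])
```

A model developer would then train on `published_x` and `published_y` without ever seeing the seeds or the encoder weights. Whether the pooled model also needs to know which institution each sample came from is a design detail this sketch leaves out.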

Theoretical and Practical Implications

Function Composition: The paper shows theoretically that composing families of functions, that is, building deeper networks that interleave linear and non-linear layers, can improve the privacy score of the encoding scheme. This result gives practical guidance for constructing encoding schemes that are less susceptible to reconstruction attacks.
Security Analysis: The paper does not claim perfect privacy. Instead, it runs adversarial experiments that test the encoding against a suite of computational attacks and reports better resilience than the linear scheme. It also identifies conditions under which encoded data may remain vulnerable, which underlines the need for careful deployment.
Future Directions: Although the results are promising, the research leaves several avenues open. Future work could explore encoding schemes that adapt to the characteristics of the dataset, hybrid schemes that combine different encoding techniques, and theoretical frameworks that model the trade-offs between privacy, computational overhead, and utility more precisely.
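
One way to see why the Function Composition point stresses non-linear layers is a standard fact from linear algebra: a stack of purely linear layers collapses into a single matrix, so depth alone adds nothing. The toy experiment below is a sketch under assumptions and is not the paper's privacy score or one of its attacks. It imagines a hypothetical adversary who holds matched raw and encoded samples and fits the best single linear map by least squares. The dimensions, the Gaussian data, and the helper `linear_fit_residual` are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w1 = rng.standard_normal((d, d))
w2 = rng.standard_normal((d, d))
x = rng.standard_normal((200, d))  # raw samples known to the toy adversary

# Two-layer encoders: purely linear, and linear with a ReLU in between.
linear_codes = x @ w1 @ w2
relu_codes = np.maximum(x @ w1, 0.0) @ w2


def linear_fit_residual(inputs, codes):
    """Relative error of the best least-squares linear map inputs -> codes."""
    m, *_ = np.linalg.lstsq(inputs, codes, rcond=None)
    return np.linalg.norm(inputs @ m - codes) / np.linalg.norm(codes)


linear_err = linear_fit_residual(x, linear_codes)  # numerically zero
relu_err = linear_fit_residual(x, relu_codes)      # clearly non-zero

print(f"linear stack: {linear_err:.2e}, with ReLU: {relu_err:.2f}")
```

The linear stack is fit essentially exactly, because `x @ w1 @ w2` equals `x @ (w1 @ w2)`. Once a ReLU sits between the layers, no single matrix reproduces the encoder. This only shows that the simplest linear attack stops working; it says nothing about stronger attacks, which is what the paper's scores and experiments are meant to assess.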

The introduction of PEOPL and its systematic evaluation mark a step towards practical, privacy-preserving data sharing in machine learning, with use cases ranging from sensitive healthcare data to corporate datasets. The work encourages further study of non-linear, randomized encoding networks in practical settings and of the secure collaboration they could enable.
