
PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

(2406.02958)
Published Jun 5, 2024 in cs.LG, cs.AI, cs.CL, cs.CR, and cs.DC

Abstract

On-device training is currently the most common approach for training ML models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($\epsilon=1.29$, $\epsilon=7.58$). We achieve these results while using 9$\times$ fewer rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves LLM performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.

Figure: High-level overview of PrE-Text's two-phase DP synthetic seed collection and expansion process.

Overview

  • PrE-Text introduces a novel methodology for generating differentially private (DP) synthetic textual data to address the limitations of on-device training in Federated Learning (FL).

  • The methodology involves an iterative refinement process utilizing LLMs and achieves substantial reductions in communication rounds, client computation per round, and communication costs.

  • Extensive experiments show PrE-Text's effectiveness in both small and large model settings, highlighting significant improvements in model accuracy and utility while preserving privacy.


PrE-Text presents a methodology for generating differentially private (DP) synthetic textual data, aiming to overcome the limitations of on-device training in federated learning (FL). The proposed method substantially mitigates the main drawbacks of on-device training: devices too small to train large models, heavy communication and computation costs, and difficulty of debugging and deployment.

PrE-Text (Private Evolution-Text) builds on recent algorithmic advances in DP synthetic data generation. Under practical privacy constraints, small models trained on its synthetic data outperform the same models trained directly on private data on user devices, while using up to 9$\times$ fewer communication rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round.

Contributions and Algorithm Design

PrE-Text builds on the principles of Differential Privacy to provide a robust and efficient mechanism for generating synthetic textual datasets:

  1. Differentially Private (DP) Synthetic Text Generation: PrE-Text starts from an initial set of public seed samples and iteratively refines them, guided by privately aggregated signals from user data indicating which synthetic samples best match the private distribution. The variation step is text-specific: masked language models rewrite the selected samples to produce the next generation of candidates.
  2. Expand Phase: Crucially, PrE-Text then expands the final DP synthetic set using LLMs such as LLaMA-2-7B, leveraging their generative capabilities without incurring additional privacy cost thanks to DP's post-processing property. A minimal sketch of the refinement loop follows this list.
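
To make the refinement loop concrete, here is a minimal, self-contained Python sketch of a Private Evolution-style round. It is a simplified illustration under stated assumptions, not the paper's implementation: the `embed()` and `vary()` helpers are hypothetical stand-ins (a hashed bag-of-words encoder and a random word-drop rewriter) for the sentence encoder and masked-language-model variation used in practice, and the noise scale is not calibrated to any particular $\epsilon$. The expand phase (prompting an LLM such as LLaMA-2-7B on the refined samples) is omitted; by DP's post-processing property it adds no privacy cost.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):
    """Hypothetical stand-in embedding: hashed bag-of-words vectors.
    A real implementation would use a pretrained sentence encoder."""
    dim = 64
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            vecs[i, hash(w) % dim] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)

def vary(text):
    """Hypothetical stand-in variation: drop one random word.
    A real implementation would mask tokens and refill them with a masked LM."""
    words = text.split()
    if len(words) > 1:
        words.pop(int(rng.integers(len(words))))
    return " ".join(words)

def dp_nn_histogram(private_texts, synthetic_texts, noise_scale=1.0):
    """Each private sample votes for its nearest synthetic sample; Gaussian
    noise makes the released vote histogram differentially private."""
    priv, syn = embed(private_texts), embed(synthetic_texts)
    nearest = np.argmax(priv @ syn.T, axis=1)  # cosine similarity (unit vectors)
    votes = np.bincount(nearest, minlength=len(synthetic_texts)).astype(float)
    votes += rng.normal(scale=noise_scale, size=votes.shape)  # DP noise
    return np.clip(votes, 0.0, None)

def refinement_round(private_texts, synthetic_texts, n_keep=4):
    """One round: private voting, keep the top-voted synthetic samples,
    then generate new candidates by varying the survivors."""
    votes = dp_nn_histogram(private_texts, synthetic_texts)
    keep = [synthetic_texts[i] for i in np.argsort(-votes)[:n_keep]]
    return keep + [vary(t) for t in keep]

# Toy usage: public seeds drift toward the private distribution without
# any private text ever being released directly.
private_texts = ["the meeting is moved to friday", "please review the pull request"]
synthetic_texts = ["the meeting starts monday", "ship the release today",
                   "review my latest code", "lunch at noon tomorrow"]
for _ in range(3):
    synthetic_texts = refinement_round(private_texts, synthetic_texts)
print(synthetic_texts)
```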

Experimental Performance

The paper provides empirical evidence through extensive experimentation across various datasets—Jobs, Forums, Microblog, and Code:

  • Small Models on-device: For smaller models deployable on client devices (e.g., DistilGPT2), PrE-Text synthetic data allowed these models to achieve higher accuracy and lower cross-entropy loss than models trained with traditional DP-FL methods such as DP-FedAvg and DP-FTRL. For example, PrE-Text outperformed other DP training methods at $\epsilon=1.29$ with accuracy improvements ranging from approximately 1.3% to 3.8% across the datasets.
  • Large Models on-server: For models too large to fit on user devices, fine-tuning on PrE-Text synthetic data yielded significant improvements. LLaMA-2-7B showed notable gains in next-token prediction accuracy and cross-entropy loss, improving utility markedly over the non-finetuned baseline (a minimal fine-tuning sketch appears after this list).
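
The sketch below shows how a model can be fine-tuned on the released DP synthetic text. It is a generic Hugging Face causal-LM recipe offered as an illustration, not the paper's training script; the model name (distilgpt2), the toy synthetic sentences, and all hyperparameters are illustrative assumptions. Because the synthetic data already satisfies DP, this server-side fine-tuning incurs no additional privacy cost.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import Dataset

# Placeholder for the DP synthetic corpus produced by PrE-Text.
synthetic_texts = ["example synthetic sentence one", "example synthetic sentence two"]

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style models have no pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = Dataset.from_dict({"text": synthetic_texts}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pretext-distilgpt2",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# The fine-tuned small model can then be shipped to devices.
```

The same recipe applies, with a larger base checkpoint, to server-hosted models such as LLaMA-2-7B.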

The results underscore PrE-Text's superior performance in both small and large model settings while preserving privacy. The efficiency improvements in communication and computation further highlight its practical advantages.

Implications and Future Directions

The implications of PrE-Text are twofold:

  1. Practical Utility in Privacy-Preserving Technologies: By substantially reducing the communication and computational burden, PrE-Text makes the deployment of privacy-preserving language models more feasible in real-world applications, such as mobile assistants and personalized education platforms.
  2. Future Development in DP Data Generation: The iterative and expansion-based approach sets the stage for future research on synthetic data generation, not limited to text but potentially adaptable to other data modalities like images and structured data. Improving the variation and expansion mechanisms further could yield even higher fidelity synthetic datasets.

Speculation on Future Developments

Future work in this area might focus on several promising directions:

  • Advanced Variation Techniques: Integrating more sophisticated text generation methods, such as using the full generative capabilities of modern transformers in the variation phase.
  • Power-Efficient Federated Learning: Enhancing the computational efficiency of the client devices could enable more frequent updates and dynamic adaptation of training protocols.
  • Combining Synthetic Data with Real Data: Investigating hybrid approaches that combine DP synthetic datasets with carefully aggregated real data could provide even more powerful models without significant privacy trade-offs.

The PrE-Text methodology represents a significant step forward for privacy-preserving AI, potentially influencing the design and deployment of next-generation user-centric applications while upholding stringent privacy guarantees.
