
PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

(2406.02958)
Published Jun 5, 2024 in cs.LG, cs.AI, cs.CL, cs.CR, and cs.DC

Abstract

On-device training is currently the most common approach for training ML models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($\epsilon=1.29$, $\epsilon=7.58$). We achieve these results while using 9$\times$ fewer rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves LLM performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.

Figure: High-level overview of PrE-Text's two-phase DP synthetic seed collection and expansion process.

Overview

  • PrE-Text introduces a novel methodology for generating differentially private (DP) synthetic textual data to address the limitations of on-device training in Federated Learning (FL).

  • The methodology involves an iterative refinement process utilizing LLMs and achieves substantial reductions in communication rounds, client computation per round, and communication costs.

  • Extensive experiments show PrE-Text's effectiveness in both small and large model settings, highlighting significant improvements in model accuracy and utility while preserving privacy.


PrE-Text presents a methodology for generating differentially private (DP) synthetic textual data, aiming to overcome the limitations of on-device training in federated learning (FL). The proposed method substantially mitigates the main drawbacks of on-device training: devices too small to train large models, heavy communication and computation costs, and difficulty of debugging and deployment.

PrE-Text (Private Evolution-Text) builds on recent algorithmic advances in DP synthetic data generation. Under practical privacy constraints, small models trained on its synthetic data outperform the same models trained directly on private data on user devices, while using up to 9$\times$ fewer communication rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round.

Contributions and Algorithm Design

PrE-Text builds on the principles of Differential Privacy to provide a robust and efficient mechanism for generating synthetic textual datasets:

  1. Differentially Private (DP) Synthetic Text Generation: PrE-Text starts from an initial set of public seed samples and iteratively refines them, guided by privately aggregated signals from user data indicating which synthetic samples best match the private distribution. The variation step is text-specific: masked language models rewrite the selected samples to produce the next generation of candidates.
  2. Expand Phase: Crucially, PrE-Text then expands the final DP synthetic set using LLMs such as LLaMA-2-7B, leveraging their generative capabilities without incurring additional privacy cost thanks to DP's post-processing property. A minimal sketch of the refinement loop follows this list.
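
To make the refinement loop concrete, here is a minimal, self-contained Python sketch of a Private Evolution-style round. It is a simplified illustration under stated assumptions, not the paper's implementation: the `embed()` and `vary()` helpers are hypothetical stand-ins (a hashed bag-of-words encoder and a random word-drop rewriter) for the sentence encoder and masked-language-model variation used in practice, and the noise scale is not calibrated to any particular $\epsilon$. The expand phase (prompting an LLM such as LLaMA-2-7B on the refined samples) is omitted; by DP's post-processing property it adds no privacy cost.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):
    """Hypothetical stand-in embedding: hashed bag-of-words vectors.
    A real implementation would use a pretrained sentence encoder."""
    dim = 64
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            vecs[i, hash(w) % dim] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)

def vary(text):
    """Hypothetical stand-in variation: drop one random word.
    A real implementation would mask tokens and refill them with a masked LM."""
    words = text.split()
    if len(words) > 1:
        words.pop(int(rng.integers(len(words))))
    return " ".join(words)

def dp_nn_histogram(private_texts, synthetic_texts, noise_scale=1.0):
    """Each private sample votes for its nearest synthetic sample; Gaussian
    noise makes the released vote histogram differentially private."""
    priv, syn = embed(private_texts), embed(synthetic_texts)
    nearest = np.argmax(priv @ syn.T, axis=1)  # cosine similarity (unit vectors)
    votes = np.bincount(nearest, minlength=len(synthetic_texts)).astype(float)
    votes += rng.normal(scale=noise_scale, size=votes.shape)  # DP noise
    return np.clip(votes, 0.0, None)

def refinement_round(private_texts, synthetic_texts, n_keep=4):
    """One round: private voting, keep the top-voted synthetic samples,
    then generate new candidates by varying the survivors."""
    votes = dp_nn_histogram(private_texts, synthetic_texts)
    keep = [synthetic_texts[i] for i in np.argsort(-votes)[:n_keep]]
    return keep + [vary(t) for t in keep]

# Toy usage: public seeds drift toward the private distribution without
# any private text ever being released directly.
private_texts = ["the meeting is moved to friday", "please review the pull request"]
synthetic_texts = ["the meeting starts monday", "ship the release today",
                   "review my latest code", "lunch at noon tomorrow"]
for _ in range(3):
    synthetic_texts = refinement_round(private_texts, synthetic_texts)
print(synthetic_texts)
```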

Experimental Performance

The paper provides empirical evidence through extensive experimentation across various datasets—Jobs, Forums, Microblog, and Code:

  • Small Models on-device: For smaller models deployable on client devices (e.g., DistilGPT2), PrE-Text synthetic data allowed these models to achieve higher accuracy and lower cross-entropy loss than models trained with traditional DP-FL methods such as DP-FedAvg and DP-FTRL. For example, PrE-Text outperformed other DP training methods at $\epsilon=1.29$ with accuracy improvements ranging from approximately 1.3% to 3.8% across the datasets.
  • Large Models on-server: For models too large to fit on user devices, fine-tuning on PrE-Text synthetic data yielded significant improvements. LLaMA-2-7B showed notable gains in next-token prediction accuracy and cross-entropy loss, improving utility markedly over the non-finetuned baseline (a minimal fine-tuning sketch appears after this list).
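
The sketch below shows how a model can be fine-tuned on the released DP synthetic text. It is a generic Hugging Face causal-LM recipe offered as an illustration, not the paper's training script; the model name (distilgpt2), the toy synthetic sentences, and all hyperparameters are illustrative assumptions. Because the synthetic data already satisfies DP, this server-side fine-tuning incurs no additional privacy cost.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import Dataset

# Placeholder for the DP synthetic corpus produced by PrE-Text.
synthetic_texts = ["example synthetic sentence one", "example synthetic sentence two"]

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style models have no pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = Dataset.from_dict({"text": synthetic_texts}).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pretext-distilgpt2",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# The fine-tuned small model can then be shipped to devices.
```

The same recipe applies, with a larger base checkpoint, to server-hosted models such as LLaMA-2-7B.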

The results underscore PrE-Text's superior performance in both small and large model settings while preserving privacy. The efficiency improvements in communication and computation further highlight its practical advantages.

Implications and Future Directions

The implications of PrE-Text are twofold:

  1. Practical Utility in Privacy-Preserving Technologies: By substantially reducing the communication and computational burden, PrE-Text makes the deployment of privacy-preserving language models more feasible in real-world applications, such as mobile assistants and personalized education platforms.
  2. Future Development in DP Data Generation: The iterative and expansion-based approach sets the stage for future research on synthetic data generation, not limited to text but potentially adaptable to other data modalities like images and structured data. Improving the variation and expansion mechanisms further could yield even higher fidelity synthetic datasets.

Speculation on Future Developments

Future work in this area might focus on several promising directions:

  • Advanced Variation Techniques: Integrating more sophisticated text generation methods, such as using the full generative capabilities of modern transformers in the variation phase.
  • Power-Efficient Federated Learning: Enhancing the computational efficiency of the client devices could enable more frequent updates and dynamic adaptation of training protocols.
  • Combining Synthetic Data with Real Data: Investigating hybrid approaches that combine DP synthetic datasets with carefully aggregated real data could provide even more powerful models without significant privacy trade-offs.

The PrE-Text methodology represents a significant step forward for privacy-preserving AI, potentially influencing the design and deployment of next-generation user-centric applications while upholding stringent privacy guarantees.
