Large Language Models Can Be Strong Differentially Private Learners (2110.05679v6)

Published 12 Oct 2021 in cs.LG and cs.CL

Abstract: Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained LLMs; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives which are aligned with the pretraining procedure. With the above, we obtain NLP models that outperform state-of-the-art DP-trained models under the same privacy budget and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension) empirical results reveal that private learning with pretrained LLMs doesn't tend to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https://github.com/lxuechen/private-transformers.

Citations (323)

Summary

  • The paper demonstrates that fine-tuning large pretrained language models with DP-SGD and ghost clipping effectively mitigates performance drops in differentially private training.
  • The research introduces a novel memory-efficient technique that avoids per-example gradient instantiation, reducing resource overhead during DP optimization.
  • Empirical results reveal that DP-trained LLMs match or surpass non-private baselines, paving the way for practical, privacy-preserving NLP applications.

LLMs as Strong Differentially Private Learners

The paper "LLMs Can Be Strong Differentially Private Learners" by Li et al. investigates the application of Differential Privacy (DP) to LLMs within NLP tasks. While DP is a recognized framework for privacy in machine learning, successfully applying it to high-dimensional, parameter-heavy transformers typically results in significant performance degradation. This paper identifies and addresses key challenges within this domain, demonstrating that LLMs can achieve competitive performance while ensuring strong privacy guarantees via Differentially Private Stochastic Gradient Descent (DP-SGD).

Key Contributions and Findings

  1. Mitigating Performance Drop in DP-LLMs:
    • The work shows that the performance drop observed in differentially private training can be effectively mitigated through the use of large pretrained LLMs, hyperparameters suited to DP optimization, and fine-tuning objectives aligned with the pretraining procedure.
  2. Memory-Efficient DP-Optimized Transformers:
    • The authors present a novel memory-saving technique termed "ghost clipping," which lets DP-SGD perform per-example gradient clipping without instantiating per-example gradients for any linear layer in the model. This allows private training of large Transformers at nearly the same memory cost as non-private training, with a modest runtime overhead; a minimal sketch of the underlying norm computation appears after this list.
  3. Empirical Results Against Established Baselines:
    • Contrary to the belief that DP optimization falters for high-dimensional models (because the added noise scales with the number of parameters), the paper finds that pretrained models, when fine-tuned with the proposed methods, match or surpass strong non-private baselines and improve over models trained under heuristic notions of privacy. This is demonstrated across multiple NLP tasks, including sentence classification and language generation.
  4. Analysis of Gradient Update Dimensionality:
    • The analysis indicates that the previously assumed dimensionality issues in gradient updates do not severely impact DP fine-tuning. Larger pretrained models tend to perform better, and parameter-efficient methods with reduced update dimensionality do not consistently outperform full fine-tuning.
  5. Encouragement for Practical Deployment:
    • By presenting an effective DP strategy for LLMs, the research opens pathways for developing private NLP applications feasible on smaller datasets, aligning privacy goals with practical deployment.
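
For a linear layer, the key identity behind ghost clipping is that the per-example weight gradient is a[i]ᵀ g[i], where a[i] are the layer's inputs and g[i] the gradients with respect to its outputs; its squared Frobenius norm can therefore be computed from two small (T, T) Gram matrices without ever forming the (d, p) gradient itself. The snippet below is a minimal sketch of that norm computation only (function and variable names are illustrative, not the paper's code); the full procedure additionally reuses these norms in a second backward pass to apply the clipping.

```python
import torch

def per_example_grad_sq_norms(a, g):
    """Squared per-example gradient norms of a linear layer's weight,
    computed without materializing any per-example gradient.

    a: layer inputs,               shape (B, T, d)
    g: grads w.r.t. layer outputs, shape (B, T, p)

    The per-example weight gradient is a[i].T @ g[i]; its squared
    Frobenius norm equals the elementwise inner product of the Gram
    matrices a[i] @ a[i].T and g[i] @ g[i].T, each only (T, T).
    """
    aa = torch.einsum("bti,bsi->bts", a, a)  # (B, T, T)
    gg = torch.einsum("btj,bsj->bts", g, g)  # (B, T, T)
    return (aa * gg).sum(dim=(1, 2))         # (B,)

# Sanity check against explicitly materialized per-example gradients.
B, T, d, p = 4, 16, 32, 8
a, g = torch.randn(B, T, d), torch.randn(B, T, p)
explicit = torch.stack([a[i].T @ g[i] for i in range(B)])  # (B, d, p)
assert torch.allclose(per_example_grad_sq_norms(a, g),
                      explicit.flatten(1).pow(2).sum(dim=1), rtol=1e-3)
```

Materializing per-example gradients for such a layer would take O(B·d·p) extra memory, whereas the Gram-matrix route takes O(B·T²), which is small for the moderate sequence lengths typical of fine-tuning.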

Implications and Future Directions

  • Broader Utilization of LLMs with Privacy:

The findings suggest that practitioners can leverage large pretrained models for privacy-preserving applications, using transfer learning and fine-tuning to reduce dependence on large private datasets while meeting stringent privacy requirements.

  • Fine-Tuning Hyperparameters:

Future work could further refine hyperparameter settings, such as weight decay and learning rate schedules, to balance utility, computation, and privacy costs across diverse NLP tasks.

  • Creating Curated Public Datasets:

The paper acknowledges concerns with current pretraining datasets. Efforts should be directed at curating datasets that respect privacy from the collection phase, enhancing the trust in pretrained model repositories.

  • Exploration of Scaling Laws:

Building on the insights from this work, exploring scaling laws specific to DP deep learning could offer robust guidelines for trading off model size, compute budget, and privacy levels.

The research provides a significant step toward integrating robust privacy guarantees in the growing deployment of LLMs in sensitive applications, setting the stage for more widespread and responsible use of artificial intelligence in consumer-centric services.
