LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification (2402.16515v1)
Abstract: As sufficient data are not always publicly accessible for model training, researchers either exploit limited data with advanced learning algorithms or expand the dataset via data augmentation (DA). Conducting DA in private domains requires privacy protection approaches (e.g., anonymization and perturbation), but these methods cannot provide protection guarantees. Differential privacy (DP) learning methods theoretically bound the protection but are not well suited to generating pseudo text samples with large models. In this paper, we transfer the DP-based pseudo-sample generation task to a DP-based generated-sample discrimination task, where we propose a DP-based DA method with an LLM and a DP-based discriminator for text classification on private domains. We construct a knowledge distillation model as the DP-based discriminator: teacher models, which access private data, teach students how to select private samples, with calibrated noise added to achieve DP. To constrain the distribution of DA's generation, we propose a DP-based tutor that models the noised private distribution and controls sample generation at a low privacy cost. We theoretically analyze our model's privacy protection and empirically verify our model.
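The teacher-student discrimination step described above resembles PATE-style noisy label aggregation: each teacher votes on whether a generated sample fits the private data, and calibrated noise on the vote counts provides the DP guarantee. The sketch below is an illustrative assumption, not the paper's implementation; the function name `noisy_teacher_vote`, the Gaussian noise mechanism, and the noise scale `sigma` are all placeholders.

```python
import numpy as np

def noisy_teacher_vote(teacher_votes, sigma, rng=None):
    """Aggregate per-teacher class votes with Gaussian noise (PATE-style).

    teacher_votes: 1-D integer array, one class label per teacher.
    sigma: noise scale, calibrated offline to the target (epsilon, delta) budget.
    Returns the class with the highest noisy vote count.
    """
    rng = rng or np.random.default_rng()
    num_classes = int(teacher_votes.max()) + 1
    # Histogram of teacher votes over the candidate classes.
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    # Calibrated Gaussian noise masks any single teacher's contribution.
    noisy_counts = counts + rng.normal(0.0, sigma, size=num_classes)
    return int(np.argmax(noisy_counts))

# Hypothetical usage: label an LLM-generated candidate sample by noisy consensus.
votes = np.array([1, 1, 0, 1])       # four teachers, two classes
label = noisy_teacher_vote(votes, sigma=1.0)
```

A larger `sigma` strengthens the privacy guarantee but makes the aggregated label less reliable, which is the trade-off the calibration step manages.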