Learnable Privacy Neurons Localization in Language Models (2405.10989v1)
Abstract: Concerns about LLMs memorizing and disclosing private information, particularly Personally Identifiable Information (PII), have become prominent within the community. Many efforts have been made to mitigate these privacy risks. However, the mechanism through which LLMs memorize PII remains poorly understood. To bridge this gap, we introduce a pioneering method for pinpointing PII-sensitive neurons (privacy neurons) within LLMs. Our method employs learnable binary weight masks to localize, through adversarial training, the specific neurons that account for the memorization of PII in LLMs. Our investigation finds that PII is memorized by a small subset of neurons across all layers, and that these neurons exhibit PII specificity. Furthermore, we validate the potential for PII risk mitigation by deactivating the localized privacy neurons. Both quantitative and qualitative experiments demonstrate the effectiveness of our neuron localization algorithm.
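The abstract describes the localization procedure only at a high level. Below is a minimal, illustrative sketch of one plausible reading of it: a learnable, approximately binary mask over hidden neurons, trained adversarially so that gating the selected neurons raises the language-modeling loss on PII sequences while keeping the loss on general text low, under a sparsity penalty. The hard-concrete-style gate relaxation, the toy model `ToyMaskedLM`, the helper `localization_step`, and all hyperparameters are assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch (not the authors' implementation) of adversarially
# training a learnable ~binary neuron mask to localize "privacy neurons".
import torch
import torch.nn as nn
import torch.nn.functional as F


class HardConcreteMask(nn.Module):
    """One learnable, approximately binary gate per hidden neuron (assumed relaxation)."""

    def __init__(self, num_neurons: int, temperature: float = 0.5):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(num_neurons))  # gate logits
        self.temperature = temperature

    def forward(self) -> torch.Tensor:
        if self.training:  # stochastic relaxation while learning the mask
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            logits = torch.log(u) - torch.log(1 - u) + self.log_alpha
            return torch.sigmoid(logits / self.temperature)
        return (self.log_alpha > 0).float()  # hard 0/1 gates at evaluation

    def expected_open(self) -> torch.Tensor:
        return torch.sigmoid(self.log_alpha).sum()  # sparsity surrogate


class ToyMaskedLM(nn.Module):
    """Tiny stand-in language model whose single hidden layer can be gated; hypothetical."""

    def __init__(self, vocab_size: int = 100, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.ff = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def lm_loss(self, tokens: torch.Tensor, neuron_mask: torch.Tensor) -> torch.Tensor:
        h = F.gelu(self.ff(self.embed(tokens[:, :-1]))) * neuron_mask  # gate neurons
        logits = self.out(h)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))


def localization_step(model, mask, optimizer, pii_batch, general_batch,
                      lambda_general=1.0, lambda_sparse=1e-3):
    """One adversarial update of the mask (model weights stay frozen; the
    optimizer should be built over the mask parameters only): with the mask
    applied, maximize loss on PII text, preserve loss on general text, and
    keep the number of open gates small. Loss weights are illustrative."""
    m = mask()
    loss = (-model.lm_loss(pii_batch, m)
            + lambda_general * model.lm_loss(general_batch, m)
            + lambda_sparse * mask.expected_open())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under these assumptions, neurons whose gates converge toward zero are the candidate privacy neurons; deactivating them (zeroing their activations at inference time) corresponds to the mitigation probe described in the abstract, which should suppress PII recall while leaving general performance largely intact.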