Eight Methods to Evaluate Robust Unlearning in LLMs (2402.16835v1)
Abstract: Machine unlearning can be useful for removing harmful capabilities and memorized text from LLMs, but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). While WHP's unlearning generalizes well when evaluated with the "Familiarity" metric from Eldan and Russinovich, we find i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) WHP performs on par with the original model on Harry Potter Q&A tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. Overall, our results highlight the importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.
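The abstract describes a comparative evaluation setup: the unlearned "Who's Harry Potter" (WHP) model is scored against the original model on knowledge extraction, Q&A, and latent-knowledge probes. As a minimal, illustrative sketch of that kind of comparison (not the paper's evaluation code), the snippet below measures how much log-probability a baseline model and an unlearned model each assign to a Harry-Potter-specific completion; the Hugging Face model IDs, prompt, and target string are assumptions chosen for illustration.

```python
# Minimal sketch of a baseline-vs-unlearned comparison, assuming the model IDs below
# exist on Hugging Face; swap in whatever baseline/unlearned checkpoints you actually use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/Llama-2-7b-chat-hf"         # assumed baseline checkpoint
WHP_ID = "microsoft/Llama2-7b-WhoIsHarryPotter"   # assumed unlearned (WHP) checkpoint

PROMPT = "Harry Potter's two best friends are"
TARGET = " Ron Weasley and Hermione Granger."

def target_logprob(model_id: str, prompt: str, target: str) -> float:
    """Sum of log-probabilities the model assigns to `target` given `prompt`."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    model.eval()
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tok(prompt + target, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Distribution at position i predicts the token at position i + 1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Score only the continuation tokens; the prompt/target boundary is approximate
    # for tokenizers that merge tokens across the split.
    n_prompt = prompt_ids.shape[1]
    target_ids = full_ids[0, n_prompt:]
    target_lp = log_probs[0, n_prompt - 1 :, :].gather(-1, target_ids.unsqueeze(-1))
    return target_lp.sum().item()

if __name__ == "__main__":
    for name, model_id in [("baseline", BASE_ID), ("WHP", WHP_ID)]:
        print(f"{name}: log p(target | prompt) = {target_logprob(model_id, PROMPT, TARGET):.2f}")
```

A large gap between the two scores (lower for WHP) is what a familiarity-style metric would suggest; the paper's point is that such a single metric can look favorable even when other probes (extraction prompts, Q&A, latent-knowledge probing) still recover the supposedly unlearned knowledge.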
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.
- Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pp. 141–159. IEEE, 2021.
- Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
- Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy (SP), pp. 463–480. IEEE, 2015.
- Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
- Unlearn what you want to forget: Efficient unlearning for LLMs. arXiv preprint arXiv:2310.20150, 2023.
- Continual pre-training mitigates forgetting in language and vision. arXiv preprint arXiv:2205.09357, 2022.
- Who's Harry Potter? Approximate unlearning in LLMs. arXiv preprint arXiv:2310.02238, 2023.
- Coercing LLMs to do and reveal (almost) anything, 2024.
- Corrective machine unlearning, 2024.
- Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
- Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023.
- Self-destructing models: Increasing the costs of harmful dual uses of foundation models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pp. 287–296, 2023.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.
- Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
- Knowledge sanitization of large language models. arXiv preprint arXiv:2309.11852, 2023.
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. arXiv preprint arXiv:2311.12786, 2023.
- Knowledge unlearning for mitigating privacy risks in language models. arXiv preprint arXiv:2210.01504, 2022.
- Linear connectivity reveals generalization strategies. arXiv preprint arXiv:2205.12411, 2022.
- Copyright violations and large language models. arXiv preprint arXiv:2310.13771, 2023.
- Understanding catastrophic forgetting in language models via implicit inference. arXiv preprint arXiv:2309.10105, 2023.
- Privacy adhering machine un-learning in NLP. arXiv preprint arXiv:2212.09573, 2022.
- A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity. arXiv preprint arXiv:2401.01967, 2024.
- LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B. arXiv preprint arXiv:2310.20624, 2023.
- Technical report for ICCV 2021 challenge SSLAD-Track3B: Transformers are better continual learners. arXiv preprint arXiv:2201.04924, 2022.
- Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? arXiv preprint arXiv:2312.03729, 2023a.
- Rethinking machine unlearning for large language models, 2024a.
- Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023b.
- Towards safer large language models through machine unlearning, 2024b.
- Large language models relearn removed concepts. arXiv preprint arXiv:2401.01814, 2024.
- Investigating bias representations in Llama 2 Chat via activation steering, 2024.
- Quark: Controllable text generation with reinforced unlearning. Advances in Neural Information Processing Systems, 35:27591–27609, 2022.
- Mechanistic mode connectivity. In International Conference on Machine Learning, pp. 22965–23004. PMLR, 2023.
- Investigating forgetting in pre-trained representations through continual learning. arXiv preprint arXiv:2305.05968, 2023.
- TOFU: A task of fictitious unlearning for LLMs. arXiv preprint arXiv:2401.06121, 2024.
- A survey of machine unlearning. arXiv preprint arXiv:2209.02299, 2022.
- Can sensitive information be deleted from LLMs? Objectives for defending against extraction attacks. arXiv preprint arXiv:2309.17410, 2023.
- In-context unlearning: Language models as few shot unlearners. arXiv preprint arXiv:2310.07579, 2023.
- Fine-tuning enhances existing mechanisms: A case study on entity tracking, 2024.
- Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.
- Effect of scale on catastrophic forgetting in neural networks. In International Conference on Learning Representations, 2021.
- Tricking LLMs into disobedience: Understanding, analyzing, and preventing jailbreaks. arXiv preprint arXiv:2305.14965, 2023.
- Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023.
- J.K. Rowling. Harry Potter series. Bloomsbury Publishing (UK), Scholastic Press (US), 1997-2007. Series includes: Harry Potter and the Sorcerer's Stone (1997), Harry Potter and the Chamber of Secrets (1998), Harry Potter and the Prisoner of Azkaban (1999), Harry Potter and the Goblet of Fire (2000), Harry Potter and the Order of the Phoenix (2003), Harry Potter and the Half-Blood Prince (2005), and Harry Potter and the Deathly Hallows (2007).
- Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space, 2024.
- Fine-tuned language models are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 6107–6122, 2022.
- Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
- Exploring the landscape of machine unlearning: A comprehensive survey and taxonomy. arXiv preprint arXiv:2305.06360, 2023.
- Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv preprint arXiv:2310.10844, 2023.
- "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
- Detecting pretraining data from large language models. arXiv preprint arXiv:2310.16789, 2023.
- Knowledge unlearning for LLMs: Tasks, methods, and challenges. arXiv preprint arXiv:2311.15766, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
- A language model’s guide through latent space, 2024.
- KGA: A general machine unlearning framework based on knowledge gap alignment. arXiv preprint arXiv:2305.06535, 2023.
- Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483, 2023.
- Assessing the brittleness of safety alignment via pruning and low-rank modifications, 2024.
- DEPN: Detecting and editing privacy neurons in pretrained language models. arXiv preprint arXiv:2310.20138, 2023.
- Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949, 2023.
- Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446, 2023.
- Unlearning bias in language models by partitioning gradients. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 6032–6048, 2023.
- Removing RLHF protections in GPT-4 via fine-tuning. arXiv preprint arXiv:2311.05553, 2023.
- Composing parameter-efficient modules with arithmetic operations. arXiv preprint arXiv:2306.14870, 2023.
- Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023a.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023b.