Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge (2404.05880v2)

Published 8 Apr 2024 in cs.CL

Abstract: Jailbreaking attacks can enable LLMs to bypass their safeguards and generate harmful content. Existing jailbreaking defense methods fail to address the fundamental issue that harmful knowledge resides within the model, leaving LLMs exposed to jailbreak risks. In this paper, we propose a novel defense method called Eraser, which has three main goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment. The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it will no longer be able to answer harmful questions. The training of Eraser does not actually require the model's own harmful knowledge, and it can benefit from unlearning general answers related to harmful queries, which means it does not need assistance from a red team. Experimental results show that Eraser significantly reduces the jailbreaking success rate of various attacks without compromising the general capabilities of the model. Our code is available at https://github.com/ZeroNLP/Eraser.
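
The abstract names three training goals (unlearning harmful knowledge, retaining general knowledge, maintaining safety alignment) but not the exact objective. The sketch below is a minimal illustration, not the paper's released code, of how such a three-term objective could be expressed for a Hugging Face-style causal LM. The gradient-ascent "forget" term, the refusal-based safety term, and the equal loss weights are assumptions made for illustration; the official implementation is in the linked repository.

```python
# Minimal sketch of a three-term unlearning objective in the spirit of the
# abstract. Assumes a Hugging Face-style causal LM whose forward pass accepts
# input_ids / attention_mask / labels and returns an object with a .loss field.
# The specific loss forms and weights below are illustrative assumptions, not
# the paper's formulation.
import torch


def lm_loss(model, batch: dict) -> torch.Tensor:
    """Standard token-level negative log-likelihood returned by the model."""
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
    )
    return out.loss


def eraser_style_loss(model, forget_batch, retain_batch, safety_batch,
                      w_forget: float = 1.0,
                      w_retain: float = 1.0,
                      w_safety: float = 1.0) -> torch.Tensor:
    """Combine the three goals named in the abstract for one training step."""
    # (1) Unlearn harmful knowledge: gradient ascent on answers to harmful
    #     queries, i.e. minimize the *negative* NLL of those answers.
    loss_forget = -lm_loss(model, forget_batch)

    # (2) Retain general knowledge: ordinary LM loss on general instruction data.
    loss_retain = lm_loss(model, retain_batch)

    # (3) Maintain safety alignment: ordinary LM loss on
    #     (harmful prompt, refusal response) pairs.
    loss_safety = lm_loss(model, safety_batch)

    return w_forget * loss_forget + w_retain * loss_retain + w_safety * loss_safety
```

In this sketch the forget term pushes probability mass away from harmful completions, while the retain and safety terms anchor the model on general data and refusal behavior; how these terms are balanced and what data populates each batch are design choices the paper itself specifies.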
