EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models (2403.12171v1)
Abstract: Jailbreak attacks are crucial for identifying and mitigating the security vulnerabilities of LLMs. They are designed to bypass safeguards and elicit prohibited outputs. However, due to significant differences among various jailbreak methods, there is no standard implementation framework available for the community, which limits comprehensive security evaluations. This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against LLMs. It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. This modular framework enables researchers to easily construct attacks from combinations of novel and existing components. So far, EasyJailbreak supports 11 distinct jailbreak methods and facilitates the security validation of a broad spectrum of LLMs. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks. Notably, even advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively. We have released a wealth of resources for researchers, including a web platform, PyPI published package, screencast video, and experimental outputs.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Qwen technical report. arXiv preprint arXiv:2309.16609.
- Defending against alignment-breaking attacks via robustly aligned llm. ArXiv, abs/2309.14348.
- Jailbreaking black box large language models in twenty queries.
- Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715.
- Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474.
- A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.
- Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
- Llm self defense: By self examination, llms know they are being tricked. ArXiv, abs/2308.07308.
- Baseline defenses for adversarial attacks against aligned language models. ArXiv, abs/2309.00614.
- Mistral 7b. arXiv preprint arXiv:2310.06825.
- Open sesame! universal black box jailbreaking of large language models. ArXiv, abs/2309.01446.
- Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197.
- Deepinception: Hypnotize large language model to be jailbreaker.
- Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.
- Codechameleon: Personalized encryption framework for jailbreaking large language models. arXiv preprint arXiv:2402.16717.
- Tree of attacks: Jailbreaking black-box llms automatically.
- Smoothllm: Defending large language models against jailbreaking attacks. ArXiv, abs/2310.03684.
- Fast adversarial attacks on language models in one gpu minute.
- Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In The Twelfth International Conference on Learning Representations.
- "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. ArXiv, abs/2308.03825.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- InternLM Team. 2023. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Jailbroken: How does llm safety training fail?
- Jailbreak and guard aligned language models with only few in-context demonstrations.
- Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949.
- Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. arXiv preprint arXiv:2309.05274.
- Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.
- Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463.
- How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. arXiv preprint arXiv:2401.06373.
- Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
- Universal and transferable adversarial attacks on aligned language models.