A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models (2402.13457v2)
Abstract: LLMs have increasingly become central to generating content with potential societal impacts. Notably, these models have demonstrated capabilities for generating content that could be deemed harmful. To mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. However, the phenomenon of "jailbreaking", where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. This research conducts a comprehensive analysis of existing studies on jailbreaking LLMs and their defense techniques. We meticulously investigate nine attack techniques and seven defense techniques applied across three distinct LLMs: Vicuna, LLama, and GPT-3.5 Turbo. We aim to evaluate the effectiveness of these attack and defense techniques. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. This research highlights the need to concentrate on the security facets of LLMs. Additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into LLM security. We believe these contributions will facilitate the exploration of security measures within this domain.
- Google AI. Google ai palm 2. https://ai.google/discover/palm2/. Accessed: [Insert Access Date Here].
- Automorphic. 2023. Aegis. https://github.com/automorphic-ai/aegis. Accessed: 2024-02-13.
- Defending against alignment-breaking attacks via robustly aligned llm. arXiv preprint arXiv:2309.14348.
- Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
- Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715.
- Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474.
- Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767.
- Analyzing the inherent response tendency of llms: Real-world instructions-driven jailbreak. arXiv preprint arXiv:2312.04127.
- X. fine tuned. 2024. FT-Roberta-LLM: A Fine-Tuned Roberta Large Language Model. https://huggingface.co/zhx123/ftrobertallm/tree/main.
- Gemini. Buy, sell & trade bitcoin & other crypto currencies with gemini’s platform. https://www.gemini.com/eu. Accessed: [Insert Access Date Here].
- Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308.
- Token-level adversarial prompt detection based on perplexity measures and contextual information. arXiv preprint arXiv:2311.11509.
- Catastrophic jailbreak of open-source LLMs via exploiting generation. In The Twelfth International Conference on Learning Representations.
- Hugging Face. 2023a. Meta llama. https://huggingface.co/meta-llama. Accessed: 2024-02-14.
- Hugging Face. 2023b. Vicuna 7b v1.5. https://huggingface.co/lmsys/vicuna-7b-v1.5. Accessed: 2024-02-14.
- Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
- Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733.
- Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705.
- Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446.
- Deepinception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191.
- Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
- Rain: Your language models can align themselves without finetuning. arXiv preprint arXiv:2309.07124.
- Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.
- Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
- LMSYS. 2023. Vicuna 7b v1.5: A chat assistant fine-tuned on sharegpt conversations. https://huggingface.co/lmsys/vicuna-7b-v1.5. Accessed: [Insert access date here].
- Tree of attacks: Jailbreaking black-box llms automatically. arXiv preprint arXiv:2312.02119.
- Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning, pages 26106–26128. PMLR.
- OpenAI. 2023. Moderation guide. https://platform.openai.com/docs/guides/moderation. Accessed: 2024-02-13.
- OpenAI. 2023a. Openai pricing. https://openai.com/pricing. Accessed: 2024-02-14.
- OpenAI. 2023b. Research overview. https://openai.com/research/overview. Accessed: 2024-02-14.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- OWASP. 2023. OWASP Top 10 for LLM Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/.
- Jatmo: Prompt injection defense by task-specific finetuning. arXiv preprint arXiv:2312.17673.
- Bergeron: Combating adversarial attacks through a conscience-based alignment framework. arXiv preprint arXiv:2312.00029.
- ProtectAI. 2023. Llm-guard. https://github.com/protectai/llm-guard. Accessed: 2024-02-13.
- Hijacking large language models via adversarial in-context learning. arXiv preprint arXiv:2311.09948.
- Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
- Adversarial attacks and defenses in large language models: Old and new threats. arXiv preprint arXiv:2310.19737.
- Loft: Local proxy fine-tuning for improving transferability of adversarial attacks against large language model. arXiv preprint arXiv:2310.04445.
- Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492.
- Why do universal adversarial attacks work on large language models?: Geometry might be the answer. arXiv preprint arXiv:2309.00254.
- Opportunities and challenges for chatgpt and large language models in biomedicine and health. Briefings in Bioinformatics, 25(1):bbad493.
- Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Defending chatgpt against jailbreak attack via self-reminder.
- Jailbreaking gpt-4v via self-adversarial attacks with system prompts. arXiv preprint arXiv:2311.09127.
- Cognitive overload: Jailbreaking large language models with overloaded logical thinking. arXiv preprint arXiv:2311.09827.
- Fuzzllm: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. arXiv preprint arXiv:2309.05274.
- Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
- Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446.
- Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
- Defending large language models against jailbreaking attacks through goal prioritization. arXiv preprint arXiv:2311.09096.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.