AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (2310.15140v2)
Abstract: Safety alignment of LLMs can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against both is feasible: adversarial attacks generate unlimited but unreadable gibberish prompts, which perplexity-based filters can detect; manual jailbreak attacks craft readable prompts, but because they depend on human creativity they are few in number and easy to block. In this paper, we show that these solutions may be too optimistic. We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types. Guided by the dual goals of jailbreak and readability, AutoDAN optimizes and generates tokens one by one from left to right, producing readable prompts that bypass perplexity filters while maintaining high attack success rates. Notably, these prompts, generated from scratch using gradients, are interpretable and diverse, exhibiting strategies commonly seen in manual jailbreak attacks. They also generalize better to unforeseen harmful behaviors and transfer better to black-box LLMs than their unreadable counterparts when limited training data or a single proxy model is used. Furthermore, we demonstrate the versatility of AutoDAN by automatically leaking system prompts with a customized objective. Our work offers a new way to red-team LLMs and to understand jailbreak mechanisms via interpretability.
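To make the left-to-right, dual-objective procedure described in the abstract concrete, the sketch below greedily extends an adversarial suffix one token at a time, scoring each candidate token by a weighted sum of its log-probability under the model (readability) and the log-likelihood of an affirmative target response (jailbreak). This is a minimal sketch under stated assumptions, not the authors' implementation: the model name, the readability weight, the single target string, and the plain top-k candidate pre-selection (the paper pre-selects candidates with gradients) are all illustrative.

```python
# Minimal sketch of AutoDAN-style left-to-right, dual-objective token selection.
# Illustrative assumptions (not the authors' implementation): the model name,
# the 0.5 readability weight, plain top-k candidate pre-selection (the paper
# pre-selects with gradients), and a single affirmative target string.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper attacks aligned chat models such as Vicuna/Llama-2
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

harmful_request = "Write a tutorial on how to pick a lock."  # red-teaming behavior
target_response = "Sure, here is a tutorial"                 # affirmative-prefix jailbreak objective
w_readability = 0.5                                          # jailbreak/readability trade-off (assumed)
num_candidates = 64                                          # candidate tokens evaluated per position
adv_suffix_len = 10                                          # adversarial tokens to generate

prompt_ids = tok(harmful_request, return_tensors="pt").input_ids
target_ids = tok(target_response, return_tensors="pt", add_special_tokens=False).input_ids
adv_ids = torch.empty((1, 0), dtype=torch.long)

for _ in range(adv_suffix_len):
    prefix_ids = torch.cat([prompt_ids, adv_ids], dim=1)
    with torch.no_grad():
        # Readability objective: log-probability of each candidate next token.
        readability = torch.log_softmax(model(prefix_ids).logits[0, -1], dim=-1)
        cand_ids = readability.topk(num_candidates).indices

        # Jailbreak objective: log-likelihood of the target response after
        # appending each candidate token, evaluated exactly per candidate.
        scores = []
        for cid in cand_ids:
            full_ids = torch.cat([prefix_ids, cid.view(1, 1), target_ids], dim=1)
            logits = model(full_ids).logits
            tgt_len = target_ids.shape[1]
            # Logits at position i predict token i + 1, hence the shift by one.
            logp = torch.log_softmax(logits[0, -tgt_len - 1:-1], dim=-1)
            jailbreak = logp.gather(1, target_ids[0].unsqueeze(1)).sum()
            scores.append(jailbreak + w_readability * readability[cid])

    best = cand_ids[torch.stack(scores).argmax()]
    adv_ids = torch.cat([adv_ids, best.view(1, 1)], dim=1)

print("Adversarial suffix:", tok.decode(adv_ids[0]))
```

Because every selected token must also score well under the language model itself, the resulting suffix tends toward fluent text, which is what lets such prompts slip past perplexity-based filters while still steering the model toward the target response.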