Jailbreaking Black Box Large Language Models in Twenty Queries (2310.08419v4)
Abstract: There is growing interest in ensuring that LLMs align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini.
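
The abstract describes PAIR's core loop: an attacker LLM proposes a candidate jailbreak, queries the target model, and refines the candidate based on the response until the query budget is exhausted. The following is a minimal sketch of that loop under stated assumptions: the callables `query_attacker`, `query_target`, and `is_jailbroken` are hypothetical placeholders for the black-box model calls and the success check, not the authors' implementation, and the 20-query budget simply mirrors the figure quoted in the abstract.

```python
"""Minimal sketch of a PAIR-style iterative refinement loop.

The three callables are hypothetical stand-ins for black-box LLM access and a
success check; they are assumptions for illustration, not the paper's code.
"""
from typing import Callable, List, Optional, Tuple


def pair_attack(
    objective: str,
    query_attacker: Callable[[str, List[Tuple[str, str]]], str],  # attacker LLM: proposes/refines a prompt
    query_target: Callable[[str], str],                           # target LLM: black-box query
    is_jailbroken: Callable[[str, str], bool],                    # judge: does the response satisfy the objective?
    max_queries: int = 20,
) -> Optional[str]:
    """Iteratively refine a candidate jailbreak prompt against a target LLM."""
    history: List[Tuple[str, str]] = []  # record of (candidate, response) pairs seen so far

    for _ in range(max_queries):
        # 1. Attacker proposes (or refines) a candidate jailbreak, conditioned
        #    on the objective and the target's previous responses.
        candidate = query_attacker(objective, history)

        # 2. Query the target LLM with only black-box access.
        response = query_target(candidate)

        # 3. Stop if the response accomplishes the objective.
        if is_jailbroken(objective, response):
            return candidate

        # 4. Otherwise, feed the failed attempt back for the next refinement.
        history.append((candidate, response))

    return None  # no jailbreak found within the query budget
```

In practice the two query functions would wrap API calls to the attacker and target models, and the success check could be another LLM acting as a judge; those choices are left abstract here since the abstract specifies only black-box access.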