DeepInception: Hypnotize Large Language Model to Be Jailbreaker (2311.03191v5)
Abstract: LLMs have succeeded significantly in various applications but remain susceptible to adversarial jailbreaks that void their safety guardrails. Previous attempts to exploit these vulnerabilities often rely on high-cost computational extrapolations, which may not be practical or efficient. In this paper, inspired by the authority influence demonstrated in the Milgram experiment, we present a lightweight method to take advantage of the LLMs' personification capabilities to construct $\textit{a virtual, nested scene}$, allowing it to realize an adaptive way to escape the usage control in a normal scenario. Empirically, the contents induced by our approach can achieve leading harmfulness rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open-source and closed-source LLMs, $\textit{e.g.}$, Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o. The code and data are available at: https://github.com/tmlr-group/DeepInception.
- Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, 2023.
- On the opportunities and risks of foundation models. In arXiv, 2021.
- Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
- Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023.
- Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 2020.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023.
- Stanley Milgram. Behavioral study of obedience. The Journal of abnormal and social psychology, 67(4):371, 1963.
- Stanley Milgram. Obedience to authority: An experimental view. 1974. URL https://books.google.com.hk/books?id=MlpEAAAAMAAJ.
- Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- Visual adversarial examples jailbreak large language models. arXiv preprint arXiv:2306.13213, 2023.
- " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
- Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. URL www.mosaicml.com/blog/mpt-7b. Accessed: 2023-05-05.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Jailbroken: How does llm safety training fail? In NeurIPS, 2023a.
- Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b.
- Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387, 2023b.
- Defending chatgpt against jailbreak attack via self-reminder. Research Square, 2023.
- A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
- Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.