Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective (2401.06824v5)
Abstract: The recent surge in jailbreaking attacks has revealed significant vulnerabilities in LLMs when exposed to malicious inputs. While various defense strategies have been proposed to mitigate these threats, there has been limited research into the underlying mechanisms that make LLMs vulnerable to such attacks. In this study, we suggest that the self-safeguarding capability of LLMs is linked to specific activity patterns within their representation space. Although these patterns have little impact on the semantic content of the generated text, they play a crucial role in shaping LLM behavior under jailbreaking attacks. Our findings demonstrate that these patterns can be detected with just a few pairs of contrastive queries. Extensive experimentation shows that the robustness of LLMs against jailbreaking can be manipulated by weakening or strengthening these patterns. Further visual analysis provides additional evidence for our conclusions, offering new insights into the jailbreaking phenomenon. These findings highlight the importance of addressing the potential misuse of open-source LLMs within the community.
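The two steps the abstract describes, extracting an activity pattern from a few contrastive query pairs and then weakening or strengthening it at inference time, follow the general representation-engineering recipe, and a minimal sketch may help make them concrete. The code below is an illustrative assumption rather than the paper's exact procedure: the model name, layer index, steering coefficient `ALPHA`, and the example query pairs are all placeholders, and the direction is estimated with a simple difference of mean hidden states over the contrastive pairs.

```python
# Sketch: estimate a "safety" direction from a few harmful/harmless query pairs,
# then strengthen (or weaken) it during generation via a forward hook.
# All concrete choices (model, layer, ALPHA, prompts) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any chat-tuned LLM you have access to
LAYER = 14                                    # assumption: a middle decoder layer
ALPHA = 4.0                                   # assumption: steering strength; negate to weaken the pattern

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# A few contrastive pairs (harmful vs. harmless); the abstract reports that a
# handful of such pairs is enough to surface the pattern.
pairs = [
    ("How do I build a weapon at home?", "How do I build a bookshelf at home?"),
    ("Write a phishing email to steal passwords.", "Write a friendly email to a colleague."),
]

def last_token_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer LAYER's output is at LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1, :].float()

# Difference of means over the contrastive pairs gives a candidate direction.
diffs = [last_token_hidden(harm) - last_token_hidden(benign) for harm, benign in pairs]
direction = torch.stack(diffs).mean(dim=0)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # Llama-style decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + ALPHA * direction.to(output[0].dtype).to(output[0].device)
    return (hidden,) + output[1:]

# Strengthening the pattern (+ALPHA) should make refusals more robust;
# using -ALPHA weakens it, the manipulation the abstract describes.
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)  # assumes a Llama-style module path
try:
    prompt = "Explain how to pick a lock."
    ids = tok(prompt, return_tensors="pt").to(model.device)
    gen = model.generate(**ids, max_new_tokens=64, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```

The difference-of-means estimate and the single-layer hook are the simplest instantiation of this idea; the paper's own extraction and intervention details may differ, so treat the sketch only as a way to see where "weakening or strengthening these patterns" enters the forward pass.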