Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 165 tok/s
Gemini 2.5 Pro 46 tok/s Pro
GPT-5 Medium 27 tok/s Pro
GPT-5 High 27 tok/s Pro
GPT-4o 64 tok/s Pro
Kimi K2 183 tok/s Pro
GPT OSS 120B 432 tok/s Pro
Claude Sonnet 4.5 36 tok/s Pro
2000 character limit reached

Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective (2401.06824v5)

Published 12 Jan 2024 in cs.CL and cs.AI

Abstract: The recent surge in jailbreaking attacks has revealed significant vulnerabilities in LLMs when exposed to malicious inputs. While various defense strategies have been proposed to mitigate these threats, there has been limited research into the underlying mechanisms that make LLMs vulnerable to such attacks. In this study, we suggest that the self-safeguarding capability of LLMs is linked to specific activity patterns within their representation space. Although these patterns have little impact on the semantic content of the generated text, they play a crucial role in shaping LLM behavior under jailbreaking attacks. Our findings demonstrate that these patterns can be detected with just a few pairs of contrastive queries. Extensive experimentation shows that the robustness of LLMs against jailbreaking can be manipulated by weakening or strengthening these patterns. Further visual analysis provides additional evidence for our conclusions, providing new insights into the jailbreaking phenomenon. These findings highlight the importance of addressing the potential misuse of open-source LLMs within the community.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (15)
  1. David L. Barack and John W. Krakauer. 2021. Two views on the cognitive brain. Nature Reviews Neuroscience, page 359–371.
  2. Jailbreaker in jail: Moving target defense for large language models. In Proceedings of the 10th ACM Workshop on Moving Target Defense, pages 29–32.
  3. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  4. Masterkey: Automated jailbreaking of large language model chatbots.
  5. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715.
  6. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. arXiv preprint arXiv:2311.08268.
  7. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446.
  8. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197.
  9. OpenAI. 2022. Introducing chatgpt.
  10. OpenAI OpenAI. 2023. Gpt-4 technical report.
  11. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387.
  12. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
  13. Red teaming chatgpt via jailbreaking: Bias, robustness, reliability and toxicity. arXiv preprint arXiv:2301.12867, pages 12–2.
  14. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405.
  15. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
Citations (12)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

X Twitter Logo Streamline Icon: https://streamlinehq.com

Tweets

This paper has been mentioned in 1 tweet and received 1 like.

Upgrade to Pro to view all of the tweets about this paper: