
Don't Say No: Jailbreaking LLM by Suppressing Refusal (2404.16369v3)

Published 25 Apr 2024 in cs.CL

Abstract: Ensuring the safety alignment of LLMs is critical for generating responses consistent with human values. However, LLMs remain vulnerable to jailbreaking attacks, in which carefully crafted prompts manipulate them into producing toxic content. One category of such attacks reformulates the task as an optimization problem, aiming to elicit affirmative responses from the LLM. However, these methods rely heavily on predefined objectionable behaviors, limiting their effectiveness and adaptability to diverse harmful queries. In this study, we first identify why the vanilla target loss is suboptimal and then propose enhancements to the loss objective. We introduce the DSN (Don't Say No) attack, which combines a cosine decay schedule with refusal suppression to achieve higher success rates. Extensive experiments demonstrate that DSN outperforms baseline attacks and achieves state-of-the-art attack success rates (ASR). DSN also shows strong universality and transferability to unseen datasets and black-box models.
