Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue (2402.17262v2)

Published 27 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs have been shown to generate illegal or unethical responses, particularly when subjected to "jailbreak" attacks. Research on jailbreaking has highlighted the safety issues of LLMs. However, prior studies have predominantly focused on single-turn dialogue, ignoring the complexities and risks posed by multi-turn dialogue, a crucial mode through which humans obtain information from LLMs. In this paper, we argue that humans can exploit multi-turn dialogue to induce LLMs into generating harmful information: LLMs tend not to reject cautiously worded or borderline unsafe queries, even when each turn of a multi-turn dialogue serves a single malicious purpose. Accordingly, by decomposing an unsafe query into several sub-queries spread across a multi-turn dialogue, we induce LLMs to answer harmful sub-questions incrementally, culminating in an overall harmful response. Our experiments, conducted across a wide range of LLMs, reveal current inadequacies in their safety mechanisms in multi-turn dialogue. These findings expose vulnerabilities of LLMs in complex multi-turn scenarios and present new challenges for LLM safety.
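
The paper does not ship an implementation, but the attack it describes reduces to ordinary multi-turn chat mechanics: an unsafe query is decomposed into an ordered list of sub-queries, each sent as a new user turn while the accumulated conversation history stays in context. The sketch below shows only that dialogue scaffolding, assuming an OpenAI-style chat-completions client; the sub-queries, model name, and final judging step are benign placeholders, not the paper's actual prompts or evaluation setup.

```python
# Minimal multi-turn dialogue harness (sketch). Assumes an OpenAI-style
# chat-completions client and an API key in the environment; the sub-queries
# below are harmless stand-ins, not the paper's decompositions of unsafe queries.
from openai import OpenAI

client = OpenAI()

# In the paper's setting, one unsafe query is decomposed into an ordered
# list of sub-queries; placeholders are used here.
sub_queries = [
    "Placeholder sub-query 1",
    "Placeholder sub-query 2",
    "Placeholder sub-query 3",
]

messages = []  # accumulated dialogue history shared across turns
for query in sub_queries:
    messages.append({"role": "user", "content": query})
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model under evaluation
        messages=messages,
    )
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})

# `messages` now holds the full transcript; the paper scores such transcripts
# for harmfulness (e.g., with an LLM judge), a step omitted here.
```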
