Multilingual Jailbreak Challenges in Large Language Models (2310.06474v3)
Abstract: While LLMs exhibit remarkable capabilities across a wide range of tasks, they pose potential safety concerns, such as the ``jailbreak'' problem, wherein malicious instructions can manipulate LLMs to exhibit undesirable behavior. Although several preventive measures have been developed to mitigate the potential risks associated with LLMs, they have primarily focused on English. In this study, we reveal the presence of multilingual jailbreak challenges within LLMs and consider two potential risky scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs using non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack LLMs. The experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as the availability of languages decreases. Specifically, low-resource languages exhibit about three times the likelihood of encountering harmful content compared to high-resource languages, with both ChatGPT and GPT-4. In the intentional scenario, multilingual prompts can exacerbate the negative impact of malicious instructions, with astonishingly high rates of unsafe output: 80.92\% for ChatGPT and 40.71\% for GPT-4. To handle such a challenge in the multilingual context, we propose a novel \textsc{Self-Defense} framework that automatically generates multilingual training data for safety fine-tuning. Experimental results show that ChatGPT fine-tuned with such data can achieve a substantial reduction in unsafe content generation. Data is available at \url{https://github.com/DAMO-NLP-SG/multilingual-safety-for-LLMs}.
- Anthropic. Model card and evaluations for claude models. 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR, abs/2204.05862, 2022. URL https://doi.org/10.48550/arXiv.2204.05862.
- A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. CoRR, abs/2302.04023, 2023. URL https://doi.org/10.48550/arXiv.2302.04023.
- Red-teaming large language models using chain of utterances for safety-alignment. ArXiv, abs/2308.09662, 2023. URL https://doi.org/10.48550/arXiv.2308.09662.
- A survey on adversarial attacks and defences. CAAI Trans. Intell. Technol., 6(1):25–45, 2021. URL https://doi.org/10.1049/cit2.12028.
- A survey on evaluation of large language models. CoRR, abs/2307.03109, 2023. URL https://doi.org/10.48550/arXiv.2307.03109.
- How is chatgpt’s behavior changing over time? CoRR, abs/2307.09009, 2023. URL https://doi.org/10.48550/arXiv.2307.09009.
- ChatGPT Goes to Law School, 2023. URL https://papers.ssrn.com/abstract=4335905.
- Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 4299–4307, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html.
- XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 2475–2485. Association for Computational Linguistics, 2018. URL https://doi.org/10.18653/v1/d18-1269.
- Jailbreaker: Automated jailbreak across multiple large language model chatbots. CoRR, abs/2307.08715, 2023. URL https://doi.org/10.48550/arXiv.2307.08715.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. CoRR, abs/2209.07858, 2022. URL https://doi.org/10.48550/arXiv.2209.07858.
- Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp. 3309–3326, 2022. URL https://doi.org/10.18653/v1/2022.acl-long.234.
- Julian Hazell. Large language models can be used to effectively scale spear phishing campaigns. CoRR, abs/2305.06972, 2023. URL https://doi.org/10.48550/arXiv.2305.06972.
- Is chatgpt A good translator? A preliminary study. CoRR, abs/2301.08745, 2023. URL https://doi.org/10.48550/arXiv.2301.08745.
- Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. CoRR, abs/2304.05613, 2023. URL https://doi.org/10.48550/arXiv.2304.05613.
- Multi-step jailbreaking privacy attacks on chatgpt. CoRR, abs/2304.05197, 2023. URL https://doi.org/10.48550/arXiv.2304.05197.
- Common sense beyond english: Evaluating and improving multilingual language models for commonsense reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 1274–1287. Association for Computational Linguistics, 2021. URL https://doi.org/10.18653/v1/2021.acl-long.102.
- Jailbreaking chatgpt via prompt engineering: An empirical study. CoRR, abs/2305.13860, 2023. URL https://doi.org/10.48550/arXiv.2305.13860.
- A holistic approach to undesired content detection in the real world. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pp. 15009–15018. AAAI Press, 2023. URL https://doi.org/10.1609/aaai.v37i12.26752.
- OpenAI. Chatgpt. 2023a. URL https://openai.com/chatgpt.
- OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023b. URL https://doi.org/10.48550/arXiv.2303.08774.
- Training language models to follow instructions with human feedback. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
- Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pp. 3419–3448, 2022. URL https://doi.org/10.18653/v1/2022.emnlp-main.225.
- Is chatgpt a general-purpose natural language processing task solver? CoRR, abs/2302.06476, 2023. URL https://doi.org/10.48550/arXiv.2302.06476.
- Exploring new frontiers in agricultural NLP: investigating the potential of large language models for food applications. CoRR, abs/2306.11892, 2023. URL https://doi.org/10.48550/arXiv.2306.11892.
- ”do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. CoRR, abs/2308.03825, 2023. URL https://doi.org/10.48550/arXiv.2308.03825.
- Large language models encode clinical knowledge. CoRR, abs/2212.13138, 2022. URL https://doi.org/10.48550/arXiv.2212.13138.
- Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023. URL https://doi.org/10.48550/arXiv.2307.09288.
- Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 13484–13508, 2023. URL https://doi.org/10.18653/v1/2023.acl-long.754.
- Jailbroken: How does LLM safety training fail? CoRR, abs/2307.02483, 2023. URL https://doi.org/10.48550/arXiv.2307.02483.
- Challenges in detoxifying language models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, pp. 2447–2469. Association for Computational Linguistics, 2021. URL https://doi.org/10.18653/v1/2021.findings-emnlp.210.
- GPT-4 is too smart to be safe: Stealthy chat with llms via cipher. CoRR, abs/2308.06463, 2023. URL https://doi.org/10.48550/arXiv.2308.06463.
- M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. CoRR, abs/2306.05179, 2023a. URL https://doi.org/10.48550/arXiv.2306.05179.
- Sentiment analysis in the era of large language models: A reality check. CoRR, abs/2305.15005, 2023b. URL https://doi.org/10.48550/arXiv.2305.15005.
- Universal and transferable adversarial attacks on aligned language models. CoRR, abs/2307.15043, 2023. URL https://doi.org/10.48550/arXiv.2307.15043.