Low-Resource Languages Jailbreak GPT-4 (2310.02446v2)

Published 3 Oct 2023 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract: AI safety training and red-teaming of LLMs are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks. Other high-/mid-resource languages have significantly lower attack success rate, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affects speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLMs users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for a more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.

Summary

The paper shows that translating unsafe English inputs into low-resource languages bypasses GPT-4's safeguards 79% of the time on AdvBench.
It identifies the unequal representation of languages in safety training data as a key reason the model's safeguards fail to carry over across languages.
The study calls for robust multilingual safeguards and red-teaming with wide language coverage, so that safety holds across diverse linguistic contexts.

Assessment of Cross-Lingual Safety Vulnerabilities in GPT-4

The paper "Low-Resource Languages Jailbreak GPT-4" examines how well GPT-4's safety mechanisms hold up across languages. The authors show that the linguistic inequality of safety training data leaves gaps that can be exploited by translating unsafe English inputs into low-resource languages.

The attack translates unsafe English inputs into low-resource languages using publicly available translation APIs such as Google Translate. On the AdvBench benchmark, GPT-4 engaged with the translated inputs and produced harmful responses 79% of the time, a rate on par with or exceeding state-of-the-art jailbreaking attacks. Attack success rates for high- and mid-resource languages were markedly lower, which suggests that the weakness is concentrated in low-resource languages.
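The attack success rate discussed above is a simple proportion: the share of evaluated inputs for which the model's response was judged to have bypassed its safeguards. A minimal sketch of that arithmetic follows, assuming each response has already been annotated by a reviewer. The label names (`BYPASS`, `REJECT`, `UNCLEAR`), the group names, and the toy records are illustrative assumptions, not data or code from the paper.

```python
from collections import defaultdict

# Hypothetical annotation labels; this summary does not specify a labeling scheme.
BYPASS, REJECT, UNCLEAR = "BYPASS", "REJECT", "UNCLEAR"


def attack_success_rate(records):
    """Return {group: fraction of that group's records labeled BYPASS}.

    `records` is an iterable of (group, label) pairs, where `group` is an
    arbitrary key such as a language or a resource level.
    """
    totals = defaultdict(int)
    bypassed = defaultdict(int)
    for group, label in records:
        totals[group] += 1
        if label == BYPASS:
            bypassed[group] += 1
    return {group: bypassed[group] / totals[group] for group in totals}


# Made-up toy annotations for illustration only (not results from the paper).
toy_records = [
    ("low", BYPASS), ("low", BYPASS), ("low", REJECT), ("low", BYPASS),
    ("high", REJECT), ("high", REJECT), ("high", UNCLEAR), ("high", BYPASS),
]
rates = attack_success_rate(toy_records)
print(rates)
```

Grouping by resource level rather than by individual language makes the comparison in the text (low-resource versus high- and mid-resource) a direct lookup in the returned dictionary.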

The authors advance several compelling arguments and implications:

Cross-Lingual AI Vulnerability: The paper shows that GPT-4, and likely other LLMs, exhibit significant safety lapses when prompted in low-resource languages. Historically, limited training on low-resource languages mainly affected speakers of those languages, reducing the accessibility and usefulness of the technology for them. The findings indicate a broader risk: the gap can now be exploited by any user, regardless of the language they speak. Freely available automated translation services make this easy, since an attacker does not need to know the target language.
Imbalanced Linguistic Representation: The vulnerability reflects a persistent imbalance in how languages are represented in safety training. GPT-4's safety mechanisms do not generalize adequately across languages, a shortcoming the authors attribute to the linguistic inequality of safety training data. Safety measures therefore need to cover low-resource languages rather than only the languages that dominate the training data.
Necessity for Multilingual Safety Protocols: The paper concludes that red-teaming must expand beyond monolingual, predominantly English-centric frameworks. A model may pass English-centric safety tests and still be unsafe in deployment, because models like GPT-4 are used across multilingual platforms and use cases and must withstand multilingual attacks. Developing datasets and benchmarks for multilingual safety evaluation is therefore necessary for establishing comprehensive safety standards.
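One way to make the call for multilingual red-teaming concrete is a check that compares each language's refusal rate on unsafe inputs against the English baseline and flags languages that fall well short. The sketch below is an illustration under stated assumptions, not a method from the paper: the function name `flag_safety_gaps`, the `max_drop` threshold, the language codes, and the toy rates are all hypothetical.

```python
def flag_safety_gaps(refusal_rates, baseline_lang="en", max_drop=0.10):
    """Return the languages whose refusal rate trails the baseline by more than max_drop.

    `refusal_rates` maps a language code to the fraction of unsafe inputs
    the model refused in that language. The 0.10 default is an arbitrary
    illustrative threshold, not a recommended value.
    """
    baseline = refusal_rates[baseline_lang]
    return sorted(
        lang
        for lang, rate in refusal_rates.items()
        if baseline - rate > max_drop
    )


# Made-up toy refusal rates for illustration only (not results from the paper).
toy_rates = {"en": 0.99, "fr": 0.97, "zu": 0.45, "gd": 0.55}
flagged = flag_safety_gaps(toy_rates)
print(flagged)
```

A check of this shape could run alongside existing English-only safety tests, so that a model passing in English but failing elsewhere is caught before deployment.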

Taken together, these points make the case for more rigorous safety evaluation across languages, so that LLMs like GPT-4 behave safely regardless of the language in which they are prompted. Future work might investigate why translation-based attacks succeed and explore scalable ways to improve safety across different LLMs without reducing performance or accessibility for speakers of low-resource languages.

