Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? (2404.03411v2)

Published 4 Apr 2024 in cs.LG, cs.CL, and cs.CR

Abstract: Various jailbreak attacks have been proposed to red-team LLMs and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal LLMs (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproduction and fair comparison. Besides, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs, such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluated results and find that (1) GPT4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. (2) Llama2 and Qwen-VL-Chat are more robust compared to other open-source models. (3) The transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found https://github.com/chenxshuo/RedTeamingGPT4V

References (35)

Citations (13)

View on Semantic Scholar

Summary

The paper introduces a comprehensive evaluation dataset with 1445 questions across 11 safety policies to assess jailbreak attacks on LLMs and MLLMs.
The paper finds that GPT-4 and GPT-4V offer superior robustness compared to open-source models, with textual methods showing higher transferability than visual attacks.
The paper highlights the need for enhanced safety regulations and defenses to mitigate evolving jailbreak techniques in multimodal language models.

Comprehensive Examination of Jailbreak Attacks Against GPT-4 and Multimodal LLMs

Introduction to Jailbreak Attacks

Jailbreak attacks on LLMs and Multimodal LLMs (MLLMs) pose significant risks as they can elicit harmful or unethical responses from models designed to avoid generating such content. This work explores evaluating the robustness of state-of-the-art (SOTA) proprietary and open-source models, including GPT-4 and GPT-4V, against an array of textual and visual jailbreak attack methods. Prior methodologies lack a universal benchmark for fair performance comparison, and there is a notable scarcity in the comprehensive assessment of commercial, top-tier models against jailbreak attacks. This gap is bridged by introducing a meticulously curated jailbreak evaluation dataset comprising 1445 questions spread across 11 different safety policies. The investigation extends to 11 different LLMs and MLLMs, revealing nuances in model robustness and method transferability.

Dataset and Experimentation Framework

To scaffold a universal evaluation framework, a broad and diverse jailbreak dataset was assembled from existing literature, covering a spectrum of harmful behaviors and questions across 11 varied safety policies. The dataset serves as the foundation for exhaustive red-teaming experiments on both proprietary (GPT-4, GPT-4V) and open-source models (Llama2, MiniGPT4). Techniques employed in these experiments ranged from hand-crafted modifications to sophisticated optimization-based attacks, aiming to skirt around the models' built-in safety measures.

Key Findings from Red-Teaming Experiments

Model Robustness Against Jailbreak Attacks

GPT-4 and GPT-4V exhibit superior robustness over their open-source counterparts, displaying a lower susceptibility to both textual and visual jailbreak methods.
Among the open-source models assessed, Llama2 emerges as notably robust, presenting a compelling case for its safety alignment training, despite being more vulnerable to certain automatic jailbreak methods than GPT-4.
Transferability of Jailbreak Methods: The paper found that textual modification methods, such as AutoDAN, showcased a higher degree of transferability than visual methods when employed against different models.

Insights into Jailbreak Methodologies

No singular jailbreak method proved universally dominant across all models tested, underscoring the diversity in model vulnerabilities and the nuanced nature of model defenses.
Visual jailbreak methods, despite their conceptual appeal, demonstrated limited efficacy against GPT-4V, hinting at robust underlying mechanisms to counter such attacks.

Implications and Future Directions

The differential robustness of proprietary models like GPT-4 and GPT-4V compared to open-source variants underscores a significant gap that merits further exploration. Specifically, the paper illuminates the critical need for advancing safety regulations and defenses in LLMs and MLLMs, especially as these models become increasingly integrated into real-world applications. Additionally, it hints at the potential for future work to focus on refining visual jailbreak methodologies and exploring more sophisticated transferability mechanisms.

The profound insights garnered from the extensive red-teaming effort offer a granular view into the current state of model vulnerabilities and defenses against jailbreak attacks. Moving forward, this work will undoubtedly spur further research into developing more resilient AI models, driving the evolution of effective countermeasures against evolving threats in the rapidly advancing landscape of LLMs and MLLMs.

PDF Markdown

Related Papers

Tweets

https://twitter.com/fly51fly/status/1776734779381285063

https://twitter.com/knishimae0531/status/1776875533856649687