Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?

(2404.03411)
Published Apr 4, 2024 in cs.LG , cs.CL , and cs.CR

Abstract

Various jailbreak attacks have been proposed to red-team LLMs and have revealed the vulnerability of their safeguards. Moreover, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal LLMs (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates performance reproduction and fair comparison. In addition, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. A deep analysis of the results finds that (1) GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs, (2) Llama2 and Qwen-VL-Chat are more robust than other open-source models, and (3) the transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found at https://anonymous.4open.science/r/red_teaming_gpt4-C1CE/README.md

Overview

  • This paper investigates the robustness of GPT-4 and GPT-4V against jailbreak attacks, focusing on both textual and visual methods across an array of LLMs and Multimodal LLMs (MLLMs).

  • A comprehensive jailbreak evaluation dataset comprising 1445 questions across 11 safety policies is introduced to provide a fair benchmark for assessing model performance.

  • Key findings reveal GPT-4 and GPT-4V's superior robustness compared to open-source alternatives, with differences noted in the transferability of jailbreak methods.

  • The study highlights the importance of advancing safety measures for LLMs and MLLMs and suggests potential avenues for future research in counteracting jailbreak attacks.

Comprehensive Examination of Jailbreak Attacks Against GPT-4 and Multimodal Language Models

Introduction to Jailbreak Attacks

Jailbreak attacks on LLMs and Multimodal LLMs (MLLMs) pose significant risks because they can elicit harmful or unethical responses from models designed to avoid generating such content. This work evaluates the robustness of state-of-the-art (SOTA) proprietary and open-source models, including GPT-4 and GPT-4V, against an array of textual and visual jailbreak attack methods. Prior methodologies lack a universal benchmark for fair performance comparison, and comprehensive assessments of top-tier commercial models against jailbreak attacks are scarce. This gap is bridged by introducing a meticulously curated jailbreak evaluation dataset comprising 1445 questions spread across 11 different safety policies. The investigation extends to 11 different LLMs and MLLMs, revealing nuances in model robustness and method transferability.

Dataset and Experimentation Framework

To provide a universal evaluation framework, a broad and diverse jailbreak dataset was assembled from existing literature, covering a spectrum of harmful behaviors and questions across 11 varied safety policies. The dataset serves as the foundation for exhaustive red-teaming experiments on both proprietary (GPT-4, GPT-4V) and open-source models (Llama2, MiniGPT4). The techniques employed in these experiments range from hand-crafted prompt modifications to sophisticated optimization-based attacks, all aiming to circumvent the models' built-in safety measures.
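To make the experimental pipeline concrete, below is a minimal sketch of what such a red-teaming harness could look like, assuming a dataset of policy-tagged harmful questions, a set of jailbreak prompt transforms, a model-query wrapper, and a refusal judge. All names (HarmfulQuestion, evaluate, query_model, is_refusal) are illustrative placeholders, not the authors' released code.

```python
# Illustrative sketch of a red-teaming evaluation loop over a jailbreak dataset.
# Placeholder names; not the paper's actual implementation.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HarmfulQuestion:
    text: str    # the harmful question itself
    policy: str  # one of the 11 safety-policy categories


def evaluate(
    questions: List[HarmfulQuestion],
    jailbreaks: Dict[str, Callable[[str], str]],  # method name -> prompt transform
    query_model: Callable[[str], str],            # wraps an LLM/MLLM API call
    is_refusal: Callable[[str], bool],            # judge: did the model refuse?
) -> Dict[str, float]:
    """Return the attack success rate (ASR) of each jailbreak method."""
    asr = {}
    for name, transform in jailbreaks.items():
        successes = 0
        for q in questions:
            prompt = transform(q.text)    # e.g. a hand-crafted template
            response = query_model(prompt)
            if not is_refusal(response):  # a non-refusal counts as a success
                successes += 1
        asr[name] = successes / len(questions)
    return asr


if __name__ == "__main__":
    # Toy run with stub components, purely to show the control flow.
    questions = [HarmfulQuestion("example harmful question", "policy-1")]
    jailbreaks = {"none": lambda q: q}
    print(evaluate(questions, jailbreaks,
                   query_model=lambda p: "I cannot help with that.",
                   is_refusal=lambda r: r.lower().startswith("i cannot")))
```

In this framing, each jailbreak method is just a prompt (or image) transform plugged into the same loop, which is what allows the 11 models and multiple attack methods to be compared on a single benchmark.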

Key Findings from Red-Teaming Experiments

Model Robustness Against Jailbreak Attacks

  • GPT-4 and GPT-4V exhibit superior robustness over their open-source counterparts, displaying a lower susceptibility to both textual and visual jailbreak methods.
  • Among the open-source models assessed, Llama2 emerges as notably robust, presenting a compelling case for its safety alignment training, despite being more vulnerable to certain automatic jailbreak methods than GPT-4.
  • Transferability of Jailbreak Methods: The study found that textual modification methods, such as AutoDAN, showed a higher degree of transferability than visual methods when employed against different models (see the sketch after this list).
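The transferability finding can be made concrete with a small sketch: attacks crafted against one (source) model are replayed against other (target) models, and the resulting attack success rates are compared. The function below is a hypothetical illustration, not the paper's evaluation code; is_refusal again stands in for whatever refusal judge is used.

```python
# Hypothetical sketch of tabulating cross-model transferability of jailbreak attacks.
# Placeholder names; not the paper's actual implementation.

from typing import Callable, Dict, List


def transfer_matrix(
    attacks: Dict[str, List[str]],            # source model -> adversarial prompts crafted on it
    models: Dict[str, Callable[[str], str]],  # target model name -> query function
    is_refusal: Callable[[str], bool],        # judge: did the target model refuse?
) -> Dict[str, Dict[str, float]]:
    """ASR of prompts optimized on one model when replayed against every target model."""
    matrix = {}
    for source, prompts in attacks.items():
        matrix[source] = {}
        for target, query in models.items():
            hits = sum(1 for p in prompts if not is_refusal(query(p)))
            matrix[source][target] = hits / len(prompts)
    return matrix
```

Reading across a row of the returned matrix shows how well attacks crafted on one source model carry over to each target; the paper's observation is that this carry-over is noticeably weaker for visual attacks than for textual ones such as AutoDAN.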

Insights into Jailbreak Methodologies

  • No singular jailbreak method proved universally dominant across all models tested, underscoring the diversity in model vulnerabilities and the nuanced nature of model defenses.
  • Visual jailbreak methods, despite their conceptual appeal, demonstrated limited efficacy against GPT-4V, hinting at robust underlying mechanisms to counter such attacks.

Implications and Future Directions

The differential robustness of proprietary models like GPT-4 and GPT-4V compared to open-source variants underscores a significant gap that merits further exploration. Specifically, the study illuminates the critical need for advancing safety regulations and defenses in LLMs and MLLMs, especially as these models become increasingly integrated into real-world applications. Additionally, it hints at the potential for future work to focus on refining visual jailbreak methodologies and exploring more sophisticated transferability mechanisms.

The insights garnered from this extensive red-teaming effort offer a granular view of the current state of model vulnerabilities and defenses against jailbreak attacks. Moving forward, this work should spur further research into developing more resilient models and effective countermeasures against evolving threats in the rapidly advancing landscape of LLMs and MLLMs.
