
Jailbreaking Attack against Multimodal Large Language Model

(2402.02309)
Published Feb 4, 2024 in cs.LG , cs.CL , cs.CR , and cs.CV

Abstract

This paper focuses on jailbreaking attacks against multi-modal LLMs (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.

Overview

  • MLLMs integrate textual and visual information, broadening their applicability but also introducing vulnerabilities: they can be manipulated into producing harmful outputs via crafted jailbreaking prompts.

  • The study presents a maximum likelihood-based framework to identify a specific image Jailbreaking Prompt (imgJP) for inducing MLLMs to generate objectionable content, demonstrating effectiveness across various models in a black-box manner.

  • Findings include the data-universal property of imgJPs, whereby a single perturbation jailbreaks across multiple unseen prompts and images, and model-transferability, whereby an imgJP developed on one MLLM remains effective against others.

  • A construction-based method for translating MLLM vulnerabilities to LLM contexts is introduced, offering a more efficient jailbreaking strategy and setting new directions for AI safety and the development of resistant AI technologies.

Exploring the Susceptibility of Multimodal LLMs to Jailbreaking Attacks

Introduction to Multimodal LLMs (MLLMs) Jailbreaking

Multimodal LLMs (MLLMs) incorporate not just textual information but also visual inputs, enhancing their applicability across a wider range of scenarios than their solely text-based counterparts. However, this integration of visual perception introduces vulnerabilities, significantly complicating the models' alignment with ethical guidelines and safety standards. Recent discussions have spotlighted the potential for these models to be "jailbroken", or manipulated into generating outputs that deviate markedly from intended or safe responses. This paper presents a formal examination of jailbreaking MLLMs, focusing on a crafted image Jailbreaking Prompt (imgJP) approach that prompts MLLMs to produce objectionable content in response to harmful queries.

Jailbreaking Techniques and Efficiency

The proposed method centers on a maximum likelihood-based framework to discover a specific imgJP. When this imgJP accompanies a harmful request, it induces the MLLM to yield prohibited or harmful content. The effectiveness of this method has been demonstrated through its application across various models, such as MiniGPT-v2 and LLaVA, in a black-box manner, meaning without specific knowledge of the model’s internal workings. Additionally, the study bridges the gap between MLLM and Large Language Model (LLM) jailbreaks, showcasing a pathway through which insights from MLLM vulnerabilities can translate into LLM contexts, achieving superior efficiency compared to current LLM-jailbreaking methodologies.
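As a rough illustration of this maximum likelihood-based search, the sketch below optimizes an additive image perturbation so that the model assigns high likelihood to a target affirmative response for a batch of harmful prompts. Everything here is an assumption made for illustration: `mllm_loss` is a hypothetical wrapper returning the negative log-likelihood of a target response given an image and a text prompt, and the PGD-style update, step size, and L-infinity budget are not taken from the paper.

```python
import torch

def find_imgjp(mllm_loss, base_image, harmful_prompts, targets,
               epsilon=32 / 255, alpha=1 / 255, steps=500):
    """PGD-style search for an additive image perturbation (imgJP) -- a sketch."""
    delta = torch.zeros_like(base_image, requires_grad=True)
    for _ in range(steps):
        # Negative log-likelihood of the target affirmative responses, summed
        # over a batch of harmful prompts; minimizing it over many prompts is
        # what encourages the data-universal behaviour described above.
        loss = sum(mllm_loss(base_image + delta, p, t)
                   for p, t in zip(harmful_prompts, targets))
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()           # gradient step
            delta.clamp_(-epsilon, epsilon)              # L-infinity budget
            # keep the perturbed image inside the valid pixel range
            delta.copy_((base_image + delta).clamp(0, 1) - base_image)
            delta.grad.zero_()
    return (base_image + delta).detach()                 # the jailbreaking image
```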

Universal Jailbreaking and Model-transferability

A striking finding of this investigation is the strong data-universal property of the imgJP: a single perturbation can induce jailbreaks across multiple unseen prompts and images. This attribute underscores the scalability of the attack and its potential for widespread application. Furthermore, the study reveals notable model-transferability, where imgJPs developed on one MLLM can jailbreak others effectively, a critical discovery for understanding and potentially mitigating black-box attacks on MLLMs.
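The ensemble intuition behind such transferability can be sketched in a few lines. This is an assumption-laden illustration rather than the paper's exact recipe: `surrogate_losses` stands for hypothetical per-model loss wrappers over several white-box surrogate MLLMs, and equal weighting is a simplification.

```python
def ensemble_loss(surrogate_losses, image, prompt, target):
    """Average the target-response loss over several surrogate MLLMs.

    Optimizing the imgJP against an ensemble discourages overfitting to a
    single architecture, which is the usual route to black-box transfer.
    `surrogate_losses` is a hypothetical list of per-model loss wrappers.
    """
    return sum(f(image, prompt, target) for f in surrogate_losses) / len(surrogate_losses)
```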

Construction-based Method for LLM-jailbreaks

An innovative contribution is the construction-based method for translating MLLM vulnerabilities into LLM contexts. By reversing an optimal imgJP into the text space, this method circumvents the inefficiencies of discrete optimization prevalent in LLM-jailbreak attempts, offering a streamlined, effective jailbreaking strategy. Its potency is underscored by its ability to achieve high Attack Success Rates (ASR) with a minimal pool of reversed text Jailbreaking Prompts (txtJP), marking a significant advancement over existing approaches.
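A rough picture of this "reverse into text space" step is given below. The names are assumptions made for illustration: `visual_embeds` stands for the continuous embeddings the perturbed image induces in the model's embedding space, `token_embedding` for the target LLM's token-embedding matrix, and `tokenizer` for a Hugging Face-style tokenizer; nearest-neighbour matching by cosine similarity is one plausible realization, not necessarily the paper's exact de-embedding procedure.

```python
import torch
import torch.nn.functional as F

def reverse_to_txtjp(visual_embeds, token_embedding, tokenizer):
    """Map each continuous imgJP embedding to its nearest vocabulary token."""
    # Cosine similarity between every imgJP embedding position and every
    # token embedding in the LLM's vocabulary.
    sims = F.normalize(visual_embeds, dim=-1) @ F.normalize(token_embedding, dim=-1).T
    token_ids = sims.argmax(dim=-1)                # nearest token per position
    return tokenizer.decode(token_ids.tolist())    # candidate txtJP suffix
```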

Implications for AI Safety and Future Directions

This paper's revelations emphasize the pressing need to advance our understanding of AI safety, particularly in MLLMs. The demonstrated ability to jailbreak such models across various contexts and models speaks to potential risks in deploying these technologies without robust safeguards. The study also sets a precedent for future research to explore mechanisms of resistance against such attacks. Further examination into the construction-based method for text-based models might yield insights into developing more resilient AI models aligned with ethical and safety standards.

In conclusion, the findings reflect both the vulnerabilities inherent in MLLMs and the burgeoning techniques to exploit them. While the paper does not advocate for the malicious application of these insights, it underscores a critical challenge in AI development: reinforcing the alignment and safety of models in the face of evolving adversarial tactics. Ongoing work in this field is expected not only to bolster the security of AI systems but also to refine their ethical alignment for responsible use.
