
Jailbreaking Attack against Multimodal Large Language Model

(2402.02309)
Published Feb 4, 2024 in cs.LG , cs.CL , cs.CR , and cs.CV

Abstract

This paper focuses on jailbreaking attacks against multi-modal LLMs (MLLMs), seeking to elicit MLLMs to generate objectionable responses to harmful user queries. A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP), enabling jailbreaks against MLLMs across multiple unseen prompts and images (i.e., data-universal property). Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2, in a black-box manner. Moreover, we reveal a connection between MLLM-jailbreaks and LLM-jailbreaks. As a result, we introduce a construction-based method to harness our approach for LLM-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. The code is available here. Warning: some content generated by language models may be offensive to some readers.

Overview

  • MLLMs integrate textual and visual information, broadening their applicability but also introducing vulnerabilities: they can be manipulated into producing harmful outputs via crafted jailbreaking prompts.

  • The study presents a maximum likelihood-based framework to identify a specific image Jailbreaking Prompt (imgJP) for inducing MLLMs to generate objectionable content, demonstrating effectiveness across various models in a black-box manner.

  • Findings include the data-universal property of imgJPs, whereby a single perturbation jailbreaks across multiple unseen prompts and images, and model-transferability, whereby an imgJP developed on one MLLM remains effective against others.

  • A construction-based method for translating MLLM vulnerabilities to LLM contexts is introduced, offering a more efficient jailbreaking strategy and setting new directions for AI safety and the development of resistant AI technologies.

Exploring the Susceptibility of Multimodal LLMs to Jailbreaking Attacks

Introduction to Multimodal LLMs (MLLMs) Jailbreaking

Multimodal LLMs (MLLMs) incorporate not just textual information but also visual inputs, enhancing their applicability across a wider range of scenarios than their solely text-based counterparts. However, this integration of visual perception introduces vulnerabilities, significantly complicating the models' alignment with ethical guidelines and safety standards. Recent discussions have spotlighted the potential for these models to be "jailbroken", or manipulated into generating outputs that deviate markedly from intended or safe responses. This paper presents a formal examination of jailbreaking MLLMs, focusing on a crafted image Jailbreaking Prompt (imgJP) approach that prompts MLLMs to produce objectionable content in response to harmful queries.

Jailbreaking Techniques and Efficiency

The proposed method centers on a maximum likelihood-based framework to discover a specific imgJP. When this imgJP accompanies a harmful request, it induces the MLLM to yield prohibited or harmful content. The effectiveness of this method has been demonstrated through its application across various models, such as MiniGPT-v2 and LLaVA, in a black-box manner, meaning without specific knowledge of the model’s internal workings. Additionally, the study bridges the gap between MLLM and Large Language Model (LLM) jailbreaks, showcasing a pathway through which insights from MLLM vulnerabilities can translate into LLM contexts, achieving superior efficiency compared to current LLM-jailbreaking methodologies.
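As a rough illustration of this maximum likelihood-based search, the sketch below optimizes an additive image perturbation so that the model assigns high likelihood to a target affirmative response for a batch of harmful prompts. Everything here is an assumption made for illustration: `mllm_loss` is a hypothetical wrapper returning the negative log-likelihood of a target response given an image and a text prompt, and the PGD-style update, step size, and L-infinity budget are not taken from the paper.

```python
import torch

def find_imgjp(mllm_loss, base_image, harmful_prompts, targets,
               epsilon=32 / 255, alpha=1 / 255, steps=500):
    """PGD-style search for an additive image perturbation (imgJP) -- a sketch."""
    delta = torch.zeros_like(base_image, requires_grad=True)
    for _ in range(steps):
        # Negative log-likelihood of the target affirmative responses, summed
        # over a batch of harmful prompts; minimizing it over many prompts is
        # what encourages the data-universal behaviour described above.
        loss = sum(mllm_loss(base_image + delta, p, t)
                   for p, t in zip(harmful_prompts, targets))
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()           # gradient step
            delta.clamp_(-epsilon, epsilon)              # L-infinity budget
            # keep the perturbed image inside the valid pixel range
            delta.copy_((base_image + delta).clamp(0, 1) - base_image)
            delta.grad.zero_()
    return (base_image + delta).detach()                 # the jailbreaking image
```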

Universal Jailbreaking and Model-transferability

A striking finding of this investigation is the strong data-universal property of the imgJP: a single perturbation can induce jailbreaks across multiple unseen prompts and images. This attribute underscores the scalability of the attack and its potential for widespread application. Furthermore, the study reveals notable model-transferability, where imgJPs developed on one MLLM can jailbreak others effectively, a critical discovery for understanding and potentially mitigating black-box attacks on MLLMs.
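The ensemble intuition behind such transferability can be sketched in a few lines. This is an assumption-laden illustration rather than the paper's exact recipe: `surrogate_losses` stands for hypothetical per-model loss wrappers over several white-box surrogate MLLMs, and equal weighting is a simplification.

```python
def ensemble_loss(surrogate_losses, image, prompt, target):
    """Average the target-response loss over several surrogate MLLMs.

    Optimizing the imgJP against an ensemble discourages overfitting to a
    single architecture, which is the usual route to black-box transfer.
    `surrogate_losses` is a hypothetical list of per-model loss wrappers.
    """
    return sum(f(image, prompt, target) for f in surrogate_losses) / len(surrogate_losses)
```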

Construction-based Method for LLM-jailbreaks

An innovative contribution is the construction-based method for translating MLLM vulnerabilities into LLM contexts. By reversing an optimal imgJP into the text space, this method circumvents the inefficiencies of discrete optimization prevalent in LLM-jailbreak attempts, offering a streamlined, effective jailbreaking strategy. Its potency is underscored by its ability to achieve high Attack Success Rates (ASR) with a minimal pool of reversed text Jailbreaking Prompts (txtJP), marking a significant advancement over existing approaches.
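A rough picture of this "reverse into text space" step is given below. The names are assumptions made for illustration: `visual_embeds` stands for the continuous embeddings the perturbed image induces in the model's embedding space, `token_embedding` for the target LLM's token-embedding matrix, and `tokenizer` for a Hugging Face-style tokenizer; nearest-neighbour matching by cosine similarity is one plausible realization, not necessarily the paper's exact de-embedding procedure.

```python
import torch
import torch.nn.functional as F

def reverse_to_txtjp(visual_embeds, token_embedding, tokenizer):
    """Map each continuous imgJP embedding to its nearest vocabulary token."""
    # Cosine similarity between every imgJP embedding position and every
    # token embedding in the LLM's vocabulary.
    sims = F.normalize(visual_embeds, dim=-1) @ F.normalize(token_embedding, dim=-1).T
    token_ids = sims.argmax(dim=-1)                # nearest token per position
    return tokenizer.decode(token_ids.tolist())    # candidate txtJP suffix
```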

Implications for AI Safety and Future Directions

This paper's revelations emphasize the pressing need to advance our understanding of AI safety, particularly in MLLMs. The demonstrated ability to jailbreak such models across various contexts and models speaks to potential risks in deploying these technologies without robust safeguards. The study also sets a precedent for future research to explore mechanisms of resistance against such attacks. Further examination into the construction-based method for text-based models might yield insights into developing more resilient AI models aligned with ethical and safety standards.

In conclusion, the findings reflect both the vulnerabilities inherent in MLLMs and the burgeoning techniques to exploit them. While the paper does not advocate for the malicious application of these insights, it underscores a critical challenge in AI development: reinforcing the alignment and safety of models in the face of evolving adversarial tactics. Ongoing work in this field is expected not only to bolster the security of AI systems but also to refine their ethical alignment for responsible use.
