JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks (2404.03027v4)
Abstract: With the rapid advancement of Multimodal Large Language Models (MLLMs), securing these models against malicious inputs while aligning them with human values has emerged as a critical challenge. In this paper, we investigate an important yet unexplored question: whether techniques that successfully jailbreak LLMs are equally effective in jailbreaking MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering benchmark designed to assess the transferability of LLM jailbreak techniques to MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak attacks. Using a dataset of 2,000 malicious queries, also proposed in this paper, we generate 20,000 text-based jailbreak prompts with advanced LLM jailbreak attacks, alongside 8,000 image-based jailbreak inputs from recent MLLM jailbreak attacks; the resulting dataset comprises 28,000 test cases spanning a spectrum of adversarial scenarios. Our evaluation of 10 open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks transferred from LLMs, highlighting a critical vulnerability in MLLMs that stems from their text-processing capabilities. Our findings underscore the urgent need for future research to address alignment vulnerabilities in MLLMs arising from both textual and visual inputs.
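To make the headline metric concrete: the Attack Success Rate (ASR) is the fraction of test cases for which the model under test produces a response judged harmful (i.e., it does not refuse). Below is a minimal sketch of this evaluation loop; the `TestCase` structure, `respond`, and `is_harmful` callables are illustrative assumptions, not the paper's actual implementation (the judge would typically be an LLM-based safety classifier).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class TestCase:
    """One benchmark test case: a jailbreak prompt, optionally paired
    with an image. Text-based attacks transferred from LLMs carry no
    image; image-based MLLM attacks set `image_path`."""
    prompt: str
    image_path: Optional[str] = None

def attack_success_rate(
    cases: Iterable[TestCase],
    respond: Callable[[TestCase], str],   # the MLLM under evaluation
    is_harmful: Callable[[str], bool],    # safety judge over responses
) -> float:
    """ASR = (# responses judged harmful) / (# test cases)."""
    cases = list(cases)
    if not cases:
        raise ValueError("empty test set")
    successes = sum(is_harmful(respond(case)) for case in cases)
    return successes / len(cases)
```

Under this framing, the paper's central comparison is the ASR measured on the 20,000 transferred text-based cases versus the 8,000 image-based cases, computed separately for each of the 10 evaluated MLLMs.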