Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? (2404.03411v2)
Abstract: Various jailbreak attacks have been proposed to red-team LLMs and revealed the vulnerable safeguards of LLMs. Besides, some methods are not limited to the textual modality and extend the jailbreak attack to Multimodal LLMs (MLLMs) by perturbing the visual input. However, the absence of a universal evaluation benchmark complicates the performance reproduction and fair comparison. Besides, there is a lack of comprehensive evaluation of closed-source state-of-the-art (SOTA) models, especially MLLMs, such as GPT-4V. To address these issues, this work first builds a comprehensive jailbreak evaluation dataset with 1445 harmful questions covering 11 different safety policies. Based on this dataset, extensive red-teaming experiments are conducted on 11 different LLMs and MLLMs, including both SOTA proprietary models and open-source models. We then conduct a deep analysis of the evaluated results and find that (1) GPT4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs. (2) Llama2 and Qwen-VL-Chat are more robust compared to other open-source models. (3) The transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods. The dataset and code can be found https://github.com/chenxshuo/RedTeamingGPT4V
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- AdeptAI. Fuyu-8b model card, 2024. https://huggingface.co/adept/fuyu-8b [Accessed: (2024.2.10)].
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- Image hijacks: Adversarial images can control generative models at runtime. arXiv preprint arXiv:2309.00236, 2023.
- Are aligned neural networks adversarially aligned? In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
- Josephus Cheung. Guanaco - generative universal assistant for natural-language adaptive context-aware omnilingual outputs, 2024. https://huggingface.co/JosephusCheung/Guanaco [Accessed: (2024.2.10)].
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023.
- Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335, 2022.
- Figstep: Jailbreaking large vision-language models via typographic visual prompts. arXiv preprint arXiv:2311.05608, 2023.
- Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.
- Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.
- Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446, 2023.
- Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=jTiJPDv82w.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023a.
- Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023b.
- Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860, 2023c.
- OpenAI. Gpt model documentation, 2024. https://platform.openai.com/docs/models/overview [Accessed: (2024.2.10)].
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Visual adversarial examples jailbreak aligned large language models. In The Second Workshop on New Frontiers in Adversarial Machine Learning, 2023.
- Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
- Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. arXiv preprint arXiv:2307.14539, 2023.
- ” do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950, 2023a.
- Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023b.
- Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
- Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
- Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446, 2023.
- Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
- Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
- Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.