MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance (2401.02906v3)
Abstract: The deployment of multimodal LLMs (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. This paper investigates the novel challenge of defending MLLMs against such attacks. Compared to LLMs, MLLMs include an additional image modality. We find that images act as a "foreign language" that is not considered during safety alignment, making MLLMs more prone to producing harmful responses. Unfortunately, unlike the discrete tokens of text-based LLMs, image signals are continuous, which makes it difficult for safety alignment to cover all possible scenarios. This vulnerability is exacerbated by the fact that most state-of-the-art MLLMs are fine-tuned on a limited set of image-text pairs, far smaller than the extensive text-based pretraining corpus, which makes them more prone to catastrophic forgetting of their original abilities during safety fine-tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy that solves two subtasks: 1) identifying harmful responses via a lightweight harm detector, and 2) transforming harmful responses into harmless ones via a detoxifier. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the original performance of the MLLM. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.
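To make the two-stage design concrete, the following is a minimal Python sketch of the plug-and-play pipeline the abstract describes: the base MLLM is left unchanged, a lightweight harm detector flags harmful responses, and a detoxifier rewrites flagged ones. All names and values here (`MLLMProtector`, `generate`, `score`, `rewrite`, the 0.5 threshold) are illustrative assumptions, not the authors' released interface.

```python
# Sketch of the MLLM-Protector inference pipeline (assumed interface, not the
# official implementation). The base MLLM's weights are never modified; the
# protector only post-processes its responses.

from dataclasses import dataclass


@dataclass
class ProtectorOutput:
    response: str        # final text returned to the user
    harm_score: float    # detector's estimated probability of harm
    was_detoxified: bool # whether the detoxifier rewrote the response


class MLLMProtector:
    """Plug-and-play wrapper: detect harmful responses, then detoxify them."""

    def __init__(self, mllm, harm_detector, detoxifier, threshold: float = 0.5):
        self.mllm = mllm                    # frozen multimodal LLM, unchanged
        self.harm_detector = harm_detector  # lightweight classifier -> [0, 1]
        self.detoxifier = detoxifier        # rewrites harmful text harmlessly
        self.threshold = threshold          # assumed decision threshold

    def generate(self, image, prompt: str) -> ProtectorOutput:
        # 1) The base MLLM answers as usual, so its original abilities are
        #    preserved (no safety fine-tuning of the MLLM itself).
        response = self.mllm.generate(image, prompt)

        # 2) The lightweight harm detector scores the generated text response.
        score = self.harm_detector.score(prompt, response)

        # 3) Only responses the detector flags are passed to the detoxifier.
        if score >= self.threshold:
            safe_response = self.detoxifier.rewrite(prompt, response)
            return ProtectorOutput(safe_response, score, True)
        return ProtectorOutput(response, score, False)
```

Because the protector only inspects and rewrites outputs, it can be attached to any existing MLLM without retraining it, which is what keeps the original performance intact.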