Red Teaming Visual Language Models (2401.12915v1)
Abstract: VLMs (Vision-Language Models) extend the capabilities of LLMs to accept multimodal inputs. It has been shown that LLMs can be induced to generate harmful or inaccurate content through specific test cases (a practice termed red teaming); how VLMs behave in similar scenarios, especially given their combination of textual and visual inputs, remains an open question. To explore this problem, we present RTVLM, a novel red-teaming dataset that encompasses 10 subtasks (e.g., image misleading, multimodal jailbreaking, face fairness) under 4 primary aspects (faithfulness, privacy, safety, fairness). RTVLM is the first red-teaming dataset to benchmark current VLMs along these 4 aspects. Detailed analysis shows that 10 prominent open-sourced VLMs struggle with red teaming to varying degrees, with performance gaps of up to 31% relative to GPT-4V. Additionally, we apply red-teaming alignment to LLaVA-v1.5 through supervised fine-tuning (SFT) on RTVLM, which improves performance by 10% on the RTVLM test set and 13% on MM-Hal without a noticeable decline on MM-Bench, surpassing other LLaVA-based models trained with regular alignment data. This indicates that current open-sourced VLMs still lack red-teaming alignment. Our code and datasets will be open-sourced.
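As a concrete illustration of the benchmark structure described above, the following is a minimal sketch of how RTVLM-style examples might be organized and scored per aspect. The field names, subtask labels, and the judge scale are illustrative assumptions, not the paper's released schema.

```python
# Minimal sketch of organizing RTVLM-style red-teaming examples and averaging
# judge scores per aspect. Field names, subtask labels, and the score scale
# are assumptions for illustration, not the dataset's released schema.
from dataclasses import dataclass
from collections import defaultdict
from statistics import mean

@dataclass
class RedTeamExample:
    aspect: str         # one of: "faithfulness", "privacy", "safety", "fairness"
    subtask: str        # e.g. "image_misleading", "multimodal_jailbreak" (assumed labels)
    image_path: str     # test image paired with the prompt
    prompt: str         # adversarial text instruction
    judge_score: float  # score from an external judge model (assumed 1-10 scale)

def aggregate_by_aspect(examples: list[RedTeamExample]) -> dict[str, float]:
    """Average judge scores per aspect, roughly how per-aspect results could be reported."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for ex in examples:
        buckets[ex.aspect].append(ex.judge_score)
    return {aspect: mean(scores) for aspect, scores in buckets.items()}

if __name__ == "__main__":
    demo = [
        RedTeamExample("safety", "multimodal_jailbreak", "images/0001.png",
                       "<adversarial instruction paired with the image>", 3.0),
        RedTeamExample("fairness", "face_fairness", "images/0002.png",
                       "<occupation question about the pictured face>", 7.5),
    ]
    print(aggregate_by_aspect(demo))  # e.g. {'safety': 3.0, 'fairness': 7.5}
```

A structure like this could also feed an SFT pipeline for the alignment experiment sketched in the abstract, with the judge score replaced by a reference response used as the training target.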
- Flamingo: A visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
- OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
- Introducing our multimodal models.
- The secret sharer: Evaluating and testing unintended memorization in neural networks. In Proceedings of the 28th USENIX Conference on Security Symposium, pages 267–284. USENIX Association.
- ShareGPT4V: Improving large multi-modal models with better captions.
- PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565.
- Can language models be instructed to protect personal information?
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500.
- A survey for in-context learning.
- Bias and fairness in large language models: A survey.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
- ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
- LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.
- Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83.
- A hierarchical approach for generating descriptive image paragraphs.
- OBELICS: An open web-scale filtered dataset of interleaved image-text documents.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
- Silkie: Preference distillation for large visual language models.
- M³IT: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387.
- TruthfulQA: Measuring how models mimic human falsehoods.
- Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565.
- Improved baselines with visual instruction tuning.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
- Stable bias: Analyzing societal representations in diffusion models.
- OpenAI. 2023. GPT-4V(ision) system card.
- Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277.
- Red teaming language models with language models. arXiv preprint arXiv:2202.03286.
- True few-shot learning with language models. arXiv.
- Visual adversarial examples jailbreak aligned large language models. In The Second Workshop on New Frontiers in Adversarial Machine Learning.
- Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
- Aligning large multimodal models with factually augmented RLHF.
- Knowledge mining with scene text for fine-grained recognition. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4614–4623.
- Large language models are not fair evaluators.
- Self-instruct: Aligning language models with self-generated instructions.
- GPT-4V(ision) as a generalist evaluator for vision-language tasks.
- LLaVAR: Enhanced visual instruction tuning for text-rich image understanding.
- MMICL: Empowering vision-language model with multi-modal in-context learning. arXiv preprint arXiv:2309.07915.
- MQuAKE: Assessing knowledge editing in language models via multi-hop questions.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
- Universal and transferable adversarial attacks on aligned language models.