Red Teaming Visual Language Models (2401.12915v1)
Abstract: VLMs (Vision-Language Models) extend the capabilities of LLMs to accept multimodal inputs. It has been shown that LLMs can be induced to generate harmful or inaccurate content through specific test cases (a practice termed red teaming); how VLMs behave in similar scenarios, especially given their combination of textual and visual inputs, remains an open question. To explore this problem, we present RTVLM, a novel red-teaming dataset that encompasses 10 subtasks (e.g., image misleading, multimodal jailbreaking, face fairness) under 4 primary aspects (faithfulness, privacy, safety, fairness). RTVLM is the first red-teaming dataset to benchmark current VLMs along these 4 aspects. Detailed analysis shows that 10 prominent open-sourced VLMs struggle with red teaming to varying degrees, with performance gaps of up to 31% relative to GPT-4V. Additionally, we apply red-teaming alignment to LLaVA-v1.5 through supervised fine-tuning (SFT) on RTVLM, which improves performance by 10% on the RTVLM test set and 13% on MM-Hal without a noticeable decline on MM-Bench, surpassing other LLaVA-based models trained with regular alignment data. This indicates that current open-sourced VLMs still lack red-teaming alignment. Our code and datasets will be open-sourced.
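As a concrete illustration of the benchmark structure described above, the following is a minimal sketch of how RTVLM-style examples might be organized and scored per aspect. The field names, subtask labels, and the judge scale are illustrative assumptions, not the paper's released schema.

```python
# Minimal sketch of organizing RTVLM-style red-teaming examples and averaging
# judge scores per aspect. Field names, subtask labels, and the score scale
# are assumptions for illustration, not the dataset's released schema.
from dataclasses import dataclass
from collections import defaultdict
from statistics import mean

@dataclass
class RedTeamExample:
    aspect: str         # one of: "faithfulness", "privacy", "safety", "fairness"
    subtask: str        # e.g. "image_misleading", "multimodal_jailbreak" (assumed labels)
    image_path: str     # test image paired with the prompt
    prompt: str         # adversarial text instruction
    judge_score: float  # score from an external judge model (assumed 1-10 scale)

def aggregate_by_aspect(examples: list[RedTeamExample]) -> dict[str, float]:
    """Average judge scores per aspect, roughly how per-aspect results could be reported."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for ex in examples:
        buckets[ex.aspect].append(ex.judge_score)
    return {aspect: mean(scores) for aspect, scores in buckets.items()}

if __name__ == "__main__":
    demo = [
        RedTeamExample("safety", "multimodal_jailbreak", "images/0001.png",
                       "<adversarial instruction paired with the image>", 3.0),
        RedTeamExample("fairness", "face_fairness", "images/0002.png",
                       "<occupation question about the pictured face>", 7.5),
    ]
    print(aggregate_by_aspect(demo))  # e.g. {'safety': 3.0, 'fairness': 7.5}
```

A structure like this could also feed an SFT pipeline for the alignment experiment sketched in the abstract, with the judge score replaced by a reference response used as the training target.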
- Flamingo: A visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
- OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
- Introducing our multimodal models.
- The secret sharer: Evaluating and testing unintended memorization in neural networks. In Proceedings of the 28th USENIX Conference on Security Symposium, pages 267–284. USENIX Association.
- ShareGPT4V: Improving large multi-modal models with better captions.
- PaLI-X: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565.
- Can language models be instructed to protect personal information?
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500.
- A survey for in-context learning.
- Bias and fairness in large language models: A survey.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
- ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
- LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.
- Reducing sentiment bias in language models via counterfactual evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 65–83.
- A hierarchical approach for generating descriptive image paragraphs.
- OBELICS: An open web-scale filtered dataset of interleaved image-text documents.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
- Silkie: Preference distillation for large visual language models.
- M³IT: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387.
- TruthfulQA: Measuring how models mimic human falsehoods.
- Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565.
- Improved baselines with visual instruction tuning.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485.
- Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
- Stable bias: Analyzing societal representations in diffusion models.
- OpenAI. 2023. GPT-4V(ision) system card.
- Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277.
- Red teaming language models with language models. arXiv preprint arXiv:2202.03286.
- True few-shot learning with language models. arXiv.
- Visual adversarial examples jailbreak aligned large language models. In The Second Workshop on New Frontiers in Adversarial Machine Learning.
- Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
- Aligning large multimodal models with factually augmented RLHF.
- Knowledge mining with scene text for fine-grained recognition. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4614–4623.
- Large language models are not fair evaluators.
- Self-instruct: Aligning language models with self-generated instructions.
- GPT-4V(ision) as a generalist evaluator for vision-language tasks.
- LLaVAR: Enhanced visual instruction tuning for text-rich image understanding.
- MMICL: Empowering vision-language model with multi-modal in-context learning. arXiv preprint arXiv:2309.07915.
- MQuAKE: Assessing knowledge editing in language models via multi-hop questions.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
- Universal and transferable adversarial attacks on aligned language models.