How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs (2311.16101v1)
Abstract: This work focuses on the potential of Vision LLMs (VLLMs) in visual reasoning. Unlike prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (OOD) generalization and adversarial robustness. For the OOD evaluation, we present two novel VQA datasets, each with one variant, designed to test model performance under challenging conditions. In exploring adversarial robustness, we propose a straightforward attack strategy for misleading VLLMs into producing responses unrelated to the visual input. Moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or the language component of VLLMs. Our evaluation of 21 diverse models, ranging from open-source VLLMs to GPT-4V, yields interesting observations: 1) current VLLMs struggle with OOD texts but not OOD images, unless the visual information is limited; and 2) these VLLMs can be easily misled by deceiving their vision encoders alone, and their vision-language training often compromises safety protocols. We release this safety evaluation suite at https://github.com/UCSC-VLAA/vLLM-safety-benchmark.
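To make the "deceiving the vision encoder only" idea concrete, below is a minimal, hypothetical sketch of one way such an attack could look: an L-infinity PGD perturbation that pushes a clean image's CLIP embedding toward that of an unrelated target image, so a VLLM built on that encoder may describe content that is not in the picture. The model checkpoint, perturbation budget, step count, and file names (`clean.png`, `target.png`) are illustrative assumptions, not the paper's actual method or settings.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Hypothetical sketch: L_inf PGD on a frozen CLIP vision encoder, pushing a clean
# image's embedding toward that of an unrelated target image. All hyperparameters
# and file names are illustrative, not the paper's settings.
model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
for p in model.parameters():
    p.requires_grad_(False)

def embed(pixel_values):
    # Unit-normalized image embedding from the projection head.
    return F.normalize(model(pixel_values=pixel_values).image_embeds, dim=-1)

clean = processor(images=Image.open("clean.png"), return_tensors="pt").pixel_values
target = processor(images=Image.open("target.png"), return_tensors="pt").pixel_values
target_emb = embed(target)

# Note: the budget below lives in the processor-normalized space, a simplification
# of a raw-pixel L_inf constraint.
eps, alpha, steps = 8 / 255, 1 / 255, 100
delta = torch.zeros_like(clean, requires_grad=True)
for _ in range(steps):
    loss = F.cosine_similarity(embed(clean + delta), target_emb).mean()
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()  # ascend on similarity to the target embedding
        delta.clamp_(-eps, eps)             # project back into the L_inf ball
        delta.grad.zero_()

adversarial = (clean + delta).detach()  # pixel_values to hand to the VLLM's vision tower
```

In this sketch, the perturbed tensor would be paired with an ordinary benign prompt; if the downstream VLLM then describes the target image's content, the vision encoder alone was enough to mislead the whole pipeline, consistent with the observation above.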