LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models (2306.09265v1)

Published 15 Jun 2023 in cs.CV and cs.AI

Abstract: Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite their great success, a holistic evaluation of their efficacy is still lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 8 representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform. The former evaluates 6 categories of multimodal capabilities of LVLMs such as visual question answering and embodied artificial intelligence on 47 standard text-related visual benchmarks, while the latter provides the user-level evaluation of LVLMs in an open-world question-answering scenario. The study reveals several innovative findings. First, an instruction-tuned LVLM with massive in-domain data, such as InstructBLIP, heavily overfits many existing tasks and generalizes poorly in the open-world scenario. Second, an instruction-tuned LVLM with moderate instruction-following data may suffer from object hallucination (i.e., it generates objects in its descriptions that are inconsistent with the target images). This either makes current evaluation metrics such as CIDEr for image captioning ineffective or produces wrong answers. Third, employing a multi-turn reasoning evaluation framework can mitigate the issue of object hallucination, shedding light on developing an effective pipeline for LVLM evaluation. The findings provide a foundational framework for the conception and assessment of innovative strategies aimed at enhancing zero-shot multimodal techniques. Our LVLM-eHub will be available at https://github.com/OpenGVLab/Multi-Modality-Arena

Summary

The paper introduces LVLM-eHub, a benchmark that evaluates eight LVLMs across six categories of multimodal capability, using both quantitative benchmarks and human feedback.
It measures model performance in visual perception, knowledge acquisition, reasoning, and embodied intelligence, revealing benefits of fine-tuning and issues like object hallucination.
The findings advocate for evolving evaluation methods beyond standard metrics and emphasize enhancing instruction tuning for more robust multimodal models.

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

The paper introduces LVLM-eHub, a benchmarking framework designed to systematically evaluate Large Vision-Language Models (LVLMs). LVLMs have made significant progress in integrating visual and textual data for diverse multimodal tasks, yet evaluations that cover their full range of capabilities remain scarce. The paper addresses this gap with LVLM-eHub, which combines quantitative benchmarking with qualitative human feedback.

The LVLM-eHub evaluates eight representative models, such as InstructBLIP and MiniGPT-4, focusing on six categories of capabilities: visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence. Evaluation is performed across 47 text-related visual benchmarks, offering a multifaceted understanding of LVLMs' strengths and challenges.

Key Findings

Visual Perception: LVLMs were assessed on tasks such as image classification, object counting, and multi-class identification. Results indicate that models like InstructBLIP, which have undergone extensive fine-tuning on domain-specific data, excel in these tasks, although they risk overfitting.
Visual Knowledge Acquisition: In tasks like OCR and image captioning, models utilizing large visual encoders and substantial instruction-tuning data, such as InstructBLIP, achieved superior performance, highlighting the impact of robust visual-textual alignment.
Visual Reasoning and Commonsense: On reasoning tasks, instruction-tuned models performed well when evaluated with a multi-turn reasoning framework, which the paper reports also mitigates object hallucination.
Object Hallucination: The paper identifies a tendency among LVLMs to describe objects that are inconsistent with the target image. Standard metrics such as CIDEr can fail to evaluate these outputs properly, which points to a need for better evaluation methods.
Embodied Intelligence: The evaluation covered embodied tasks requiring interactive environmental engagement. Models like LLaMA-Adapter V2 outperformed others due to comprehensive vision-language instruction.
Open-world Evaluation: The LVLM Arena component of LVLM-eHub enables human-feedback-driven evaluation, capturing LVLMs' performance in real-world scenarios. Models with extensive instruction-following data, such as mPLUG-Owl, ranked highly under this criterion.
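
The summary does not say how arena votes are turned into a ranking. One common choice in chatbot-style arenas is an Elo rating; the sketch below assumes that scheme. The K-factor of 32, the starting rating of 1000, the function names, and the model names are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple

# One battle: (model_a, model_b, outcome), where outcome is
# 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
Battle = Tuple[str, str, float]


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    battles: Iterable[Battle],
    k: float = 32.0,          # assumed K-factor, not from the paper
    initial: float = 1000.0,  # assumed starting rating, not from the paper
) -> Dict[str, float]:
    """Return new ratings after applying each human vote in order."""
    ratings = dict(ratings)  # copy so the caller's dict is not mutated
    for model_a, model_b, outcome in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ea = expected_score(ra, rb)
        # The two updates are equal and opposite, so the total is conserved.
        ratings[model_a] = ra + k * (outcome - ea)
        ratings[model_b] = rb + k * ((1.0 - outcome) - (1.0 - ea))
    return ratings


if __name__ == "__main__":
    # Placeholder votes between placeholder models.
    votes = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5)]
    for name, rating in sorted(update_elo({}, votes).items()):
        print(f"{name}: {rating:.1f}")
```

Because each update depends on the current ratings, the result is sensitive to the order of votes; a real arena would need to decide how to handle that.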

Implications and Future Directions

The LVLM-eHub framework provides a foundational platform for comparing LVLMs, offering insights that guide their development. The findings emphasize the vital role of diverse data and refined instruction tuning to enhance LVLMs' adaptability and generalization. The paper challenges traditional evaluation metrics like CIDEr, advocating for the development of more nuanced evaluation strategies.
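
To illustrate what a more object-level evaluation could look like, the sketch below scores yes/no answers to questions of the form "Is there a <object> in the image?". It is an assumption made for illustration, not the paper's stated protocol; the function names, the simple answer parser, and the record format are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# One record: (model_answer_text, object_is_actually_present)
Record = Tuple[str, bool]


def parse_yes_no(answer: str) -> bool:
    """Treat the answer as 'yes' only if its first word is 'yes'."""
    words = answer.strip().lower().replace(",", " ").replace(".", " ").split()
    return bool(words) and words[0] == "yes"


def hallucination_metrics(records: Iterable[Record]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1 and yes-ratio for yes/no probes."""
    tp = fp = tn = fn = 0
    for answer, present in records:
        said_yes = parse_yes_no(answer)
        if said_yes and present:
            tp += 1
        elif said_yes and not present:
            fp += 1  # the model affirmed an object that is not in the image
        elif present:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # A yes-ratio far above the true positive rate suggests the model
        # tends to affirm objects regardless of the image.
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


if __name__ == "__main__":
    # Placeholder answers, not real model outputs.
    demo = [
        ("Yes, there is a dog.", True),
        ("yes", False),
        ("No.", False),
        ("No, I do not see one.", True),
    ]
    print(hallucination_metrics(demo))
```

Unlike an n-gram overlap score, each false positive here corresponds directly to one hallucinated object, which makes the failure mode visible in the metric.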

In terms of future advancements, the paper posits that innovations in multi-turn reasoning techniques and more sophisticated human-centered evaluations can further elucidate LVLMs’ capabilities, particularly in open-ended tasks. Furthermore, expanding the scope of LVLM-eHub with newer models and tasks will progressively improve our understanding and benchmarking of LVLM efficacy.

In conclusion, LVLM-eHub represents a significant step toward comprehensively evaluating the rapidly evolving LVLM landscape. By integrating robust metric-driven assessments with qualitative evaluations, it provides an invaluable resource for researchers aiming to enhance multimodal machine learning technologies.

GitHub

OpenGVLab/Multi-Modality-Arena: Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
