
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

(arXiv:2406.14515)
Published Jun 20, 2024 in cs.CV and cs.MM

Abstract

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evaluation code of MMBench-Video will be integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.

Figure: Comparison of question type distribution across MSVD, MSRVTT, and MMBench-Video, highlighting balanced diversity in MMBench-Video.

Overview

  • The paper introduces MMBench-Video, a new benchmark designed to evaluate the video understanding capabilities of Large Vision-Language Models (LVLMs) by addressing limitations in existing Video Question Answering (VideoQA) benchmarks.

  • Key innovations of MMBench-Video include a comprehensive video dataset sourced from YouTube, enhanced QA pairs with a fine-grained capability taxonomy, and a robust evaluation framework using GPT-4 for scoring.

  • Evaluation results reveal performance disparities between proprietary and open-source LVLMs, highlight the importance of temporal and spatial understanding, and discuss the role of auxiliary data such as subtitles in improving model performance.

Analyzing MMBench-Video: A Benchmark for Comprehensive Video Understanding

The paper "MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding" presents a new benchmark explicitly designed to evaluate the video understanding capabilities of Large Vision-Language Models (LVLMs). The authors address significant limitations in existing Video Question Answering (VideoQA) benchmarks and propose MMBench-Video as a more rigorous and holistic benchmark. This essay will delve into the methodological innovations, evaluation results, and implications of MMBench-Video for future research in video understanding.

Key Innovations in MMBench-Video

MMBench-Video is marked by several crucial innovations that differentiate it from previous benchmarks:

  1. Comprehensive Video Dataset:

    • Source and Diversity: The benchmark is built from lengthy videos sourced from YouTube, covering 16 different categories such as News, Sports, and Knowledge, thereby mirroring real-world video consumption patterns.
    • Temporal Coverage: Videos included in MMBench-Video range from 30 seconds to 6 minutes, significantly longer than those in most existing benchmarks. This inclusion of long-form content is vital for assessing temporal reasoning capabilities.
  2. Enhanced Question-Answer (QA) Pairs:

    • Fine-grained Taxonomy: The benchmark employs a hierarchical capability taxonomy with 26 fine-grained abilities, spanning both perception and reasoning domains. This offers a nuanced evaluation of LVLMs.
    • Temporal Indispensability: Special emphasis is placed on formulating temporally indispensable questions that cannot be answered from a single frame, thus rigorously testing the models' temporal comprehension.
  3. Robust Evaluation Framework:

    • Automated Evaluation with GPT-4: The authors use GPT-4 to score model responses with a 3-grade marking scheme that prioritizes semantic similarity and aligns with human judgments, addressing shortcomings of earlier evaluation pipelines built on GPT-3.5 (a minimal sketch of such a judging pipeline follows this list).
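
To make the scoring scheme concrete, here is a minimal sketch of an LLM-judged grading step plus a per-ability aggregation. It assumes an OpenAI-style chat API, a 0-3 integer scale, and illustrative prompt wording and judge model name; none of these are the paper's exact setup, whose official evaluation code is the version integrated into VLMEvalKit.

```python
# A minimal sketch of LLM-judged scoring for free-form VideoQA answers.
# The prompt wording, 0-3 scale, and judge model name are assumptions for
# illustration; the official evaluation code is released in VLMEvalKit.
from collections import defaultdict

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a model's answer to a question about a video.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Give an integer score from 0 (completely wrong) to 3 (semantically equivalent
to the reference). Reply with the score only."""


def judge_answer(question: str, reference: str, prediction: str) -> int:
    """Ask the LLM judge to grade one prediction against the reference answer."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed judge model
        temperature=0.0,       # deterministic grading
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
    )
    text = resp.choices[0].message.content.strip()
    digits = [c for c in text if c.isdigit()]
    return int(digits[0]) if digits else 0  # conservative fallback on parse failure


def ability_means(records: list[dict]) -> dict[str, float]:
    """Average judged scores per fine-grained ability (each record carries
    'ability' and 'score' keys), mirroring taxonomy-level reporting."""
    totals, counts = defaultdict(float), defaultdict(int)
    for r in records:
        totals[r["ability"]] += r["score"]
        counts[r["ability"]] += 1
    return {a: totals[a] / counts[a] for a in totals}
```

Keeping the judge deterministic and asking for a bare integer makes the grades easy to parse and to aggregate across a fine-grained ability taxonomy.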

Evaluation Insights

The paper presents extensive evaluations of both proprietary and open-source LVLMs using MMBench-Video, yielding the following insights:

  1. Performance Disparities:

    • Video-LLMs vs. Image LVLMs: Surprisingly, existing open-source Video-LLMs lag behind image-based LVLMs such as Idefics2-8B and InternVL-Chat-v1.5 in temporal reasoning and overall video understanding, revealing a significant performance gap.
    • Proprietary LVLMs: Models like GPT-4o and Gemini-Pro demonstrate superior performance, notably surpassing open-source counterparts. For instance, GPT-4o achieves an overall score significantly higher than the best open-source Video-LLM.
  2. Temporal and Spatial Understanding:

    • Frame Input Influence: The number of input frames significantly impacts the performance of LVLMs. Proprietary models, when processing multiple frames, show marked improvement in both perception and reasoning tasks.
    • Hallucination Reduction: Hallucination remains a significant challenge for many models, indicating the need for improved training and grounding mechanisms to reduce fabricated content.
  3. Role of Auxiliary Data:

    • Incorporating Subtitles: The integration of YouTube-generated subtitles notably enhances model performance, particularly in reasoning tasks, by leveraging rich contextual information from speech (a sketch of frame sampling with optional subtitle context follows this list).
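
As a rough illustration of how such inputs might be prepared, the sketch below uniformly samples frames with OpenCV and prepends subtitle text to the question. The frame budget, decoding details, and prompt layout are assumptions for illustration, not the paper's exact preprocessing.

```python
# Minimal sketch: uniform frame sampling plus optional subtitle context.
# Frame count and prompt layout are illustrative assumptions.
import cv2  # pip install opencv-python


def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` RGB frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(num_frames - 1, 1)
    indices = [int(i * (total - 1) / step) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def build_prompt(question: str, subtitles: str | None = None) -> str:
    """Compose the text part of the query; subtitles are prepended when available."""
    parts = []
    if subtitles:
        parts.append(f"Video subtitles:\n{subtitles}")
    parts.append(f"Question: {question}\nAnswer concisely based on the video.")
    return "\n\n".join(parts)
```

Uniform sampling is the simplest policy; denser or shot-aware sampling would be a natural variant when probing temporal reasoning on long, multi-shot videos.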

Implications and Future Directions

The introduction of MMBench-Video has both practical and theoretical implications for video understanding research. Practically, the benchmark provides a valuable resource for the comprehensive evaluation of LVLMs, guiding the development of more capable and robust models. Theoretically, its detailed insights into the fine-grained capabilities and limitations of existing models offer a foundation for future advances.

Future Developments:

  • Enhanced Temporal Model Architectures: There is a clear need for developing models that can better integrate temporal information, perhaps through more sophisticated temporal fusion techniques or memory-augmented architectures.
  • Broader Dataset Inclusion: Expanding MMBench-Video to include even longer videos or more varied content types (e.g., documentaries, tutorials) could further its comprehensiveness.
  • Fine-tuning with Rich Contexts: Incorporating more contextual data, such as surrounding text or audio, could enhance the models' understanding of nuanced video content.

In conclusion, MMBench-Video represents a significant advancement in the benchmarking of video understanding capabilities in LVLMs. By addressing existing limitations and setting new standards for evaluation, it paves the way for the next generation of video comprehension models. Future research should build on the insights gleaned from MMBench-Video, focusing on creating more temporally aware and contextually enriched vision-language models.
