
Abstract

In the quest for artificial general intelligence, Multi-modal LLMs (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work is distinguished from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audio, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. A total of 900 videos spanning 256 hours are manually selected and annotated by repeatedly viewing all of the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset, along with these findings, underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io

Figure: Video categories, durations, and question types, spanning 6 domains and 30 sub-classes for full-spectrum coverage.

Overview

  • The paper introduces Video-MME, a comprehensive evaluation benchmark for assessing Multi-Modal LLMs (MLLMs) in video analysis, addressing a significant gap in understanding these models' capabilities for processing sequential visual data.

  • Video-MME encompasses a wide diversity of video types, temporal dynamics, and data modalities, featuring 900 manually selected videos, annotated with 2,700 QA pairs to ensure reliability in model assessments.

  • Experimental results reveal that commercial models like Gemini 1.5 Pro significantly outperform open-source models, with accuracy improving further when subtitles and audio are incorporated; the findings point to the need for further research on handling long-form and multi-modal data.

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

The paper introduces Video-MME, a novel evaluation benchmark aimed at comprehensively assessing the performance of Multi-Modal LLMs (MLLMs) in video analysis. The need for this benchmark arises from the insufficient exploration and assessment of MLLMs' capabilities in processing sequential visual data, a gap that has restricted the understanding of these models' true potential in dynamic, real-world scenarios.

Key Features of Video-MME

Video-MME is distinguished from existing benchmarks through several critical features:

  1. Diversity in Video Types: The benchmark encompasses six primary visual domains and 30 subfields, namely Knowledge, Film & Television, Sports Competitions, Artistic Performances, Life Recordings, and Multilingual videos. This ensures broad scenario generalizability.
  2. Duration in Temporal Dimension: Videos range from 11 seconds to 1 hour, capturing short-, medium-, and long-term dynamics. This robust temporal diversity facilitates the evaluation of MLLMs' ability to understand varying contextual dynamics.
  3. Breadth in Data Modalities: Video-MME integrates multiple data modalities beyond video frames, including subtitles and audio, which enhances the evaluation's coverage of MLLMs' all-round capabilities.
  4. Quality in Annotations: The dataset includes 900 manually selected videos with 2,700 question-answer (QA) pairs annotated by expert annotators. This rigorous manual labeling facilitates precise and reliable model assessments; a sketch of what a single annotated entry might look like follows this list.
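The summary does not spell out the benchmark's file format, but one annotated entry can be pictured roughly as the structure below. All field names and values here are illustrative assumptions, not the benchmark's actual schema; only the counts (6 domains, 30 subfields, 900 videos, 2,700 QA pairs, i.e. 3 questions per video on average) come from the paper.

```python
# Hypothetical shape of a single Video-MME entry; field names and values are
# illustrative, not taken from the released dataset.
example_entry = {
    "video_id": "sample_0001",          # assumed identifier
    "domain": "Sports Competitions",    # one of the 6 primary visual domains
    "sub_category": "Basketball",       # one of the 30 subfields (example)
    "duration_type": "long",            # short / medium / long (11 s up to 1 hour)
    "subtitle_file": "sample_0001.srt", # subtitles are an additional, optional modality
    "questions": [                      # 3 QA pairs per video on average (2,700 / 900)
        {
            "question": "Which team scores the final basket of the game?",  # made-up example
            "options": ["A. ...", "B. ...", "C. ...", "D. ..."],
            "answer": "B",
        }
    ],
}
```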

Experimental Results

The experiments conducted with Video-MME provide a comprehensive evaluation of various state-of-the-art MLLMs, including both commercial (e.g., GPT-4 series, Gemini 1.5 Pro) and open-source models (e.g., InternVL-Chat-V1.5, LLaVA-NeXT-Video).
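The exact prompting and answer-parsing pipeline used by the authors is not described in this summary, but since every question is multiple-choice, scoring ultimately reduces to comparing a parsed option letter against the labeled answer. A minimal sketch of that step, assuming four options labeled A-D:

```python
import re

def parse_choice(model_output: str) -> str | None:
    """Extract the first standalone option letter (A-D) from a free-form model answer."""
    match = re.search(r"\b([ABCD])\b", model_output.strip().upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of QA pairs whose parsed option letter matches the labeled answer."""
    assert len(predictions) == len(ground_truth)
    correct = sum(parse_choice(p) == gt for p, gt in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Hypothetical model outputs for three QA pairs, compared against their labels.
print(accuracy(["The answer is B.", "C", "Probably (A)."], ["B", "C", "D"]))  # -> 0.666...
```

The per-model figures quoted below are accuracies of this kind, aggregated over the benchmark's questions and reported overall and by video duration.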

Performance of Commercial Models

  • Gemini 1.5 Pro demonstrated superior performance, achieving an average accuracy of 75.7% and significantly outperforming the best open-source model (LLaVA-NeXT-Video), which achieved 52.5%.
  • Adding subtitles and audio, as evaluated with Gemini 1.5 Pro, yielded substantial improvements in accuracy (up to +13.3% in some subcategories), particularly for longer videos and for tasks requiring domain knowledge.

Performance of Open-Source Models

  • Among the open-source models, LLaVA-NeXT-Video showed the best performance with an overall accuracy of 52.5%, indicating a considerable gap between commercial and open-source models.
  • Image-based models such as Qwen-VL-Max and InternVL-Chat-V1.5 achieved accuracies comparable to those of video-specific models, highlighting the importance of robust image understanding as a foundation for video analysis.

Implications and Future Directions

The results using Video-MME reveal several critical insights into the current state of MLLMs and their future development:

  1. Temporal Dynamics and Long Context Modeling: Both commercial and open-source models show a decline in performance as video length increases, indicating challenges in long-context understanding. Future research should focus on architectural innovations, such as temporal Q-Formers and context-extension techniques, to better handle long-range dependencies in video data.
  2. Subtitles and Auditory Information: The incorporation of subtitles and audio tracks significantly enhances video understanding, underscoring the importance of multi-modal data. Developing models that can seamlessly integrate these additional modalities will be crucial for improving comprehension in complex, real-world scenarios (see the sketch after this list for one way subtitles and sampled frames might be combined into a single prompt).
  3. Diverse and High-Quality Datasets: Building high-quality, diverse datasets with complex temporal reasoning tasks is essential. This will require novel approaches to data collection and annotation, potentially including human-in-the-loop frameworks and automatic data synthesis methods to address the long-tailed nature of multi-modal video data.
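To make directions (1) and (2) concrete, the sketch below shows one simple way a long video and its subtitles could be fed to an MLLM: uniformly sampling a fixed frame budget and appending the subtitle text to the question prompt. The frame budget, prompt wording, and use of OpenCV are assumptions for illustration, not the authors' pipeline.

```python
import cv2  # OpenCV is assumed here purely for frame extraction

def sample_frames(video_path: str, num_frames: int = 16):
    """Uniformly sample a fixed budget of frames across the full video duration."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total - 1, 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def build_prompt(question: str, options: list[str], subtitles: str | None = None) -> str:
    """Assemble the text side of the input; subtitle text is folded in when available."""
    parts = ["These frames are uniformly sampled from a video."]
    if subtitles:
        parts.append("Subtitles:\n" + subtitles)
    parts.append("Question: " + question)
    parts.append("Options:\n" + "\n".join(options))
    parts.append("Answer with the option letter only.")
    return "\n\n".join(parts)
```

A fixed frame budget is exactly where long videos become difficult: 16 frames spread over an hour miss far more content than 16 frames spread over 11 seconds, which is consistent with the performance drop on longer videos reported above.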

Conclusion

Video-MME represents a significant step forward in the evaluation of MLLMs for video analysis, providing a robust benchmark that addresses the limitations of existing benchmarks through its comprehensive scope. By revealing critical areas for improvement and highlighting the importance of multi-modal data integration, Video-MME sets the stage for future advancements in the development and evaluation of MLLMs. This benchmark is expected to inspire future research aimed at achieving more capable and robust multi-modal models, furthering the progress towards more sophisticated and nuanced video understanding capabilities.
