MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Published 4 Sep 2024 in cs.CL and cs.CV | (2409.02813v3)

Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.


Authors (13)

Citations (18)


Summary

The paper refines MMMU in three steps (text-only question filtering, option augmentation, and a vision-only input setting) to produce a harder multimodal evaluation.
The paper finds that model accuracy is 16.8% to 26.9% lower on MMMU-Pro than on MMMU, highlighting current limitations.
The paper points future research toward tighter visual-text integration in models and toward evaluation strategies that better reflect real-world inputs.

Comprehensive Evaluation of Multimodal Understanding: The Introduction of MMMU-Pro Benchmark

The paper presents MMMU-Pro, a more rigorous benchmark for evaluating how well multimodal LLMs (MLLMs) understand and reason across multiple disciplines. MMMU-Pro builds on the earlier Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark and addresses its limitations, chiefly that some questions could be answered from text alone or by exploiting the answer options.

Core Enhancements in MMMU-Pro

The authors apply three refinements to MMMU to obtain a more challenging evaluation tool:

Question Filtering: MMMU-Pro first removes questions that text-only models can answer correctly, so that the remaining questions cannot be solved from textual cues alone. Four state-of-the-art open-source LLMs serve as the filter.
Option Augmentation: The benchmark expands the multiple-choice options from four to ten. More candidates lower the chance of a correct guess and make it harder for models to exploit option-based shortcuts, forcing deeper engagement with the multimodal inputs.
Vision-only Input Setting: MMMU-Pro embeds each question within an image, so the model receives no separate text input. This simulates real-world scenarios where textual and visual information arrive together, and it requires the model to read and interpret the image at the same time.
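The first two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the toy results, and both thresholds (`min_correct_trials`, `max_models_answering`) are hypothetical, since the summary above says only that four text-only LLMs were used as a filter. The last function shows the random-guess baseline for a given number of options.

```python
from fractions import Fraction


def text_only_answerable(trial_results, min_correct_trials):
    """Return True if a text-only model got the question right often enough.

    trial_results: list of booleans, one per repeated text-only attempt.
    min_correct_trials: hypothetical threshold, chosen by the caller.
    """
    return sum(trial_results) >= min_correct_trials


def keep_question(results_by_model, min_correct_trials, max_models_answering):
    """Keep a question only if few enough text-only models can answer it.

    results_by_model: dict mapping a model name to its list of trial results.
    max_models_answering: hypothetical cap on how many models may succeed.
    """
    n_answering = sum(
        text_only_answerable(trials, min_correct_trials)
        for trials in results_by_model.values()
    )
    return n_answering <= max_models_answering


def random_guess_accuracy(num_options):
    """Expected accuracy of uniform random guessing over num_options choices."""
    return Fraction(1, num_options)


# Toy, made-up trial outcomes for one question and four placeholder models.
toy_results = {
    "llm_a": [True] * 8 + [False] * 2,
    "llm_b": [True] * 7 + [False] * 3,
    "llm_c": [True] * 6 + [False] * 4,
    "llm_d": [False] * 10,
}

if __name__ == "__main__":
    # Three of four models clear the threshold, so this question is dropped.
    print(keep_question(toy_results, min_correct_trials=6, max_models_answering=2))
    # Guessing baseline with four options versus ten options.
    print(float(random_guess_accuracy(4)), float(random_guess_accuracy(10)))
```

Under uniform guessing, going from four to ten options lowers the chance baseline from 1/4 to 1/10, which is one reason option augmentation makes shortcut strategies less rewarding.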

Findings and Performance Analysis

The experimental results show that models perform substantially worse on MMMU-Pro than on the original MMMU benchmark, with accuracy reductions ranging from 16.8% to 26.9% across models. This decline indicates that MMMU-Pro removes shortcuts that inflated scores on MMMU. Notably, prompting models to perform optical character recognition (OCR) explicitly has minimal effect on these outcomes, which suggests that the benchmark's difficulty lies beyond mere text extraction.
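A comparison of this kind can be computed as follows. This is a minimal sketch, assuming the reported reductions are absolute differences in accuracy between the two benchmarks; the predictions and gold answers below are toy placeholders, not results from the paper, and the function names are hypothetical.

```python
def accuracy(predictions, gold):
    """Percentage of predictions that match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def drop_in_points(acc_before, acc_after):
    """Absolute accuracy drop, in percentage points."""
    return acc_before - acc_after


# Toy placeholder data for a hypothetical model on two tiny question sets.
gold_original = ["A", "B", "C", "D"]
preds_original = ["A", "B", "C", "A"]  # 3 of 4 correct
gold_harder = ["A", "F", "J", "C"]
preds_harder = ["A", "F", "B", "D"]  # 2 of 4 correct

if __name__ == "__main__":
    before = accuracy(preds_original, gold_original)
    after = accuracy(preds_harder, gold_harder)
    print(before, after, drop_in_points(before, after))
```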

Chain of Thought (CoT) prompting generally improves performance, but the size of the gain varies across models: CoT aids reasoning, yet its effectiveness is model-dependent.
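To make the direct-versus-CoT comparison concrete, the sketch below builds both prompt variants for a ten-option question and parses the final choice from a model response. The prompt wording, the `Answer: <letter>` convention, and all function names are assumptions made for illustration; they are not the prompts used in the paper.

```python
import re

# Hypothetical instruction suffixes; the paper's exact prompts may differ.
DIRECT_SUFFIX = "Answer with the option letter only."
COT_SUFFIX = "Think step by step, then finish with 'Answer: <letter>'."

LETTERS = "ABCDEFGHIJ"  # up to ten options, matching the augmented setting

# Matches "Answer: C" or "Answer: (C)"; \b avoids matching words like "Because".
_ANSWER_RE = re.compile(r"Answer:\s*\(?([A-J])\b")


def build_prompt(question, options, use_cot):
    """Format a multiple-choice prompt in direct or CoT style."""
    if len(options) > len(LETTERS):
        raise ValueError("at most ten options are supported")
    lines = [question]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append(COT_SUFFIX if use_cot else DIRECT_SUFFIX)
    return "\n".join(lines)


def extract_choice(response):
    """Return the chosen letter, or None if no choice can be parsed."""
    matches = _ANSWER_RE.findall(response)
    if matches:
        return matches[-1]  # the last stated answer wins
    stripped = response.strip()
    if len(stripped) == 1 and stripped.upper() in LETTERS:
        return stripped.upper()  # bare letter, as in the direct style
    return None


if __name__ == "__main__":
    print(build_prompt("Which option is correct?", ["x", "y", "z"], use_cot=True))
    print(extract_choice("The diagram shows a loop, so Answer: C"))
```

Taking the last match rather than the first is a design choice: a CoT response may mention candidate answers while reasoning before committing to a final one.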

Implications and Future Research Directions

The findings from MMMU-Pro provide an important framework for evaluating and advancing AI systems’ multimodal understanding. Key implications include:

Model Development: Models need better integration of visual and textual data to handle the vision-only setting in MMMU-Pro, in particular how they perceive and reason about scenes where text and images appear together.
Evaluation Strategies: MMMU-Pro highlights the need for benchmarks that reflect the inputs users actually encounter, which in turn encourages models that handle diverse, integrated inputs.
Sophisticated Reasoning Capabilities: Given the observed performance drops, future research should strengthen multimodal reasoning so that models can handle intricate, nuanced, and contextually rich inputs.

Conclusion

MMMU-Pro raises the bar for evaluating multimodal AI systems by providing a harder benchmark that serves as a better proxy for real-world demands. Its design choices (text-only filtering, augmented options, and vision-only input) offer insights that could guide future multimodal models. MMMU-Pro also sets a precedent for building more demanding evaluations, which supports a deeper understanding of how well multimodal models approximate the human ability to combine visual and textual information.