MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation (2407.00468v2)

Published 29 Jun 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, LLMs without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises $2,138$ question triplets, totaling $6,414$ distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by $31.73\%$, compared to an average gap of $8.03\%$ in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by $23.09\%$, whereas the gap for previous benchmarks is just $14.64\%$). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.

Summary

  • The paper introduces MMEvalPro, a novel benchmark that integrates perception and knowledge questions to reveal true multimodal capabilities.
  • It employs a Genuine Accuracy metric by testing models across three components, exposing significant performance differences between LMMs and LLMs.
  • The rigorous annotation pipeline, involving multiple experts, enhances data quality and reliability for advancing multimodal AI research.

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Introduction

The evaluation of Large Multimodal Models (LMMs) has come under increasing scrutiny because existing benchmarks fall short of genuinely reflecting these systems' abilities. Traditional benchmarks often employ multiple-choice questions (MCQs) with visual components, yet they do not adequately distinguish between models with and without visual understanding. The paper introduces MMEvalPro, a novel benchmark designed to reveal the true capabilities of LMMs through a more rigorous evaluation pipeline that incorporates perception and knowledge questions (Figure 1).

Figure 1: Performance comparison of LLMs and LMMs on original multimodal benchmarks versus MMEvalPro. The performance gap between LLMs and LMMs is much clearer on MMEvalPro.

Probing the Credibility of Multimodal Benchmarks

Existing benchmarks are critiqued for allowing models that lack visual understanding to perform comparably to those that possess it. Initial experiments reveal that the gap between LLMs and LMMs on standard benchmarks is not as large as anticipated, indicating that these evaluations do not fully capture multimodal capability. The critical issues are data leakage and the ability of LLMs to guess answers without processing visual content.

The "Answer Consistency Test" illustrates that models often provide correct answers without true comprehension by failing on subsequent perception and knowledge checks that should conceptually precede or support the MCQ answer. The result is a prevalent Type-I error, where benchmarks inaccurately gauge genuine understanding. Figure 2

Figure 2: Examples from different splits in the MMEvalPro dataset.
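To make this check concrete, here is a minimal sketch (in Python; the function and its aligned boolean inputs are illustrative assumptions, not the paper's released code) of one way to quantify the inconsistency: among triplets whose original MCQ was answered correctly, count how often the accompanying perception or knowledge question is still missed.

```python
from typing import Sequence

def answer_inconsistency_rate(
    origin_correct: Sequence[bool],
    perception_correct: Sequence[bool],
    knowledge_correct: Sequence[bool],
) -> float:
    """Among triplets whose original MCQ was answered correctly, return the
    fraction where the perception or knowledge check still fails.
    Inputs are aligned per-triplet correctness flags (assumed schema)."""
    answered = [
        (p, k)
        for o, p, k in zip(origin_correct, perception_correct, knowledge_correct)
        if o
    ]
    if not answered:
        return 0.0
    inconsistent = sum(not (p and k) for p, k in answered)
    return inconsistent / len(answered)
```

A high rate means many nominally correct answers are not backed by the perception and knowledge they should rest on, which is exactly the Type-I error described above.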

MMEvalPro: A Comprehensive Benchmark

MMEvalPro addresses these deficiencies by implementing a trio of evaluations for each MCQ: an original question, a perception question about the visual data, and a knowledge anchor question requiring subject matter understanding. This trilogy aims to gauge the model's comprehensive capabilities. Genuine Accuracy (GA) is introduced as a primary metric, requiring correct answers across all triplet components.
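To illustrate how Genuine Accuracy differs from ordinary accuracy, the sketch below (Python; the TripletResult fields are an assumed schema, not the paper's released format) scores a triplet as correct only when all three of its questions are answered correctly.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TripletResult:
    """Per-triplet correctness flags (illustrative schema)."""
    origin_correct: bool      # original multiple-choice question
    perception_correct: bool  # perception question about the image
    knowledge_correct: bool   # knowledge anchor question

def genuine_accuracy(results: List[TripletResult]) -> float:
    """Genuine Accuracy: a triplet counts only if all three answers are right."""
    if not results:
        return 0.0
    hits = sum(
        r.origin_correct and r.perception_correct and r.knowledge_correct
        for r in results
    )
    return hits / len(results)
```

Because all three answers must be correct, GA can never exceed accuracy on the original questions alone, which is what makes it a stricter test of genuine multimodal understanding.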

A meticulously designed annotation pipeline safeguards data quality, involving multiple annotators for redundancy and validation by domain experts. It ensures that each question genuinely requires both the visual and the knowledge components critical to comprehension (Figure 3).

Figure 3: Annotation pipeline for MMEvalPro.

Experimental Evaluation

In experiments with MMEvalPro, the performance of LLMs and of some LMMs drops significantly compared to existing benchmarks. This drop highlights the challenge posed by the GA requirement and underscores the more rigorous evaluation MMEvalPro provides. Notably, even the best LMMs fall behind the human baseline, demonstrating the benchmark's rigor (Figure 4).

Figure 4: Heatmaps of conditional accuracy on MMEvalPro.

Fine-grained Analysis

Further analysis with metrics such as the Consistency Gap (CG), Perception Accuracy (PA), and Knowledge Accuracy (KA) illustrates the underlying issues behind the performance disparities. LMMs exhibit higher PA than KA, as expected given their visual processing capabilities, in contrast to LLMs, which lack such features.
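For concreteness, here is a sketch of how these fine-grained metrics could be computed (Python; the aligned boolean inputs are illustrative). It assumes CG is the gap between accuracy on the original questions and Genuine Accuracy, while PA and KA are plain accuracies on the perception and knowledge anchor questions.

```python
from typing import Dict, Sequence

def fine_grained_metrics(
    origin_correct: Sequence[bool],
    perception_correct: Sequence[bool],
    knowledge_correct: Sequence[bool],
) -> Dict[str, float]:
    """PA, KA, and CG over aligned per-triplet correctness flags.
    Assumes CG = accuracy on the original questions minus Genuine Accuracy."""
    n = len(origin_correct)
    if n == 0:
        return {"PA": 0.0, "KA": 0.0, "CG": 0.0}
    triplets = list(zip(origin_correct, perception_correct, knowledge_correct))
    genuine = sum(o and p and k for o, p, k in triplets) / n
    return {
        "PA": sum(perception_correct) / n,        # perception accuracy
        "KA": sum(knowledge_correct) / n,         # knowledge accuracy
        "CG": sum(origin_correct) / n - genuine,  # consistency gap
    }
```

A large CG signals that a model's headline accuracy overstates its genuine understanding, while the PA/KA split separates perception failures from knowledge failures.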

Case studies further delineate these inconsistencies: on complex visual and conceptual reasoning tasks, gaps appear in models' reasoning when images are misinterpreted or treated as irrelevant (Figure 5).

Figure 5: Case study of the answer inconsistency problem of LMMs.

Conclusion

MMEvalPro offers a robust and nuanced framework for evaluating multimodal models, exposing the limitations of current systems while providing a path to improve their assessment. Its adoption could lead to more accurate benchmarks that reflect true model capabilities across modalities, fostering further advancements in multimodal AI research. As models evolve, benchmarks like MMEvalPro will continue to serve a critical role in assessing technological progress and guiding development priorities.
