
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models (2306.13394v4)

Published 23 Jun 2023 in cs.CV

Abstract: Multimodal LLM (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data application manner and online leaderboards are released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.


Summary

  • The paper introduces the MME benchmark, a novel framework evaluating multimodal LLMs across 14 distinct perceptual and cognitive subtasks.
  • The paper employs manually crafted test instructions to ensure unbiased performance evaluation and minimize prompt engineering influence.
  • The paper identifies challenges such as object hallucination and reasoning coherence, offering actionable insights for future multimodal model improvements.

MME: A Comprehensive Evaluation Benchmark for Multimodal LLMs

The introduction of the MME benchmark represents a significant advance in the quantitative evaluation of Multimodal LLMs (MLLMs). MLLMs leverage the capabilities of LLMs to perform multimodal tasks, integrating inputs from different modalities such as text and images to carry out complex reasoning and perception tasks. The MME benchmark is designed to provide a structured and unbiased assessment of these capabilities by evaluating both perception and cognition across 14 distinct subtasks.

Evaluation Framework and Instruction Design

MME measures the performance of MLLMs across a suite of tasks that include both low-level perceptual abilities (such as object recognition, counting, and color identification) and higher-level cognitive abilities (such as commonsense reasoning, numerical calculation, and code comprehension). To ensure fair evaluation and mitigate issues such as data leakage, all test instructions and answer pairs are manually crafted rather than sourced directly from existing datasets.

Figure 1: Diagram of our MME benchmark. It evaluates MLLMs from both perception and cognition, including a total of 14 subtasks.

The instructions are carefully designed to minimize the influence of prompt engineering, consisting of straightforward queries followed by a directive to respond with either "yes" or "no". This standardized format allows for straightforward scoring based on accuracy, along with a stricter metric, termed accuracy+, that requires the model to answer both of the two questions posed about an image correctly.
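To make the scoring concrete, here is a minimal sketch of how accuracy and accuracy+ could be computed for a single subtask. The function name and record schema are illustrative assumptions rather than the authors' released evaluation code; the sketch assumes each image contributes exactly two yes/no questions.

```python
from collections import defaultdict

def score_subtask(records):
    """Compute accuracy and the stricter accuracy+ for one subtask.

    `records` is an assumed schema: a list of dicts with keys 'image_id',
    'ground_truth' ('yes' or 'no'), and 'prediction' ('yes' or 'no'),
    where each image appears exactly twice (two questions per image).
    """
    per_image = defaultdict(list)
    for r in records:
        correct = r["prediction"].strip().lower() == r["ground_truth"].strip().lower()
        per_image[r["image_id"]].append(correct)

    total_questions = sum(len(answers) for answers in per_image.values())
    accuracy = sum(sum(answers) for answers in per_image.values()) / total_questions

    # accuracy+ credits an image only when both of its questions are correct.
    accuracy_plus = sum(all(answers) for answers in per_image.values()) / len(per_image)

    # MME reports a subtask score as the sum of the two percentages (max 200).
    return 100 * accuracy, 100 * accuracy_plus, 100 * (accuracy + accuracy_plus)
```

The per-subtask totals produced this way are then aggregated into the overall perception and cognition scores shown on the leaderboard.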

Perception Subtasks

In the field of perception, MME evaluates capabilities that include object existence, count, position, and color recognition, as well as more nuanced tasks such as celebrity, scene, and landmark recognition. These tasks test the models' capacity to identify and understand visual elements and their contextual significance.
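To illustrate what a perception query might look like, the snippet below sketches a hypothetical instruction pair for the existence subtask, using the record schema assumed in the scoring sketch above; the exact wording and object choices in the released data may differ.

```python
# Hypothetical instruction pair for one image in the "existence" subtask,
# following the record schema assumed in the scoring sketch above.
existence_pair = [
    {
        "image_id": "000001",
        "question": "Is there a dog in this image? Please answer yes or no.",
        "ground_truth": "yes",
    },
    {
        "image_id": "000001",
        "question": "Is there a bicycle in this image? Please answer yes or no.",
        "ground_truth": "no",
    },
]
```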

The leaderboard (Figure 2) illustrates the relative performance of various MLLMs on these subtasks, highlighting strengths such as the superior object existence recognition exhibited by models like Otter and Lynx. However, challenges remain, particularly in object positioning and in fine-grained recognition tasks such as scene and landmark identification.

Figure 2: Leaderboards on the MME benchmark: overall leaderboards for perception and cognition, together with subtask-level results that detail the accuracy and inconsistencies among models.

Cognition Subtasks

Cognitive subtasks in MME encompass more abstract reasoning abilities, where MLLMs translate visual understanding into logical conclusions. These include commonsense reasoning, numerical calculations, text translation from images, and code reasoning tasks that combine visual processing with retained knowledge.

GPT-4V demonstrates robust cognitive capabilities, consistently topping the charts in tasks requiring high-level reasoning, although there remains room for improvement in generalization beyond the predefined tasks. The challenges in cognition are accentuated by the complexity and varied nature of human reasoning, as well as by the difficulty of suppressing hallucinations learned during model training.

Common Challenges and Observations

The benchmark reveals several systemic issues across the evaluated MLLMs. These include the inability of some models to adhere to instructions, reflecting a lack of alignment between instruction-following and output generation (Figure 3). Another concern is inadequate perception, where basic visual recognition failures lead to incorrect higher-level reasoning.

Figure 3: Common problems revealed in experiments. [Y]/[N] means the ground truth answer is yes/no. [R] is the generated answer.
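Because the benchmark expects a literal "yes" or "no", an evaluation pipeline has to decide how to treat free-form responses like the [R] answers shown in Figure 3. The following is a minimal, illustrative normalization step, not the authors' exact post-processing; it assumes responses that cannot be mapped to either label are scored as incorrect.

```python
import re
from typing import Optional

def normalize_answer(generated: str) -> Optional[str]:
    """Map a free-form model response onto 'yes', 'no', or None.

    Illustrative only: this is not the authors' exact post-processing.
    Responses that cannot be mapped to either label (e.g. the model ignores
    the instruction, or hedges with both words) are treated as incorrect.
    """
    text = generated.strip().lower()
    # Match the labels as whole words so that "no" inside "not" or "nothing"
    # does not trigger a false match.
    has_yes = re.search(r"\byes\b", text) is not None
    has_no = re.search(r"\bno\b", text) is not None
    if has_yes and not has_no:
        return "yes"
    if has_no and not has_yes:
        return "no"
    return None
```

Treating unparseable responses as failures is consistent with the instruction-following issues discussed above: a model that cannot produce the requested yes/no verdict loses credit regardless of its underlying perception.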

Complex reasoning tasks further expose a deficiency in maintaining coherent logic chains, where models may identify individual facts correctly yet fail to synthesize them into the appropriate inferential structure. Additionally, object hallucination, in which a model asserts the presence of objects that do not appear in the image, continues to highlight the challenge of grounding multimodal outputs in the visual evidence actually provided.

Conclusion

The MME benchmark stands as a critical tool for systematizing the evaluation of MLLMs, highlighting existing deficiencies while guiding future advances. Its multifaceted evaluation process underscores significant gaps in current models, particularly in reasoning, alignment, and grounding. By addressing these challenges and extending the benchmark's capabilities, subsequent generations of MLLMs can be better trained and evaluated, enhancing both the practical applications and the theoretical understanding of multimodal AI systems. Future research can leverage these insights to build more robust models capable of handling a diverse array of real-world multimodal tasks with greater precision and reliability.
