
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models (2306.13394v4)

Published 23 Jun 2023 in cs.CV

Abstract: Multimodal LLM (MLLM) relies on the powerful LLM to perform multimodal tasks, showing amazing emergent abilities in recent studies, such as writing poems based on an image. However, it is difficult for these case studies to fully reflect the performance of MLLM, lacking a comprehensive evaluation. In this paper, we fill in this blank, presenting the first comprehensive MLLM Evaluation benchmark MME. It measures both perception and cognition abilities on a total of 14 subtasks. In order to avoid data leakage that may arise from direct use of public datasets for evaluation, the annotations of instruction-answer pairs are all manually designed. The concise instruction design allows us to fairly compare MLLMs, instead of struggling in prompt engineering. Besides, with such an instruction, we can also easily carry out quantitative statistics. A total of 30 advanced MLLMs are comprehensively evaluated on our MME, which not only suggests that existing MLLMs still have a large room for improvement, but also reveals the potential directions for the subsequent model optimization. The data application manner and online leaderboards are released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.


Summary

  • The paper introduces the MME benchmark, a novel framework evaluating multimodal LLMs across 14 distinct perceptual and cognitive subtasks.
  • The paper employs manually crafted test instructions to ensure unbiased performance evaluation and minimize prompt engineering influence.
  • The paper identifies challenges such as object hallucination and reasoning coherence, offering actionable insights for future multimodal model improvements.

MME: A Comprehensive Evaluation Benchmark for Multimodal LLMs

The introduction of the MME benchmark represents a significant advance in the quantitative evaluation of Multimodal LLMs (MLLMs). MLLMs leverage the capabilities of LLMs to perform multimodal tasks, integrating inputs from different modalities such as text and images to carry out complex reasoning and perception tasks. The MME benchmark is designed to provide a structured and unbiased assessment of these capabilities by evaluating both perception and cognition across 14 distinct subtasks.

Evaluation Framework and Instruction Design

MME measures the performance of MLLMs across a suite of tasks that include both low-level perceptual abilities (such as object recognition, counting, and color identification) and higher-level cognitive abilities (such as commonsense reasoning, numerical calculation, and code comprehension). To ensure fair evaluation and mitigate issues such as data leakage, all test instructions and answer pairs are manually crafted rather than sourced directly from existing datasets.

Figure 1: Diagram of our MME benchmark. It evaluates MLLMs from both perception and cognition, including a total of 14 subtasks.

The instructions are carefully designed to minimize the influence of prompt engineering, consisting of straightforward queries followed by a directive to respond with either "yes" or "no". This standardized format allows for straightforward scoring based on accuracy, along with a stricter metric, termed accuracy+, that requires the model to answer both of the two questions posed about an image correctly.
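To make the scoring concrete, here is a minimal sketch of how accuracy and accuracy+ could be computed for a single subtask. The function name and record schema are illustrative assumptions rather than the authors' released evaluation code; the sketch assumes each image contributes exactly two yes/no questions.

```python
from collections import defaultdict

def score_subtask(records):
    """Compute accuracy and the stricter accuracy+ for one subtask.

    `records` is an assumed schema: a list of dicts with keys 'image_id',
    'ground_truth' ('yes' or 'no'), and 'prediction' ('yes' or 'no'),
    where each image appears exactly twice (two questions per image).
    """
    per_image = defaultdict(list)
    for r in records:
        correct = r["prediction"].strip().lower() == r["ground_truth"].strip().lower()
        per_image[r["image_id"]].append(correct)

    total_questions = sum(len(answers) for answers in per_image.values())
    accuracy = sum(sum(answers) for answers in per_image.values()) / total_questions

    # accuracy+ credits an image only when both of its questions are correct.
    accuracy_plus = sum(all(answers) for answers in per_image.values()) / len(per_image)

    # MME reports a subtask score as the sum of the two percentages (max 200).
    return 100 * accuracy, 100 * accuracy_plus, 100 * (accuracy + accuracy_plus)
```

The per-subtask totals produced this way are then aggregated into the overall perception and cognition scores shown on the leaderboard.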

Perception Subtasks

In the field of perception, MME evaluates capabilities that include object existence, count, position, and color recognition, as well as more nuanced tasks such as celebrity, scene, and landmark recognition. These tasks test the models' capacity to identify and understand visual elements and their contextual significance.
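To illustrate what a perception query might look like, the snippet below sketches a hypothetical instruction pair for the existence subtask, using the record schema assumed in the scoring sketch above; the exact wording and object choices in the released data may differ.

```python
# Hypothetical instruction pair for one image in the "existence" subtask,
# following the record schema assumed in the scoring sketch above.
existence_pair = [
    {
        "image_id": "000001",
        "question": "Is there a dog in this image? Please answer yes or no.",
        "ground_truth": "yes",
    },
    {
        "image_id": "000001",
        "question": "Is there a bicycle in this image? Please answer yes or no.",
        "ground_truth": "no",
    },
]
```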

The leaderboard (Figure 2) illustrates the relative performance of various MLLMs on these subtasks, highlighting strengths such as the superior object existence recognition exhibited by models like Otter and Lynx. However, challenges remain, particularly in object positioning and in fine-grained recognition tasks such as scene and landmark identification.

Figure 2: Leaderboards on the MME benchmark: overall leaderboards for perception and cognition, together with subtask-level results that detail the accuracy and inconsistencies among models.

Cognition Subtasks

Cognitive subtasks in MME encompass more abstract reasoning abilities, where MLLMs translate visual understanding into logical conclusions. These include commonsense reasoning, numerical calculations, text translation from images, and code reasoning tasks that combine visual processing with retained knowledge.

GPT-4V demonstrates robust cognitive capabilities, consistently topping the charts in tasks requiring high-level reasoning, although there remains room for improvement in generalization beyond the predefined tasks. The challenges in cognition are accentuated by the complexity and varied nature of human reasoning, as well as by the difficulty of suppressing hallucinations learned during model training.

Common Challenges and Observations

The benchmark reveals several systemic issues across the evaluated MLLMs. These include the inability of some models to adhere to instructions, reflecting a lack of alignment between instruction-following and output generation (Figure 3). Another concern is inadequate perception, where basic visual recognition failures lead to incorrect higher-level reasoning.

Figure 3: Common problems revealed in experiments. [Y]/[N] means the ground truth answer is yes/no. [R] is the generated answer.
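Because the benchmark expects a literal "yes" or "no", an evaluation pipeline has to decide how to treat free-form responses like the [R] answers shown in Figure 3. The following is a minimal, illustrative normalization step, not the authors' exact post-processing; it assumes responses that cannot be mapped to either label are scored as incorrect.

```python
import re
from typing import Optional

def normalize_answer(generated: str) -> Optional[str]:
    """Map a free-form model response onto 'yes', 'no', or None.

    Illustrative only: this is not the authors' exact post-processing.
    Responses that cannot be mapped to either label (e.g. the model ignores
    the instruction, or hedges with both words) are treated as incorrect.
    """
    text = generated.strip().lower()
    # Match the labels as whole words so that "no" inside "not" or "nothing"
    # does not trigger a false match.
    has_yes = re.search(r"\byes\b", text) is not None
    has_no = re.search(r"\bno\b", text) is not None
    if has_yes and not has_no:
        return "yes"
    if has_no and not has_yes:
        return "no"
    return None
```

Treating unparseable responses as failures is consistent with the instruction-following issues discussed above: a model that cannot produce the requested yes/no verdict loses credit regardless of its underlying perception.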

Complex reasoning tasks further expose a deficiency in maintaining coherent logic chains, where models may identify individual facts correctly yet fail to synthesize them into the appropriate inferential structure. Additionally, object hallucination, in which a model asserts the presence of objects that do not appear in the image, continues to highlight the challenge of grounding multimodal outputs in the visual evidence actually provided.

Conclusion

The MME benchmark stands as a critical tool for systematizing the evaluation of MLLMs, highlighting existing deficiencies while guiding future advances. Its multifaceted evaluation process underscores significant gaps in current models, particularly in reasoning, alignment, and grounding. By addressing these challenges and extending the benchmark's capabilities, subsequent generations of MLLMs can be better trained and evaluated, enhancing both the practical applications and the theoretical understanding of multimodal AI systems. Future research can leverage these insights to build more robust models capable of handling a diverse array of real-world multimodal tasks with greater precision and reliability.
