
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (2311.16502v4)

Published 27 Nov 2023 in cs.CL, cs.AI, and cs.CV

Abstract: We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. The evaluation of 14 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

Citations (466)

Summary

  • The paper introduces the MMMU benchmark, a comprehensive test with 11.5K multimodal questions spanning 183 subfields to evaluate expert-level AGI reasoning.
  • It employs diverse image types integrated with textual queries to challenge models, revealing distinct strengths and deficiencies across six academic disciplines.
  • Findings highlight that even advanced models like GPT-4V score only 56% accuracy, emphasizing areas for improvement in perception, domain knowledge, and reasoning.

The Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark represents a significant advance in measuring how far current multimodal models are from expert-level AGI in understanding and reasoning. MMMU challenges models with comprehensive, difficult tasks reminiscent of college-level examinations, spanning six broad disciplines, 30 college subjects, and 183 subfields. The disciplines are Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering.

What sets MMMU apart is its extensive library of 11.5K questions meticulously collected from college exams, quizzes, and textbooks. The inclusion of 30 highly diverse image types, ranging from engineering blueprints to histological slides, reflects the benchmark's emphasis on deep multimodal engagement. Images are interleaved with text, requiring models to analyze and reason over a combination of visual and textual cues grounded in domain expertise.
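For readers who want to inspect the data directly, the minimal sketch below shows one way to load a subject-level subset. It assumes the public release on the Hugging Face Hub under the `MMMU/MMMU` dataset id, with one configuration per subject and field names such as `question`, `options`, `answer`, and `image_1`; these identifiers are assumptions about the release rather than details stated in this summary.

```python
# Minimal sketch of inspecting MMMU, assuming it is hosted on the Hugging Face
# Hub as "MMMU/MMMU" with one configuration per subject (field names below are
# assumptions about the public release).
from datasets import load_dataset

# Each of the 30 subjects (e.g., "Accounting", "Art", "Biology") is a config;
# splits are typically dev / validation / test, with test answers withheld.
subset = load_dataset("MMMU/MMMU", "Accounting", split="validation")

example = subset[0]
print(example["question"])   # interleaved text; may reference <image 1> tokens
print(example["options"])    # multiple-choice options
print(example["answer"])     # gold option letter, e.g. "A"
example["image_1"].save("q0_image.png")  # image attached to the question
```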

MMMU's goal is to test whether models can achieve expert-level perception and advanced reasoning. To that end, it evaluates state-of-the-art LLMs and Large Multimodal Models (LMMs), including open-source models such as OpenFlamingo and proprietary systems such as GPT-4V and Gemini. The results reveal significant challenges even for the most advanced models: GPT-4V achieves only 56% accuracy. A closer examination of 150 mispredicted cases from GPT-4V shows that 35% of errors can be attributed to perceptual issues, 29% to a lack of domain knowledge, and 26% to flaws in the reasoning process, signaling specific areas for further research and model enhancement.
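To make the headline numbers concrete, the sketch below shows the kind of exact-match scoring and manual error tallying that yields an accuracy figure and an error breakdown like the one above. The prediction and annotation lists are hypothetical placeholders, not the paper's data.

```python
# Hedged sketch of the scoring mechanics: exact-match accuracy over option
# letters, plus a tally of manually annotated error categories.
from collections import Counter

def accuracy(predictions, answers):
    """Fraction of questions where the predicted option letter equals the gold letter."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

preds = ["A", "C", "B", "D", "A"]   # hypothetical model outputs
golds = ["A", "B", "B", "D", "C"]   # hypothetical gold answers
print(f"accuracy = {accuracy(preds, golds):.0%}")  # 60% on this toy set

# One annotation per mispredicted case, as in the paper's manual error analysis.
error_labels = ["perception", "knowledge", "perception", "reasoning"]  # hypothetical
for category, n in Counter(error_labels).most_common():
    print(f"{category}: {n} ({n / len(error_labels):.0%})")
```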

Furthermore, the benchmark reveals uneven performance across disciplines. In fields such as Art & Design and Humanities & Social Science, where the visual inputs are comparatively manageable, models perform better than in more complex domains such as Business, Science, Health & Medicine, and Tech & Engineering. Unsurprisingly, open-source models lag behind proprietary models like GPT-4V, highlighting a persistent gap in multimodal understanding capabilities.

MMMU is not a definitive test for Expert AGI, as it does not yet encompass the range of tasks an AGI should handle, and it does not directly measure performance against the 90th percentile of skilled adults. Nevertheless, it serves as a crucial component in evaluating a model’s proficiency in domain-specific knowledge and expert-level reasoning and understanding.

In conclusion, MMMU is a robust and challenging multimodal benchmark that pushes the boundaries of multimodal foundation models. Its approach of integrating diverse image types with text in domain-specific questions sets a new precedent for AGI evaluation and should propel the AI community toward more capable next-generation models.
