
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

(2401.11944)
Published Jan 22, 2024 in cs.CL, cs.AI, and cs.CV

Abstract

As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LLMs and one proprietary GPT-4V(ision). Even GPT-4V only achieves accuracies of 42%, indicating a large space for improvement. CMMMU will boost the community to build the next-generation LMMs towards expert artificial intelligence and promote the democratization of LMMs by providing diverse language contexts.

Figure: examples of complex media (music, chemistry, circuits, etc.) that require expert-level understanding.
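To make the benchmark's accuracy numbers concrete, the following is a minimal sketch of how multiple-choice predictions could be scored against CMMMU-style records, including a per-discipline breakdown. The field names (discipline, answer, prediction) and the toy records are illustrative assumptions rather than the benchmark's actual schema or results.

```python
# Minimal sketch: scoring multiple-choice predictions on CMMMU-style records.
# The field names ("discipline", "answer", "prediction") and the toy data below
# are illustrative assumptions, not the benchmark's actual schema or results.
from collections import defaultdict

def accuracy_by_discipline(records):
    """Return overall and per-discipline accuracy for option-letter predictions."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["discipline"]] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            hits[r["discipline"]] += 1
    per_discipline = {d: hits[d] / totals[d] for d in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_discipline

# Toy records standing in for (question, image, options, gold answer, prediction).
toy = [
    {"discipline": "Science", "answer": "A", "prediction": "A"},
    {"discipline": "Science", "answer": "C", "prediction": "B"},
    {"discipline": "Business", "answer": "D", "prediction": "D"},
]
overall, per_discipline = accuracy_by_discipline(toy)
print(f"overall: {overall:.2f}")  # 0.67 on this toy sample
print(per_discipline)             # {'Science': 0.5, 'Business': 1.0}
```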

Overview

  • The paper evaluates the GPT-4V model's ability to understand and reason over multimodal data across the college-level disciplines covered by CMMMU.

  • It critically analyzes the model's accuracy, identifying perceptual, reasoning, and answer extraction errors, as well as knowledge gaps and instances where the model refuses to answer.

  • Examples are provided where the GPT-4V model fails at image interpretation, data extrapolation, selecting correct answers, and domain-specific queries.

  • Enhancements are recommended for the model in areas such as image parsing, complex data comprehension, and domain knowledge.

  • The paper concludes that despite GPT-4V's impressive capabilities, there are significant opportunities for improvement and development.

Understanding the GPT-4V Model's Performance on Diverse Tasks

Introduction to GPT-4V Model Evaluation

The evaluation of the GPT-4V model's capabilities across disciplines highlights the complexity of contextual understanding, particularly in nuanced tasks that combine graphical information with textual metadata. The model's outputs were rigorously tested, not just on simple binary (true/false) questions but also on tasks that require attending to intricate details within imagery and comprehensive data tables.

Analysis of Model Accuracy and Limitations

Detailed examination of the model's responses shows that GPT-4V achieves high accuracy when the context clues in the text and imagery are overt and logically coherent. However, several cases spotlight noteworthy errors, which can be categorized as perceptual errors, reasoning errors, answer extraction errors, lack of knowledge, and outright refusals to answer. Each category sheds light on specific improvements that could enhance the model's outputs.

Examples of Model Performance

  • Instances of Perceptual Errors appeared when the model struggled to interpret images that demanded granular scrutiny or when explicit numerical details were essential for a precise response.
  • Reasoning Errors were most prevalent in tasks requiring extrapolation from the provided data or multiple steps of logic to reach a valid conclusion.
  • Answer Extraction Errors arose when the model understood a question correctly yet the final option could not be reliably recovered from its free-form response (a rule-based extraction sketch follows this list).
  • Lack of Knowledge surfaced when questions demanded expertise outside the model's training scope or nuanced domain knowledge for an accurate answer.
  • Refusal to Answer occurred when the model declined to respond outright, predominantly when ethical considerations arose or domain-specific understanding was lacking.
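To illustrate where answer extraction can break down, here is a minimal rule-based sketch that tries to recover a single option letter from a free-form model response. The patterns, the A-D option range, and the fallback behavior are hypothetical assumptions, not the paper's actual evaluation pipeline.

```python
# Minimal sketch of rule-based answer extraction from a free-form LMM response.
# This is a hypothetical heuristic, not the paper's evaluation pipeline: the
# patterns, option range (A-D), and fallback behavior are assumptions.
import re

def extract_option(response: str):
    """Try to pull a single option letter out of a free-form answer string."""
    # 1) Prefer explicit statements such as "答案是 B" or "The answer is B".
    explicit = re.search(r"(?:答案是?|answer is)\s*[:：]?\s*([A-D])",
                         response, re.IGNORECASE)
    if explicit:
        return explicit.group(1).upper()
    # 2) Otherwise, accept a lone option letter anywhere in the response.
    letters = re.findall(r"\b([A-D])\b", response.upper())
    if len(set(letters)) == 1:
        return letters[0]
    # 3) Ambiguous or missing letter: extraction fails.
    return None

print(extract_option("经过计算，答案是 C。"))           # -> C
print(extract_option("Both B and D look plausible."))  # -> None (ambiguous)
```

When no unambiguous letter can be recovered, the response would be scored as an answer extraction error even if the underlying reasoning was sound.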

Recommendations for Model Improvement

Based on the evaluation, it is evident that stronger image parsing, improved understanding of complex data interactions, and a broader domain-specific knowledge base could propel GPT-4V toward more refined results. In addition, fine-tuning its answer extraction process and inference capabilities could further reduce erroneous outputs, while context-sensitive refusal-to-answer protocols would make its responses more useful to end users.

Conclusion

In summary, while the GPT-4V model exhibits remarkable abilities across a diverse array of tasks, there is clearly room for continued learning and development. Advances in AI modeling will bring greater precision in task execution, more robust application of knowledge, and a more sophisticated approach to complex problem-solving, moving closer to AI that approximates human-level understanding and reasoning.
