
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

(2401.11944)
Published Jan 22, 2024 in cs.CL, cs.AI, and cs.CV

Abstract

As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LLMs and one proprietary GPT-4V(ision). Even GPT-4V only achieves accuracies of 42%, indicating a large space for improvement. CMMMU will boost the community to build the next-generation LMMs towards expert artificial intelligence and promote the democratization of LMMs by providing diverse language contexts.

Figure: examples of complex media (music, chemistry, circuits, etc.) that require expert-level understanding.
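To make the benchmark's accuracy numbers concrete, the following is a minimal sketch of how multiple-choice predictions could be scored against CMMMU-style records, including a per-discipline breakdown. The field names (discipline, answer, prediction) and the toy records are illustrative assumptions rather than the benchmark's actual schema or results.

```python
# Minimal sketch: scoring multiple-choice predictions on CMMMU-style records.
# The field names ("discipline", "answer", "prediction") and the toy data below
# are illustrative assumptions, not the benchmark's actual schema or results.
from collections import defaultdict

def accuracy_by_discipline(records):
    """Return overall and per-discipline accuracy for option-letter predictions."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["discipline"]] += 1
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            hits[r["discipline"]] += 1
    per_discipline = {d: hits[d] / totals[d] for d in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_discipline

# Toy records standing in for (question, image, options, gold answer, prediction).
toy = [
    {"discipline": "Science", "answer": "A", "prediction": "A"},
    {"discipline": "Science", "answer": "C", "prediction": "B"},
    {"discipline": "Business", "answer": "D", "prediction": "D"},
]
overall, per_discipline = accuracy_by_discipline(toy)
print(f"overall: {overall:.2f}")  # 0.67 on this toy sample
print(per_discipline)             # {'Science': 0.5, 'Business': 1.0}
```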

Overview

  • The paper evaluates the GPT-4V model's ability to understand and reason over multimodal data across the college-level disciplines covered by CMMMU.

  • It critically analyzes the model's accuracy, identifying perceptual, reasoning, and answer extraction errors, as well as knowledge gaps and instances where the model refuses to answer.

  • Examples are provided where the GPT-4V model fails at image interpretation, data extrapolation, selecting correct answers, and domain-specific queries.

  • Enhancements are recommended for the model in areas such as image parsing, complex data comprehension, and domain knowledge.

  • The paper concludes that despite GPT-4V's impressive capabilities, there are significant opportunities for improvement and development.

Understanding the GPT-4V Model's Performance on Diverse Tasks

Introduction to GPT-4V Model Evaluation

The evaluation of the GPT-4V model's capabilities across disciplines highlights the complexity of contextual understanding, particularly in nuanced tasks that combine graphical information with textual metadata. The model's outputs were rigorously tested, not just on simple binary (true/false) questions but also on tasks that require attending to intricate details within imagery and comprehensive data tables.

Analysis of Model Accuracy and Limitations

Detailed examination of the model's responses shows that GPT-4V achieves high accuracy when the context clues in the text and imagery are overt and logically coherent. However, several cases spotlight noteworthy errors, which can be categorized as perceptual errors, reasoning errors, answer extraction errors, lack of knowledge, and outright refusals to answer. Each category sheds light on specific improvements that could enhance the model's outputs.

Examples of Model Performance

  • Instances of Perceptual Errors appeared when the model struggled to interpret images that demanded granular scrutiny or when explicit numerical details were essential for a precise response.
  • Reasoning Errors were most prevalent in tasks requiring extrapolation from the provided data or multiple steps of logic to reach a valid conclusion.
  • Answer Extraction Errors arose when the model understood a question correctly yet the final option could not be reliably recovered from its free-form response (a rule-based extraction sketch follows this list).
  • Lack of Knowledge surfaced when questions demanded expertise outside the model's training scope or nuanced domain knowledge for an accurate answer.
  • Refusal to Answer occurred when the model declined to respond outright, predominantly when ethical considerations arose or domain-specific understanding was lacking.
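To illustrate where answer extraction can break down, here is a minimal rule-based sketch that tries to recover a single option letter from a free-form model response. The patterns, the A-D option range, and the fallback behavior are hypothetical assumptions, not the paper's actual evaluation pipeline.

```python
# Minimal sketch of rule-based answer extraction from a free-form LMM response.
# This is a hypothetical heuristic, not the paper's evaluation pipeline: the
# patterns, option range (A-D), and fallback behavior are assumptions.
import re

def extract_option(response: str):
    """Try to pull a single option letter out of a free-form answer string."""
    # 1) Prefer explicit statements such as "答案是 B" or "The answer is B".
    explicit = re.search(r"(?:答案是?|answer is)\s*[:：]?\s*([A-D])",
                         response, re.IGNORECASE)
    if explicit:
        return explicit.group(1).upper()
    # 2) Otherwise, accept a lone option letter anywhere in the response.
    letters = re.findall(r"\b([A-D])\b", response.upper())
    if len(set(letters)) == 1:
        return letters[0]
    # 3) Ambiguous or missing letter: extraction fails.
    return None

print(extract_option("经过计算，答案是 C。"))           # -> C
print(extract_option("Both B and D look plausible."))  # -> None (ambiguous)
```

When no unambiguous letter can be recovered, the response would be scored as an answer extraction error even if the underlying reasoning was sound.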

Recommendations for Model Improvement

Based on the evaluation, it is evident that stronger image parsing, improved understanding of complex data interactions, and a broader domain-specific knowledge base could propel GPT-4V toward more refined results. In addition, fine-tuning its answer extraction process and inference capabilities could further reduce erroneous outputs, while context-sensitive refusal-to-answer protocols would make its responses more useful to end users.

Conclusion

In summary, while the GPT-4V model exhibits remarkable abilities across a diverse array of tasks, there is clearly room for continued learning and development. Advances in AI modeling will bring greater precision in task execution, more robust application of knowledge, and a more sophisticated approach to complex problem-solving, moving closer to AI that approximates human-level understanding and reasoning.
