
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts (2310.02255v3)

Published 3 Oct 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: LLMs and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.


Summary

  • The paper presents MathVista, a novel benchmark for evaluating mathematical reasoning in visually rich contexts from 6,141 multimodal examples.
  • It examines seven reasoning types across five tasks, showing GPT-4V's performance at 49.9% accuracy—still 10.4% below human levels.
  • The analysis highlights GPT-4V's strong algebraic and geometric reasoning and its emergent self-verification capability, pointing to future research opportunities.

Evaluating the Mathematical Reasoning Abilities of Modern Models on MathVista

The paper "MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts" presents a comprehensive benchmark, MathVista, to evaluate the mathematical reasoning capabilities of state-of-the-art models in visually rich environments. This endeavor aims to bridge the evident gap in existing evaluations that predominantly focus on textual mathematical reasoning, thereby overlooking the intrinsic visual nature of many mathematical problems.

MathVista comprises 6,141 examples drawn from 28 existing multimodal datasets involving mathematics, together with three newly created datasets: IQTest, FunctionQA, and PaperQA. The new datasets fill gaps left by existing resources by emphasizing logical reasoning over puzzles, algebraic reasoning over function plots, and scientific reasoning over academic figures.
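For readers who want to inspect the benchmark directly, a minimal loading sketch is shown below. It assumes the dataset is distributed through the Hugging Face Hub under the identifier AI4Math/MathVista and uses field names commonly reported for the release; both the identifier and the fields should be verified against the project page before use.

```python
# Minimal sketch of loading and inspecting MathVista with the Hugging Face
# `datasets` library. The dataset identifier "AI4Math/MathVista", the split
# name "testmini", and the field names below are assumptions; verify them
# at https://mathvista.github.io/ before running.
from datasets import load_dataset

dataset = load_dataset("AI4Math/MathVista", split="testmini")

example = dataset[0]
print(example["question"])       # the question text
print(example["question_type"])  # e.g. multiple-choice vs. free-form
print(example["answer"])         # ground-truth answer
example["decoded_image"].show()  # the associated figure as a PIL image
```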

The evaluation benchmark covers five primary tasks: Figure Question Answering (FQA), Geometry Problem Solving (GPS), Math Word Problem (MWP), Textbook Question Answering (TQA), and Visual Question Answering (VQA). The paper focuses on seven core types of mathematical reasoning: algebraic, arithmetic, geometry, logical, numeric commonsense, scientific, and statistical reasoning.
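Because results are reported both per task and per reasoning type, a breakdown along those two axes is a natural way to organize one's own evaluation. The sketch below is purely illustrative: the record schema (keys "task", "skills", "prediction", and "answer") is a hypothetical layout, not the paper's released evaluation code.

```python
# Hedged sketch: aggregating accuracy by task and by reasoning skill from a
# list of prediction records. The record layout is a hypothetical schema
# chosen for illustration.
from collections import defaultdict

def accuracy_breakdown(records):
    """Return overall, per-task, and per-skill accuracy."""
    per_task = defaultdict(lambda: [0, 0])   # task -> [correct, total]
    per_skill = defaultdict(lambda: [0, 0])  # skill -> [correct, total]
    overall = [0, 0]

    for r in records:
        hit = int(r["prediction"] == r["answer"])
        overall[0] += hit
        overall[1] += 1
        per_task[r["task"]][0] += hit
        per_task[r["task"]][1] += 1
        for skill in r["skills"]:            # one example can need several skills
            per_skill[skill][0] += hit
            per_skill[skill][1] += 1

    ratio = lambda c_t: c_t[0] / c_t[1] if c_t[1] else 0.0
    return (
        ratio(overall),
        {t: ratio(v) for t, v in per_task.items()},
        {s: ratio(v) for s, v in per_skill.items()},
    )
```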

A thorough analysis was conducted on 12 prominent models, including LLMs and LMMs. GPT-4V, the multimodal version of GPT-4, demonstrated superior performance by achieving an overall accuracy of 49.9%, surpassing the Multimodal Bard, which stood at 34.8%. Despite this significant advancement, GPT-4V still falls 10.4% short of human performance, highlighting considerable scope for improvement.

Moreover, GPT-4V excelled particularly in algebraic and geometric reasoning, even surpassing human performance in some visual contexts like function plots and geometry diagrams. The analysis also revealed the emergent capability of GPT-4V to perform self-verification, which involves refining responses through internal consistency checks.
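Self-consistency, which the paper also applies to GPT-4V, samples several reasoning paths and keeps the majority answer. The sketch below illustrates the general technique; `query_model` and `extract_answer` are hypothetical stand-ins for an actual model API call and for the paper's answer-extraction step, and the sampling temperature is an arbitrary choice.

```python
# Hedged sketch of self-consistency: sample several reasoning paths from the
# model, extract a final answer from each, and return the majority answer.
# `query_model` and `extract_answer` are hypothetical callables supplied by
# the user; they are not part of any released MathVista code.
from collections import Counter

def self_consistency(query_model, extract_answer, prompt, n_samples=5):
    """Return the most frequent extracted answer across sampled responses."""
    answers = []
    for _ in range(n_samples):
        response = query_model(prompt, temperature=0.7)  # diverse sampling
        answers.append(extract_answer(response))
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```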

From a broader perspective, the paper underscores the need to continue developing and refining general-purpose AI capable of effective mathematical reasoning within visual contexts. The limitations observed in current models, including difficulties with logical reasoning and with interpreting complex figures, point to promising directions for future research and development.

In conclusion, MathVista stands as a pivotal contribution to the evaluation of AI models, offering a rigorous framework which underscores both the progress and challenges that lie ahead in mathematical reasoning within visual contexts. Achieving parity with human reasoning abilities across diverse tasks and contexts remains an ambitious yet critical goal for the AI research community.
