- The paper introduces the GHOSTS and miniGHOSTS datasets to systematically assess ChatGPT’s mathematical reasoning and track performance improvements.
- It evaluates model competencies and pinpoints weaknesses through manual inspection and multi-axis error categorization, applied to the 709 GHOSTS prompts and a refined 170-prompt miniGHOSTS subset.
- The findings reveal that while ChatGPT excels in basic fact retrieval, it struggles with advanced proof-based tasks, with GPT-4 showing significant gains.
Mathematical Capabilities of ChatGPT: An Academic Overview
The paper under discussion provides a rigorous evaluation of the mathematical capabilities of OpenAI's ChatGPT and its successor, GPT-4, with a focus on assessing their utility as mathematical assistants. The researchers, led by Simon Frieder and colleagues from institutions including the University of Oxford and Vienna University of Technology, present a novel approach built around two newly curated datasets, GHOSTS and miniGHOSTS. These datasets are designed to test mathematical reasoning comprehensively, reaching up to graduate-level mathematics, and they categorize the tasks along several dimensions, such as difficulty level and question type.
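To make these dimensions concrete, here is a minimal sketch of one plausible way a GHOSTS-style entry could be represented. The field names and tag vocabularies are assumptions made for this overview, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GhostsEntry:
    """Hypothetical record for one GHOSTS-style prompt.

    Field names and tag vocabularies are illustrative assumptions;
    the released dataset defines its own schema.
    """
    prompt: str          # the question posed to the model
    subdataset: str      # e.g. an olympiad or graduate-textbook subset
    difficulty: int      # e.g. 1 (routine) up to 3 (graduate-level)
    question_type: str   # e.g. "proof", "computation", "fact-retrieval"

entry = GhostsEntry(
    prompt="Prove that every bounded monotone sequence of reals converges.",
    subdataset="graduate-textbook",
    difficulty=2,
    question_type="proof",
)
```

Tagging each prompt along independent dimensions like this is what allows performance to be reported per category rather than as a single aggregate score.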
Main Contributions
The authors articulate three key contributions. First, they introduce the GHOSTS dataset, designed to evaluate the mathematical capabilities of LLMs across a spectrum of mathematical reasoning tasks. Second, they provide insights into practical use cases for ChatGPT in mathematical contexts, highlighting areas where it is competent, such as querying factual mathematical knowledge. Finally, they track how mathematical performance evolves across successive versions of ChatGPT, noting the improvements brought by GPT-4.
Evaluation and Methodology
The evaluation entails manual inspection of model outputs combined with a multi-axis error categorization, applied to ChatGPT on 709 prompts and later refined into the miniGHOSTS dataset of 170 prompts used to benchmark GPT-4. The prompts span tasks ranging from fact retrieval and exercise-level questions to graduate-level and olympiad-style problems. The researchers found that ChatGPT performs well on mathematical fact retrieval and simple logic but struggles significantly with complex proof-based tasks, reflecting its limited capability to tackle advanced mathematics consistently.
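The paper's exact error taxonomy is not reproduced here, but the core idea of multi-axis categorization, in which one response can be flagged on several independent axes at once, can be sketched as follows; the axis names below are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical error axes: a single response can fail on several at once,
# which is why a scalar grade alone does not localize the weakness.
ERROR_AXES = frozenset({
    "logical_gap",        # a proof step does not follow
    "computational",      # an arithmetic or algebraic slip
    "misread_question",   # answers a different problem than was asked
    "hallucinated_fact",  # invokes a false theorem or definition
})

@dataclass
class GradedResponse:
    rating: int                                # 1-5 grade from a human reviewer
    errors: set[str] = field(default_factory=set)

    def flag(self, axis: str) -> None:
        """Record an error on one axis, rejecting unknown axis names."""
        if axis not in ERROR_AXES:
            raise ValueError(f"unknown error axis: {axis}")
        self.errors.add(axis)

# A response can earn a middling grade while failing on multiple axes.
response = GradedResponse(rating=3)
response.flag("logical_gap")
response.flag("hallucinated_fact")
```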
The rating system uses a scale from 1 to 5, where a score of 3.5 or higher counts as a satisfactory, passing grade. The analyses show that both January 2023 versions of ChatGPT fall below this threshold in most of the complex categories, whereas GPT-4 improves markedly, achieving a mean score of 4.15 on miniGHOSTS.
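As a quick check on how the threshold operates, the snippet below aggregates per-prompt grades and compares the mean against the 3.5 bar. The individual grades are invented for illustration; only the threshold and GPT-4's reported mean of 4.15 come from the paper.

```python
from statistics import fmean

PASSING = 3.5  # the paper's bar for a satisfactory (passing) grade

def grade(ratings: list[int]) -> tuple[float, bool]:
    """Return the mean rating and whether it clears the passing bar."""
    avg = fmean(ratings)
    return avg, avg >= PASSING

# Invented per-prompt grades, not the paper's data.
chatgpt_jan_2023 = [3, 2, 4, 3, 3]
gpt4_minighosts = [4, 5, 4, 4, 4]

print(grade(chatgpt_jan_2023))  # (3.0, False) -- below the bar
print(grade(gpt4_minighosts))   # (4.2, True)  -- clears it
```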
Comparative Findings
Compared with specialized models such as those of~\cite{lample2019deep} and~\cite{lewkowycz2022solving}, ChatGPT remains inadequate, particularly on symbolic integration and on problem-solving tasks that require nuanced mathematical insight. GPT-4, by contrast, exhibits enhanced capabilities in undergraduate-level mathematics, yet remains insufficient for graduate-level problem solving.
Conclusion and Implications
The paper provides an academically rigorous checkpoint of ChatGPT's mathematical prowess, positioning it as a potentially useful tool for mathematical fact retrieval and basic calculations, but one of limited value for solving non-trivial mathematical problems. The results indicate a need for continued development of LLMs to improve their mathematical reasoning capabilities. The paper suggests that although GPT-4 shows progress, significant gaps remain, particularly in tasks requiring intricate logical deductions or complex computations.
Future research should focus on expanding datasets like GHOSTS, refining evaluation metrics, and exploring adaptive learning techniques to enhance LLMs' proficiency on advanced mathematical tasks. The findings suggest that LLMs could become genuinely useful in mathematical work, provided such improvements are sustained. The paper thus makes a substantial contribution to the discourse on AI's evolving ability to handle specialized knowledge domains such as advanced mathematics.