How well do Large Language Models perform in Arithmetic tasks? (2304.02015v1)

Published 16 Mar 2023 in cs.CL and cs.AI

Abstract: LLMs have emerged abilities including chain-of-thought to answer math word problems step by step. Solving math word problems not only requires abilities to disassemble problems via chain-of-thought but also needs to calculate arithmetic expressions correctly for each step. To the best of our knowledge, there is no work to focus on evaluating the arithmetic ability of LLMs. In this work, we propose an arithmetic dataset MATH 401 to test the latest LLMs including GPT-4, ChatGPT, InstrctGPT, Galactica, and LLaMA with various arithmetic expressions and provide a detailed analysis of the ability of LLMs. MATH 401 and evaluation codes are released at \url{https://github.com/GanjinZero/math401-LLM}.

Citations (99)

View on Semantic Scholar

Summary

The paper evaluates arithmetic capabilities across multiple LLMs using the comprehensive MATH 401 dataset to benchmark accuracy from basic operations to complex calculations.
The paper demonstrates that GPT-4 and ChatGPT outperform others, emphasizing the impact of training corpus diversity, tokenization strategies, and fine-tuning techniques.
The findings reveal that while scaling LLMs generally improves arithmetic performance, strategic enhancements in prompt engineering and instructional data are crucial for robust numerical reasoning.

Performance of LLMs in Arithmetic Tasks: An Analysis

The paper "How well do LLMs perform in Arithmetic tasks?" by Zheng Yuan et al., examines the arithmetic capabilities of various state-of-the-art LLMs. Recognizing that solving arithmetic problems is critical for successfully answering math word problems, the researchers introduce a comprehensive dataset named MATH 401. This dataset evaluates models across a range of arithmetic operators and numeric types, shedding light on LLMs' numerical computation abilities.

The researchers evaluate prominent LLMs, including GPT-4, ChatGPT, InstructGPT, Galactica, and LLaMA, using MATH 401, which presents arithmetic challenges involving basic operations like addition and subtraction, as well as more complex tasks such as exponentiation, trigonometric functions, and logarithms. GPT-4 and ChatGPT achieved superior results, outperforming others with substantial margins in accuracy across these arithmetic tasks.

Dataset and Evaluation

The dataset underscores varying difficulty levels, from simple operations like addition of small integers to complex calculations involving irrational numbers and logarithmic functions. Accuracy was determined by comparing the models' outputs to target solutions within a particular tolerance level.

The GPT-4 model notably excelled, achieving the highest scores across all groups, demonstrating a well-rounded proficiency in handling a diverse range of arithmetic expressions. ChatGPT, following closely, demonstrated significant capabilities too, particularly in tasks like long arithmetic expressions and computations involving irrational numbers, although it faced challenges with large number multiplications and specific functions.

Key Findings

Results revealed several factors that impact LLM arithmetic performance:

Tokenization: It was observed that digit-level tokenization employed by Galactica and LLaMA models contributed to their arithmetic performance. This approach redistributes token frequency, which could be beneficial for arithmetic understanding.
Training Corpus: A diverse pre-training corpus enriched with code and mathematical data (e.g., LATEX sources) significantly boosts arithmetic skills. Galactica's success in arithmetic can be partially attributed to its extensive LATEX pre-training, while code-specific models like Code-davinci-002 showed moderate arithmetic prowess.
Instruction Fine-tuning and RLHF: Fine-tuning with an instructional dataset improves performance. For example, RLHF-enhanced models such as text-davinci-003 outperformed their counterparts.
Prompting Strategy: Using structured prompts was critical. Specific prompts significantly enhanced model outputs, underscoring the importance of careful prompt engineering.

Scaling and Model Size

Analysis of different model sizes highlighted that an increase in parameter count generally correlates with improved arithmetic abilities, but with diminishing returns beyond a certain threshold (around 30 billion parameters). This suggests that scaling alone may be insufficient for substantial arithmetic improvements beyond this scale, particularly when considering ChatGPT's impressive capabilities despite unknown parameter details.

Conclusion and Implications

The paper provides substantial insights into the arithmetic capabilities of LLMs, a foundational aspect for solving more intricate math problems. This evaluation identifies areas where LLMs excel and where they struggle, guiding future research to enhance these models further.

Future research should explore additional mathematical domains like calculus and algebra, leveraging the understanding gleaned from arithmetic evaluations. The integration of arithmetic skills with symbolic reasoning could enhance LLMs' applicability in more sophisticated scientific and mathematical contexts. There is a need for continued exploration of prompt engineering techniques and instructional data to bolster LLMs' operational efficiency across various domains. This paper sets a foundational benchmark for assessing LLMs' numerical skills, shaping ongoing research in LLM development and fine-tuning methodologies.

PDF Markdown

Related Papers

GitHub

GitHub - GanjinZero/math401-llm: Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks? (57 stars)