MathScale: Scaling Instruction Tuning for Mathematical Reasoning (2403.02884v1)
Abstract: LLMs have demonstrated remarkable problem-solving capabilities, yet their proficiency in solving mathematical problems remains limited. We propose MathScale, a simple and scalable method for creating high-quality mathematical reasoning data with frontier LLMs (e.g., {\tt GPT-3.5}). Inspired by the cognitive mechanism of human mathematical learning, MathScale first extracts topics and knowledge points from seed math questions, then builds a concept graph over them, which is subsequently sampled to generate new math questions. MathScale scales effectively along the size axis of the generated data: we use it to create MathScaleQA, a mathematical reasoning dataset of two million question-answer pairs. To evaluate the mathematical reasoning abilities of LLMs comprehensively, we also construct {\sc MwpBench}, a benchmark of math word problems comprising ten datasets (including GSM8K and MATH) that span K-12, college, and competition-level math. Fine-tuning open-source LLMs (e.g., LLaMA-2 and Mistral) on MathScaleQA substantially improves their mathematical reasoning. Evaluated on {\sc MwpBench}, MathScale-7B achieves state-of-the-art performance across all ten datasets, surpassing the best peer model of equivalent size by 42.9\% in micro-average accuracy and 43.7\% in macro-average accuracy.
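The pipeline behind MathScaleQA, as described above, has three steps: extract topics and knowledge points from seed questions, connect them in a concept graph, and sample concept combinations from that graph to prompt an LLM for new question-answer pairs. The sketch below shows one plausible shape for such a pipeline; every function, data structure, and heuristic in it (the canned extraction output, the co-occurrence graph, the random-walk sampler) is an assumption made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a MathScale-style data-generation pipeline.
# Function names, graph construction, and the sampler are illustrative
# assumptions; only the three-step structure comes from the abstract.
import itertools
import random
from collections import defaultdict

def extract_concepts(seed_question: str) -> list[str]:
    """Stand-in for the LLM call (a frontier model such as GPT-3.5,
    per the abstract) that names the topics and knowledge points a
    seed question exercises. Here we return canned values."""
    return ["ratio and proportion", "cross multiplication", "unit rates"]

def build_concept_graph(seed_questions: list[str]) -> dict[str, set[str]]:
    """Co-occurrence graph: nodes are extracted concepts; an edge links
    two concepts that appear in the same seed question."""
    graph: dict[str, set[str]] = defaultdict(set)
    for question in seed_questions:
        concepts = extract_concepts(question)
        for a, b in itertools.combinations(concepts, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def sample_concepts(graph: dict[str, set[str]], k: int = 3) -> list[str]:
    """Random walk over the graph to pick a small set of related
    concepts that will seed one new question."""
    node = random.choice(list(graph))
    walk = [node]
    while len(walk) < k:
        candidates = [n for n in graph[node] if n not in walk]
        if not candidates:
            break
        node = random.choice(candidates)
        walk.append(node)
    return walk

def generate_qa(concepts: list[str]) -> str:
    """Stand-in for the LLM call that composes a new question-answer
    pair covering the sampled concepts."""
    return f"<prompt an LLM for a math QA pair covering: {', '.join(concepts)}>"

if __name__ == "__main__":
    seeds = ["If 3 pencils cost $1.50, how much do 7 pencils cost?"]
    graph = build_concept_graph(seeds)
    print(generate_qa(sample_concepts(graph)))
```

On the reported metrics: in the standard sense, micro-average accuracy pools all test examples across the ten {\sc MwpBench} datasets before computing accuracy, while macro-average accuracy is the unweighted mean of the per-dataset accuracies; the abstract does not state whether the paper departs from these definitions.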
- Chen, W., Ma, X., Wang, X., and Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.
- Chern, E., Zou, H., Li, X., Hu, J., Feng, K., Li, J., and Liu, P. Generative AI for math: Abel. https://github.com/GAIR-NLP/abel, 2023.
- Cobbe, K., Kosaraju, V., Bavarian, M., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Corral, M. Vector Calculus. 2008.
- Evans, M. J. and Rosenthal, J. S. Probability and Statistics: The Science of Uncertainty. Macmillan, 2004.
- Gao, L., Madaan, A., Zhou, S., et al. PAL: Program-aided language models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 10764–10799. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/gao23f.html.
- Gou, Z., Shao, Z., Gong, Y., et al. ToRA: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023.
- Grinstead, C. M. and Snell, J. L. Introduction to Probability. Chance Project, 2006.
- Guichard, D. Calculus. 2009.
- Hendrycks, D., Burns, C., Kadavath, S., et al. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. In Su, J., Duh, K., and Carreras, X. (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1139. URL https://aclanthology.org/D16-1139.
- Kuttler, K. A First Course in Linear Algebra, 2017A version (Lyryx). Lyryx, 2017.
- Luo, H., Sun, Q., Xu, C., et al. WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583, 2023.
- Selinger, P. Matrix theory and linear algebra, 2018. URL https://www.mathstat.dal.ca/~selinger/linear-algebra/. An introduction to linear algebra for first or second year university students. Licensed under Creative Commons CC BY 4.0 License. Last updated on October 26, 2018.
- Stitz, C. and Zeager, J. Precalculus. Stitz Zeager Open Source Mathematics, 2013.
- TAL. TAL-SCQ5K, 2023. URL https://github.com/math-eval/TAL-SCQ5K. GitHub repository.
- Tall, D. How humans learn to think mathematically: Exploring the three worlds of mathematics. Cambridge University Press, 2013.
- Taori, R., Gulrajani, I., Zhang, T., et al. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Touvron, H., Martin, L., Stone, K., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Trench, W. F. Elementary Differential Equations. Brooks/Cole Thomson Learning, San Antonio, Texas, USA, 2001. URL http://ramanujan.math.trinity.edu/wtrench/texts/TRENCH_DIFF_EQNS_I.PDF. Free Edition 1.01 (December 2013).
- Wallace, T. Beginning and intermediate algebra. 2010.
- Wang, Y., Liu, X., and Shi, S. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 845–854, 2017.
- Wang, Y., Ivison, H., Dasigi, P., et al. How far can camels go? Exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023.
- Wei, J., Bosma, M., Zhao, V. Y., et al. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Wei, J., Wang, X., Schuurmans, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Xu, C., Sun, Q., Zheng, K., et al. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- Yu, L., Jiang, W., Shi, H., et al. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Yue, X., Qu, X., Zhang, G., et al. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
- Zhang, X., Li, C., Zong, Y., et al. Evaluating the performance of large language models on Gaokao benchmark. 2023.
- Zhao, W., Shang, M., Liu, Y., Wang, L., and Liu, J. Ape210K: A large-scale and template-rich dataset of math word problems. arXiv preprint arXiv:2009.11506, 2020.
- Zhong, W., Cui, R., Guo, Y., et al. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.