Emergent Mind

TinyGSM: achieving >80% on GSM8k with small language models

(2312.09241)
Published Dec 14, 2023 in cs.LG and cs.CL

Abstract

Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark is 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 "teacher" model (77.4%), from which our model's training data is generated. Our approach is simple and has two key components: 1) the high-quality dataset TinyGSM, 2) the use of a verifier, which selects the final outputs from multiple candidate generations.

Graph showing improved Pass@1 results on the GSM8K test set when answers with top verifier scores are selected.

Overview

  • The paper challenges the assumption that large model sizes are necessary by demonstrating that small models can achieve high performance on grade school math problems.

  • The TinyGSM dataset is introduced, consisting of synthetic math problems generated by GPT-3.5 and used to train a pair of 1.3B-parameter models.

  • These smaller models achieved 81.5% accuracy on GSM8K, surpassing larger models and showing the efficacy of quality data and fine-tuning.

  • A verifier model boosts performance by selecting the most probable solutions, with the diversity of its training data mattering more than scaling up the generation model.

  • The study challenges current beliefs about model size and problem-solving, contributes a new dataset, and highlights verifier model benefits.

Introduction

In the field of AI, particularly within the domain of language models (LMs), there continues to be a debate regarding the necessity of large model sizes for complex problem-solving. An especially intriguing area of application is the ability of these models to solve grade school math problems, which require a blend of mathematical reasoning and language understanding. The canonical benchmark for assessing this capability in models is the GSM8K dataset, which is challenging even for LLMs.

TinyGSM and Verifier Model

The research paper presents the TinyGSM dataset, consisting of 12.3 million high-quality synthetic grade school math problems paired with Python solutions, all generated by GPT-3.5. When used to fine-tune a pair of modestly sized 1.3-billion-parameter models (a generation model and an independent verifier model), an accuracy of 81.5% was achieved on the GSM8K benchmark. This level of performance surpasses even much larger models and is significant because it demonstrates that smaller models, with the right training data and strategies, can display advanced problem-solving capabilities comparable to those of their much larger counterparts.
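To make the dataset format concrete, here is a hypothetical TinyGSM-style entry: a grade school word problem paired with an executable Python solution. The problem text, function name, and numbers are illustrative placeholders, not actual dataset contents.

```python
# Hypothetical TinyGSM-style entry (illustrative only; the real problems
# and solutions are generated by GPT-3.5).

problem = (
    "Sally has 4 boxes of crayons. Each box holds 12 crayons. "
    "She gives 9 crayons to her friend. How many crayons does Sally have left?"
)

def solution():
    boxes = 4
    crayons_per_box = 12
    given_away = 9
    total = boxes * crayons_per_box   # total crayons before giving any away
    remaining = total - given_away    # crayons left after the gift
    return remaining

print(solution())  # 39
```

Pairing each problem with runnable code rather than free-form text lets correctness be checked by execution, which is one reason Python solutions are a natural target format for this kind of synthetic data.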

Training and Performance

The researchers show that smaller models, when fine-tuned on the TinyGSM dataset, perform remarkably well, with even a 125M-parameter model attaining 63.1% accuracy on the GSM8K test set. The study sheds light on two elements behind this performance: first, the high-quality dataset, and second, the use of a verifier model that selects the most probable solution from multiple candidate generations. Interestingly, the diversity of the verifier's training data seems to matter more than merely scaling up the generation model, pointing to more efficient parameter usage in the verifier.
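The verifier-based selection described above can be sketched as a best-of-N procedure: the generation model samples several candidate solutions, the verifier assigns each a score, and the top-scoring candidate's answer becomes the final output. The function name and the toy scores below are hypothetical, not the paper's API.

```python
# Minimal sketch of best-of-N selection with a verifier. In the actual
# system, `scores` would come from a trained 1.3B verifier model; here
# they are hard-coded toy values for illustration.

def select_with_verifier(candidates, scores):
    """Return the candidate whose verifier score is highest."""
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]

candidates = ["39", "42", "39"]   # final answers from N sampled generations
scores = [0.87, 0.31, 0.74]       # hypothetical verifier scores in [0, 1]

print(select_with_verifier(candidates, scores))  # 39
```

Because the verifier only has to rank complete candidate solutions rather than generate them token by token, its capacity can be spent on judging correctness, which is consistent with the paper's observation that verifier training diversity matters more than generator scale.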

Potential and Contributions

This work challenges prevailing notions that LLMs need to be large to be effective problem solvers, especially in mathematical reasoning. Not only does it open up new avenues for using smaller, more computationally-friendly models in various applications, but it also contributes a synthetic dataset that could prove invaluable for future research. Additionally, this study offers insights into the importance of verifier models and diverse training data. Future research could explore different solution formats and further investigate the intriguing relationship between the sizes of generation models and verifiers.
