MathScale: Scaling Instruction Tuning for Mathematical Reasoning (2403.02884v1)
Abstract: LLMs have demonstrated remarkable problem-solving capabilities, yet their proficiency in solving mathematical problems remains limited. We propose MathScale, a simple and scalable method for creating high-quality mathematical reasoning data with frontier LLMs (e.g., {\tt GPT-3.5}). Inspired by the cognitive mechanism of human mathematical learning, MathScale first extracts topics and knowledge points from seed math questions, then builds a concept graph over them, which is subsequently sampled to generate new math questions. MathScale scales effectively along the size axis of the generated data: we use it to create MathScaleQA, a mathematical reasoning dataset of two million question-answer pairs. To evaluate the mathematical reasoning abilities of LLMs comprehensively, we also construct {\sc MwpBench}, a benchmark of math word problems comprising ten datasets (including GSM8K and MATH) that span K-12, college, and competition-level math. Fine-tuning open-source LLMs (e.g., LLaMA-2 and Mistral) on MathScaleQA substantially improves their mathematical reasoning. Evaluated on {\sc MwpBench}, MathScale-7B achieves state-of-the-art performance across all ten datasets, surpassing the best peer model of equivalent size by 42.9\% in micro-average accuracy and 43.7\% in macro-average accuracy.
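The pipeline behind MathScaleQA, as described above, has three steps: extract topics and knowledge points from seed questions, connect them in a concept graph, and sample concept combinations from that graph to prompt an LLM for new question-answer pairs. The sketch below shows one plausible shape for such a pipeline; every function, data structure, and heuristic in it (the canned extraction output, the co-occurrence graph, the random-walk sampler) is an assumption made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a MathScale-style data-generation pipeline.
# Function names, graph construction, and the sampler are illustrative
# assumptions; only the three-step structure comes from the abstract.
import itertools
import random
from collections import defaultdict

def extract_concepts(seed_question: str) -> list[str]:
    """Stand-in for the LLM call (a frontier model such as GPT-3.5,
    per the abstract) that names the topics and knowledge points a
    seed question exercises. Here we return canned values."""
    return ["ratio and proportion", "cross multiplication", "unit rates"]

def build_concept_graph(seed_questions: list[str]) -> dict[str, set[str]]:
    """Co-occurrence graph: nodes are extracted concepts; an edge links
    two concepts that appear in the same seed question."""
    graph: dict[str, set[str]] = defaultdict(set)
    for question in seed_questions:
        concepts = extract_concepts(question)
        for a, b in itertools.combinations(concepts, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def sample_concepts(graph: dict[str, set[str]], k: int = 3) -> list[str]:
    """Random walk over the graph to pick a small set of related
    concepts that will seed one new question."""
    node = random.choice(list(graph))
    walk = [node]
    while len(walk) < k:
        candidates = [n for n in graph[node] if n not in walk]
        if not candidates:
            break
        node = random.choice(candidates)
        walk.append(node)
    return walk

def generate_qa(concepts: list[str]) -> str:
    """Stand-in for the LLM call that composes a new question-answer
    pair covering the sampled concepts."""
    return f"<prompt an LLM for a math QA pair covering: {', '.join(concepts)}>"

if __name__ == "__main__":
    seeds = ["If 3 pencils cost $1.50, how much do 7 pencils cost?"]
    graph = build_concept_graph(seeds)
    print(generate_qa(sample_concepts(graph)))
```

On the reported metrics: in the standard sense, micro-average accuracy pools all test examples across the ten {\sc MwpBench} datasets before computing accuracy, while macro-average accuracy is the unweighted mean of the per-dataset accuracies; the abstract does not state whether the paper departs from these definitions.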
- Chen, W., Ma, X., Wang, X., and Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.
- Chern, E., Zou, H., Li, X., Hu, J., Feng, K., Li, J., and Liu, P. Generative AI for math: Abel. https://github.com/GAIR-NLP/abel, 2023.
- Cobbe, K., Kosaraju, V., Bavarian, M., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Corral, M. Vector Calculus. 2008.
- Evans, M. J. and Rosenthal, J. S. Probability and Statistics: The Science of Uncertainty. Macmillan, 2004.
- Gao, L., Madaan, A., Zhou, S., et al. PAL: Program-aided language models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 10764–10799. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/gao23f.html.
- Gou, Z., Shao, Z., Gong, Y., et al. ToRA: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023.
- Grinstead, C. M. and Snell, J. L. Introduction to Probability. Chance Project, 2006.
- Guichard, D. Calculus. 2009.
- Hendrycks, D., Burns, C., Kadavath, S., et al. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- Kim, Y. and Rush, A. M. Sequence-level knowledge distillation. In Su, J., Duh, K., and Carreras, X. (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1139. URL https://aclanthology.org/D16-1139.
- Kuttler, K. A First Course in Linear Algebra, 2017A version (Lyryx). Lyryx, 2017.
- Luo, H., Sun, Q., Xu, C., et al. WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583, 2023.
- Selinger, P. Matrix theory and linear algebra, 2018. URL https://www.mathstat.dal.ca/~selinger/linear-algebra/. An introduction to linear algebra for first or second year university students. Licensed under Creative Commons CC BY 4.0 License. Last updated on October 26, 2018.
- Stitz, C. and Zeager, J. Precalculus. Stitz Zeager Open Source Mathematics, 2013.
- TAL. TAL-SCQ5K, 2023. URL https://github.com/math-eval/TAL-SCQ5K. GitHub repository.
- Tall, D. How humans learn to think mathematically: Exploring the three worlds of mathematics. Cambridge University Press, 2013.
- Taori, R., Gulrajani, I., Zhang, T., et al. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Touvron, H., Martin, L., Stone, K., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Trench, W. F. Elementary Differential Equations. Brooks/Cole Thomson Learning, San Antonio, Texas, USA, 2001. URL http://ramanujan.math.trinity.edu/wtrench/texts/TRENCH_DIFF_EQNS_I.PDF. Free Edition 1.01 (December 2013).
- Wallace, T. Beginning and intermediate algebra. 2010.
- Wang, Y., Liu, X., and Shi, S. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 845–854, 2017.
- Wang, Y., Ivison, H., Dasigi, P., et al. How far can camels go? Exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023.
- Wei, J., Bosma, M., Zhao, V. Y., et al. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Wei, J., Wang, X., Schuurmans, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Xu, C., Sun, Q., Zheng, K., et al. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- Yu, L., Jiang, W., Shi, H., et al. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Yue, X., Qu, X., Zhang, G., et al. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
- Zhang, X., Li, C., Zong, Y., et al. Evaluating the performance of large language models on Gaokao benchmark. 2023.
- Zhao, W., Shang, M., Liu, Y., Wang, L., and Liu, J. Ape210K: A large-scale and template-rich dataset of math word problems. arXiv preprint arXiv:2009.11506, 2020.
- Zhong, W., Cui, R., Guo, Y., et al. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.