OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset (2402.10176v2)
Abstract: Recent work has shown the immense potential of synthetically generated datasets for training LLMs, especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed using outputs from closed-source LLMs with commercially restrictive licenses. A key reason limiting the use of open-source LLMs in these data generation pipelines has been the wide gap between the mathematical skills of the best closed-source LLMs, such as GPT-4, and the best open-source LLMs. Building on the recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by synthesizing code-interpreter solutions for GSM8K and MATH, two popular math reasoning benchmarks, using the recently released and permissively licensed Mixtral model. Our best model, OpenMath-CodeLlama-70B, trained on a subset of OpenMathInstruct-1, achieves a score of 84.6% on GSM8K and 50.7% on MATH, which is competitive with the best GPT-distilled models. We release our code, models, and the OpenMathInstruct-1 dataset under a commercially permissive license.
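The generate-and-filter recipe described in the abstract is simple enough to sketch. The following is a minimal illustration, not the released pipeline: `sample_solutions` is a hypothetical stub standing in for few-shot sampling from Mixtral, the `<llm-code>` delimiters assume the dataset's code-interpreter solution format, and the exact-string answer check is a simplification of the paper's answer matching (which must, for example, handle symbolic answers in MATH).

```python
import re
import subprocess
import sys

def sample_solutions(problem: str, k: int = 4) -> list[str]:
    # Stub standing in for few-shot sampling from Mixtral; the canned
    # completion below exists only so this sketch runs end to end.
    canned = (
        "Let's solve this with code.\n"
        "<llm-code>\nprint(2 + 2)\n</llm-code>"
    )
    return [canned] * k

def extract_code(solution: str) -> str | None:
    # Pull the code block out of a code-interpreter style solution.
    match = re.search(r"<llm-code>(.*?)</llm-code>", solution, re.DOTALL)
    return match.group(1) if match else None

def execute(code: str, timeout: float = 10.0) -> str:
    # Run the candidate's code in a subprocess and capture its stdout.
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout.strip()
    except subprocess.TimeoutExpired:
        return ""

def filter_correct(problem: str, reference: str, k: int = 4) -> list[str]:
    # Keep only sampled solutions whose executed answer matches the
    # benchmark's ground-truth answer. Exact string match is a
    # simplification of the paper's more forgiving answer checking.
    kept = []
    for solution in sample_solutions(problem, k):
        code = extract_code(solution)
        if code is not None and execute(code) == reference:
            kept.append(solution)
    return kept

if __name__ == "__main__":
    survivors = filter_correct("What is 2 + 2?", "4")
    print(f"kept {len(survivors)} of 4 sampled solutions")
```

The key design point this sketch captures is that correctness is established by executing the generated code against the benchmarks' ground-truth answers, so no closed-source model is needed anywhere in the loop to grade the synthetic solutions.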
- Llemma: An Open Language Model For Mathematics. arXiv.
- How is ChatGPT’s behavior changing over time? arXiv.
- Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. TMLR.
- Training Verifiers to Solve Math Word Problems. arXiv.
- TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv.
- PAL: Program-aided Language Models. In ICML.
- ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. In ICLR.
- Textbooks Are All You Need. arXiv.
- Measuring Mathematical Problem Solving With the MATH Dataset. In NeurIPS.
- Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
- Large Language Models Can Self-Improve. In EMNLP.
- Mistral 7B. arXiv.
- Mixtral of Experts. arXiv.
- NeMo: a toolkit for building AI applications using neural modules. In Systems for ML Workshop, NeurIPS.
- Solving Quantitative Reasoning Problems with Language Models. In NeurIPS.
- Textbooks Are All You Need II: phi-1.5 technical report. arXiv.
- MARIO: MAth Reasoning with code Interpreter Output - A Reproducible Pipeline. arXiv.
- Let’s Verify Step by Step. arXiv.
- TinyGSM: achieving >80% on GSM8K with small language models. arXiv:2312.09241.
- Decoupled Weight Decay Regularization. arXiv.
- WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. arXiv:2308.09583.
- Lila: A Unified Benchmark for Mathematical Reasoning. In EMNLP.
- Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv.
- GPT-4 Technical Report. arXiv.
- Code Llama: Open Foundation Models for Code. arXiv.
- Galactica: A Large Language Model for Science. arXiv.
- Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv.
- MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning. In ICLR.
- Self-consistency improves chain of thought reasoning in language models. In ICLR.
- Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
- WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv.
- Outcome-supervised Verifiers for Planning in Mathematical Reasoning. arXiv.
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. In ICLR.
- Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. arXiv.
- MAmmoTH: Building math generalist models through hybrid instruction tuning. In ICLR.
- STaR: Bootstrapping Reasoning With Reasoning. In NeurIPS.
- Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification. In ICLR.