We introduce MAmmoTH, a series of open-source LLMs specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction-tuning dataset. MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales and ensures extensive coverage of diverse fields in math. The hybrid of CoT and PoT not only unleashes the potential of tool use but also allows different thought processes for different math problems. As a result, the MAmmoTH series substantially outperforms existing open-source models on nine mathematical reasoning datasets across all scales, with an average accuracy gain between 16% and 32%. Remarkably, our MAmmoTH-7B model reaches 33% on MATH (a competition-level dataset), exceeding the best open-source 7B model (WizardMath) by 23%, and the MAmmoTH-34B model achieves 44% accuracy on MATH, even surpassing GPT-4's CoT result. Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models.
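To make the distinction concrete: a CoT rationale reasons step by step in natural language, while a PoT rationale expresses the reasoning as an executable program and delegates the arithmetic to an interpreter. The word problem and solution below are illustrative only, not drawn from MathInstruct:

```python
# Problem: A store sells pens at $3 each. Sam buys 7 pens and
# pays with a $50 bill. How much change does he receive?

# Program-of-thought (PoT) rationale: each reasoning step becomes
# a line of code, so the interpreter performs the computation
# instead of the language model.
price_per_pen = 3
num_pens = 7
payment = 50

total_cost = price_per_pen * num_pens  # cost of all pens
change = payment - total_cost          # change returned to Sam
print(change)
```

For this routine arithmetic a PoT rationale is reliable because the interpreter cannot miscalculate, whereas a CoT rationale remains preferable for problems requiring abstract reasoning with little computation; the hybrid training mix lets the model pick whichever style suits the problem.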