Emergent Mind

Abstract

We introduce MAmmoTH, a series of open-source LLMs specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset. MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math. The hybrid of CoT and PoT not only unleashes the potential of tool use but also allows different thought processes for different math problems. As a result, the MAmmoTH series substantially outperforms existing open-source models on nine mathematical reasoning datasets across all scales, with an average accuracy gain between 16% and 32%. Remarkably, our MAmmoTH-7B model reaches 33% on MATH (a competition-level dataset), which exceeds the best open-source 7B model (WizardMath) by 23%, and the MAmmoTH-34B model achieves 44% accuracy on MATH, even surpassing GPT-4's CoT result. Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models.
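To make the CoT/PoT distinction concrete, here is a minimal sketch (the problem and rationales are invented for illustration, not drawn from MathInstruct): the same word problem answered with a natural-language chain-of-thought rationale versus a program-of-thought rationale, where the model emits executable code and the final answer comes from running it, offloading arithmetic to the interpreter.

```python
# Hypothetical example: one word problem, two rationale styles.
problem = "A store sells pens at $3 each. After a 20% discount, how much do 15 pens cost?"

# CoT rationale: reasoning in natural language; arithmetic is done "in the head",
# so a slip anywhere in the chain propagates to the final answer.
cot_rationale = (
    "Each pen costs $3, so 15 pens cost 15 * 3 = $45. "
    "A 20% discount removes 0.20 * 45 = $9, leaving 45 - 9 = $36."
)

# PoT rationale: the reasoning is a short program; executing it yields the answer,
# so the arithmetic is delegated to the Python interpreter (a form of tool use).
pot_rationale = """
price_per_pen = 3
quantity = 15
discount = 0.20
answer = price_per_pen * quantity * (1 - discount)
"""

scope = {}
exec(pot_rationale, scope)  # in practice this would run in a sandboxed interpreter
print(scope["answer"])      # 36.0
```

A hybrid-trained model can pick whichever style suits the problem: PoT for computation-heavy questions, CoT where the reasoning is symbolic or hard to express as code.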


References
  1. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  2357–2367, 2019. doi: 10.18653/v1/N19-1245. https://aclanthology.org/N19-1245.

  2. PaLM 2 Technical Report
  3. Constitutional AI: Harmlessness from AI Feedback
  4. Evaluating Large Language Models Trained on Code
  5. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
  6. TheoremQA: A Theorem-driven Question Answering dataset
  7. Scaling Instruction-Finetuned Language Models
  8. Training Verifiers to Solve Math Word Problems
  9. Advancing mathematics by guiding human intuition with AI. Nature, 600(7887):70–74, 2021. https://www.nature.com/articles/s41586-021-04086-x.

  10. QLoRA: Efficient Finetuning of Quantized LLMs
  11. Compositional semantic parsing with LLMs. International Conference on Learning Representations (ICLR), 2023. https://openreview.net/forum?id=gJW8hSGBys8.

  12. PAL: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023. https://proceedings.mlr.press/v202/gao23f/gao23f.pdf.

  13. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
  14. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021a. https://openreview.net/forum?id=d7KBjmI3GmQ.

  15. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021b. https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/be83ab3ecd0db773eb2dc1b0a17836a1-Paper-round2.pdf.

  16. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  523–533, 2014. doi: 10.3115/v1/D14-1058. https://aclanthology.org/D14-1058.

  17. Large language models are zero-shot reasoners. NeurIPS, 2022.
  18. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597, 2015. doi: 10.1162/tacl_a_00160. https://aclanthology.org/Q15-1042.

  19. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  1152–1157, 2016. doi: 10.18653/v1/N16-1136. https://aclanthology.org/N16-1136.

  20. Platypus: Quick, Cheap, and Powerful Refinement of LLMs
  21. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022. https://openreview.net/pdf?id=IFXTZERXdM7.

  22. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
  23. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  5315–5333, 2023b. https://aclanthology.org/2023.acl-long.291.pdf.

  24. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  158–167, 2017. doi: 10.18653/v1/P17-1015. https://aclanthology.org/P17-1015.

  25. The flan collection: Designing data and methods for effective instruction tuning. ICML, 2023. https://openreview.net/pdf?id=ZX4uS605XV.

  26. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
  27. Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  1384–1403, 2022. https://aclanthology.org/2022.emnlp-main.90.pdf.

  28. LILA: A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  5807–5832, 2022a. https://aclanthology.org/2022.emnlp-main.392.

  29. NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  3505–3523, 2022b. doi: 10.18653/v1/2022.acl-long.246. https://aclanthology.org/2022.acl-long.246.

  30. Orca: Progressive Learning from Complex Explanation Traces of GPT-4
  31. Codegen: An open large language model for code with multi-turn program synthesis. In International Conference on Learning Representations (ICLR), 2023. https://openreview.net/pdf?id=iaYcJKpY2B_.

  32. Show Your Work: Scratchpads for Intermediate Computation with Language Models
  33. GPT-4 Technical Report
  34. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2080–2094, 2021. doi: 10.18653/v1/2021.naacl-main.168. https://aclanthology.org/2021.naacl-main.168.

  35. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
  36. Instruction Tuning with GPT-4
  37. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.  1–16. IEEE, 2020. https://dl.acm.org/doi/10.5555/3433701.3433727.
  38. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.  1743–1752, 2015. doi: 10.18653/v1/D15-1202. https://aclanthology.org/D15-1202.

  39. Code Llama: Open Foundation Models for Code
  40. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022. https://openreview.net/forum?id=9Vrb9D0WI4.

  41. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
  42. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca

  43. Galactica: A Large Language Model for Science
  44. LLaMA: Open and Efficient Foundation Language Models
  45. Llama 2: Open Foundation and Fine-Tuned Chat Models
  46. Iteratively prompt pre-trained language models for chain of thought. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  2714–2730. Association for Computational Linguistics, 2022a. https://aclanthology.org/2022.emnlp-main.174.

  47. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  2717–2739. Association for Computational Linguistics, 2023a. doi: 10.18653/v1/2023.acl-long.153. https://aclanthology.org/2023.acl-long.153.

  48. Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate
  49. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
  50. Making Large Language Models Better Reasoners with Alignment
  51. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models
  52. Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations (ICLR), 2023f. https://openreview.net/pdf?id=1PL1NIMMrw.

  53. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  5085–5109, 2022b. https://aclanthology.org/2022.emnlp-main.340.

  54. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
  55. Self-instruct: Aligning language models with self-generated instructions. The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023h. https://aclanthology.org/2023.acl-long.754.pdf.

  56. CodeT5+: Open Code Large Language Models for Code Understanding and Generation
  57. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022a. https://openreview.net/forum?id=gEZrGCozdqR.

  58. Chain-of-thought prompting elicits reasoning in LLMs. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b. https://openreview.net/pdf?id=_VjQlMeSB_J.

  59. Simple synthetic data reduces sycophancy in large language models
  60. HuggingFace's Transformers: State-of-the-art Natural Language Processing
  61. An explanation of in-context learning as implicit bayesian inference. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022. https://openreview.net/forum?id=RdJVFCHjUMI.

  62. Self-Evaluation Guided Beam Search for Reasoning
  63. WizardLM: Empowering Large Language Models to Follow Complex Instructions
  64. GPT Can Solve Mathematical Problems Without a Calculator
  65. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. https://openreview.net/pdf?id=WE_vluYUL-X.

  66. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  7163–7189, 2021. doi: 10.18653/v1/2021.emnlp-main.572. https://aclanthology.org/2021.emnlp-main.572.

  67. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
  68. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
  69. OPT: Open Pre-trained Transformer Language Models
  70. Progressive-Hint Prompting Improves Reasoning in Large Language Models
  71. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  72. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
  73. Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
  74. LIMA: Less Is More for Alignment
  75. Least-to-most prompting enables complex reasoning in LLMs. International Conference on Learning Representations (ICLR), 2023c. https://openreview.net/pdf?id=WZH7099tgfM.
