Common 7B Language Models Already Possess Strong Math Capabilities (2403.04706v1)
Abstract: Mathematical capabilities were previously believed to emerge in common LLMs only at a very large scale or require extensive math-related pre-training. This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities, as evidenced by its impressive accuracy of 97.7% and 72.0% on the GSM8K and MATH benchmarks, respectively, when selecting the best response from 256 random generations. The primary issue with the current base model is the difficulty in consistently eliciting its inherent mathematical capabilities. Notably, the accuracy for the first answer drops to 49.5% and 7.9% on the GSM8K and MATH benchmarks, respectively. We find that simply scaling up the SFT data can significantly enhance the reliability of generating correct answers. However, the potential for extensive scaling is constrained by the scarcity of publicly available math questions. To overcome this limitation, we employ synthetic data, which proves to be nearly as effective as real data and shows no clear saturation when scaled up to approximately one million samples. This straightforward approach achieves an accuracy of 82.6% on GSM8K and 40.6% on MATH using LLaMA-2 7B models, surpassing previous models by 14.2% and 20.8%, respectively. We also provide insights into scaling behaviors across different reasoning complexities and error types.
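As a rough illustration of the evaluation described in the abstract, the sketch below contrasts first-answer accuracy with best-of-N accuracy, where a problem counts as solved under best-of-N if any of the N sampled solutions reaches the gold answer. This is a minimal sketch, not the paper's evaluation code: the `generate(question, n)` sampling function is hypothetical, and the last-number answer-extraction heuristic is only a common GSM8K-style convention assumed here.

```python
import re


def extract_final_answer(completion: str) -> str | None:
    """Pull the last number from a generated solution (a common GSM8K-style heuristic, assumed here)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None


def pass_at_n(samples: list[str], gold_answer: str) -> bool:
    """True if any of the N sampled solutions reaches the gold answer."""
    return any(extract_final_answer(s) == gold_answer for s in samples)


def greedy_and_best_of_n_accuracy(problems, generate, n: int = 256):
    """Compare first-answer accuracy with best-of-n accuracy.

    `problems` is a list of (question, gold_answer) pairs and `generate`
    is a hypothetical sampling function returning n completions per question.
    """
    greedy_hits = 0
    best_of_n_hits = 0
    for question, gold in problems:
        samples = generate(question, n=n)            # n stochastic generations
        if extract_final_answer(samples[0]) == gold:
            greedy_hits += 1                         # "first answer" metric
        if pass_at_n(samples, gold):
            best_of_n_hits += 1                      # best-of-n metric
    total = len(problems)
    return greedy_hits / total, best_of_n_hits / total
```

Under this kind of measurement, the gap between the two returned numbers (e.g., 49.5% vs. 97.7% on GSM8K in the paper) is what the abstract frames as the base model's latent capability versus the difficulty of eliciting it reliably.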