
Common 7B Language Models Already Possess Strong Math Capabilities (2403.04706v1)

Published 7 Mar 2024 in cs.CL and cs.AI

Abstract: Mathematical capabilities were previously believed to emerge in common LLMs only at a very large scale or require extensive math-related pre-training. This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities, as evidenced by its impressive accuracy of 97.7% and 72.0% on the GSM8K and MATH benchmarks, respectively, when selecting the best response from 256 random generations. The primary issue with the current base model is the difficulty in consistently eliciting its inherent mathematical capabilities. Notably, the accuracy for the first answer drops to 49.5% and 7.9% on the GSM8K and MATH benchmarks, respectively. We find that simply scaling up the SFT data can significantly enhance the reliability of generating correct answers. However, the potential for extensive scaling is constrained by the scarcity of publicly available math questions. To overcome this limitation, we employ synthetic data, which proves to be nearly as effective as real data and shows no clear saturation when scaled up to approximately one million samples. This straightforward approach achieves an accuracy of 82.6% on GSM8K and 40.6% on MATH using LLaMA-2 7B models, surpassing previous models by 14.2% and 20.8%, respectively. We also provide insights into scaling behaviors across different reasoning complexities and error types.


Summary

  • The paper shows that synthetic data scaling significantly boosts 7B model performance on math benchmarks by mitigating solution instability.
  • The study uses Pass@256 and PassRatio@256 metrics to highlight the model’s potential capabilities versus its initial answer inconsistencies.
  • Leveraging nearly one million synthetic SFT samples, the research achieves state-of-the-art math accuracy without relying on math-specific pre-training.

Enhancing Mathematical Capabilities of 7B LLMs with Synthetic Data Scaling

Introduction

Mathematical reasoning in LLMs has typically been treated as an emergent capability of models with tens of billions of parameters or more. Earlier work suggested that meaningful performance on mathematical benchmarks required either that scale or extensive pre-training on mathematical corpora. This paper challenges that view by showing that a comparatively small model, LLaMA-2 7B, already has strong mathematical capabilities without math-centric pre-training. The central insight is that the problem is not a lack of capability but instability: the model does not consistently generate the correct solutions it is able to produce. The authors address this with synthetic data, which improves performance on two major mathematical benchmarks, GSM8K and MATH.

Understanding Mathematical Capabilities in LLaMA-2 7B

The authors begin by analyzing LLaMA-2 7B on the GSM8K and MATH benchmarks using two metrics, Pass@N and PassRatio@N. Pass@N counts a question as solved if at least one of N sampled answers is correct, while PassRatio@N measures the fraction of the N answers that are correct. The two metrics separate potential from reliability. When the best of 256 generations is selected, accuracy reaches 97.7% on GSM8K and 72.0% on MATH, surpassing contemporary models on GSM8K and remaining competitive on MATH. The first answer, by contrast, is correct only 49.5% and 7.9% of the time, respectively. This gap indicates that the capability is present but unstable.
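The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names and the input layout (one list of correctness flags per question, one flag per sampled answer) are assumptions made here, and the toy grid is invented purely to exercise the functions.

```python
from typing import List, Sequence


def pass_at_n(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions with at least one correct answer among their samples."""
    if not correct:
        raise ValueError("need at least one question")
    return sum(any(row) for row in correct) / len(correct)


def pass_ratio_at_n(correct: Sequence[Sequence[bool]]) -> float:
    """Mean, over questions, of the fraction of sampled answers that are correct."""
    if not correct or any(len(row) == 0 for row in correct):
        raise ValueError("every question needs at least one sampled answer")
    per_question: List[float] = [sum(row) / len(row) for row in correct]
    return sum(per_question) / len(per_question)


# Hypothetical toy grid: 3 questions, 4 sampled answers each (not data from the paper).
toy = [
    [True, False, False, False],   # solvable, but rarely answered correctly
    [False, False, False, False],  # never solved
    [True, True, False, True],     # usually answered correctly
]
print(pass_at_n(toy), pass_ratio_at_n(toy))
```

On the toy grid, Pass@4 is 2/3 while PassRatio@4 is 1/3: two of three questions are solvable by some sample, yet only a third of all sampled answers are correct. That is the same kind of gap the paper reports between best-of-256 and first-answer accuracy.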

Synthetic Data Scaling to Mitigate Instability

The paper argues that this instability can be substantially reduced by scaling up supervised fine-tuning (SFT) data. The authors observe that accuracy keeps improving linearly, or even super-linearly, as SFT data grows, with no clear saturation. Because publicly available math questions are too scarce to scale further, the authors generate synthetic questions with GPT-4 Turbo. The synthetic data proves nearly as effective as real data, which points to its quality and relevance.

The authors scale SFT data up to approximately one million samples. With this data, LLaMA-2 7B models reach 82.6% accuracy on GSM8K and 40.6% on MATH, surpassing previous models by 14.2% and 20.8%, respectively. These results indicate that the instability issue can be substantially reduced by scaling SFT data.
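One simple way to check a scaling curve for saturation is to look at the accuracy gained per doubling of SFT data between consecutive measurements. The sketch below assumes accuracy is tracked against the logarithm of the data size, which is a choice made here for illustration and is not specified in the summary above. The function name and input layout are hypothetical, and no measurements from the paper are included.

```python
import math
from typing import List, Sequence


def gain_per_doubling(sizes: Sequence[float], accuracies: Sequence[float]) -> List[float]:
    """Accuracy gain per doubling of data between consecutive (size, accuracy) points.

    `sizes` must be positive and strictly increasing; `accuracies` are in
    whatever unit the caller uses (for example, percentage points).
    """
    if len(sizes) != len(accuracies) or len(sizes) < 2:
        raise ValueError("need at least two (size, accuracy) pairs of equal length")
    points = list(zip(sizes, accuracies))
    gains: List[float] = []
    for (n0, a0), (n1, a1) in zip(points, points[1:]):
        if not 0 < n0 < n1:
            raise ValueError("sizes must be positive and strictly increasing")
        # log2(n1 / n0) is the number of doublings between the two measurements.
        gains.append((a1 - a0) / math.log2(n1 / n0))
    return gains
```

Roughly constant or growing values across the list are consistent with no saturation over the measured range, while values shrinking toward zero suggest the curve is flattening.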

Implications and Future Directions

The implications extend beyond mathematical ability. The results argue against the assumption that very large models or domain-specific pre-training are necessary for high performance on domain-specific tasks, and they show that synthetic data can surface and strengthen capabilities a model already has.

Looking forward, synthetic SFT data scaling opens avenues for research in other domains and invites a reevaluation of how the potential of LLMs is unlocked. Future work might apply the approach to specialized areas beyond mathematics.

In conclusion, the paper shows that synthetic data scaling improves the mathematical capabilities of LLaMA-2 7B, challenges existing beliefs about the scale and pre-training such capabilities require, and sets a precedent for using synthetic data in other domains.


Tweets

This paper has been mentioned in 20 tweets and received 498 likes.
