Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards (2404.10346v4)
Abstract: Training on large amounts of rationales (i.e., CoT fine-tuning) is effective at improving the reasoning capabilities of LLMs. However, acquiring human-authored rationales or augmenting rationales from proprietary models is costly and not scalable. In this paper, we study whether LLMs can self-improve their reasoning capabilities. To this end, we propose Self-Explore, where the LLM is tasked to explore the first wrong step (i.e., the first pit) within the rationale and to use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test sets, Self-Explore achieves 11.57% and 2.89% average improvement across three LLMs compared to supervised fine-tuning (SFT). Our code is available at https://github.com/hbin0701/Self-Explore.
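The exploration step described in the abstract, locating the first wrong step of a rejected rationale, can be sketched roughly as follows: sample a few continuations from each step prefix and mark the first step whose continuations never reach the gold answer as the "first pit". This is a minimal sketch only; the `sample_completions` rollout function, the `reaches_gold` answer checker, and the toy sampler in the example are hypothetical stand-ins, not the paper's actual implementation.

```python
# Sketch of "first pit" search: roll out completions from each step prefix of a
# rejected rationale and return the first step index from which no rollout
# recovers the gold answer. Hypothetical helpers, not the authors' code.

from typing import Callable, List, Optional


def find_first_pit(
    steps: List[str],                                      # steps of a rejected rationale
    sample_completions: Callable[[str, int], List[str]],   # prefix -> k sampled rollouts
    reaches_gold: Callable[[str], bool],                   # does a rollout reach the gold answer?
    k: int = 4,
) -> Optional[int]:
    """Return the index of the first step after which no sampled rollout succeeds."""
    for i in range(len(steps)):
        prefix = "\n".join(steps[: i + 1])
        rollouts = sample_completions(prefix, k)
        if not any(reaches_gold(r) for r in rollouts):
            return i  # step i is the "first pit"
    return None  # every prefix can still recover; no pit found


if __name__ == "__main__":
    # Toy illustration with a fake sampler: rollouts fail once the wrong step is in the prefix.
    steps = ["Step 1: ...", "Step 2: ...", "Step 3: wrong deduction", "Step 4: ..."]

    def fake_sampler(prefix: str, k: int) -> List[str]:
        return ["... answer is 42"] * k if "wrong" not in prefix else ["... answer is 7"] * k

    print(find_first_pit(steps, fake_sampler, lambda r: r.endswith("42")))  # -> 2
```

The returned index could then be used to build fine-grained preference pairs (prefix up to the pit as context, correct vs. incorrect continuation) for preference optimization, in the spirit of the fine-grained rewards the abstract describes.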