Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards (2404.10346v4)
Abstract: Training on a large number of rationales (i.e., CoT fine-tuning) is effective at improving the reasoning capabilities of LLMs. However, acquiring human-authored rationales or augmenting rationales from proprietary models is costly and does not scale. In this paper, we study whether LLMs can self-improve their reasoning capabilities. To this end, we propose Self-Explore, in which the LLM is tasked to explore the first wrong step (i.e., the first pit) within a rationale and use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test sets, Self-Explore achieves 11.57% and 2.89% average improvements across three LLMs compared to supervised fine-tuning (SFT). Our code is available at https://github.com/hbin0701/Self-Explore.
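The sketch below illustrates the "first pit" search described in the abstract: starting from an incorrect rationale, extend the prefix one step at a time and sample continuations from the model; the first step whose prefix can no longer be completed to the gold answer is the pit. This is a minimal sketch under assumed interfaces, not the authors' implementation: `find_first_pit`, `sample_fn`, and `extract_answer` are hypothetical names, and the `####` answer marker is a GSM8K-style convention used only for illustration.

```python
from typing import Callable, List, Optional


def extract_answer(rationale: str) -> str:
    """Toy answer extractor: take the text after the last '####' marker
    (GSM8K-style); purely illustrative."""
    return rationale.rsplit("####", 1)[-1].strip()


def find_first_pit(
    question: str,
    wrong_steps: List[str],                       # steps of one incorrect rationale
    gold_answer: str,
    sample_fn: Callable[[str, int], List[str]],   # (prompt, k) -> k sampled continuations
    k: int = 4,
) -> Optional[int]:
    """Return the index of the first step after which none of k sampled
    continuations reaches the gold answer (the 'first pit'), or None if
    every prefix can still be completed correctly."""
    prefix = question
    for i, step in enumerate(wrong_steps):
        prefix = prefix + "\n" + step
        completions = sample_fn(prefix, k)
        if not any(extract_answer(c) == gold_answer for c in completions):
            return i  # step i is the first pit
    return None
```

In the paper's framing, such step-level signals serve as fine-grained rewards: pairing the prefix before the pit with a correct continuation (chosen) against the pit step (rejected) yields preference data for an objective such as DPO.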
- A general theoretical paradigm to understand learning from human preferences.
- Llemma: An open language model for mathematics.
- Self-play fine-tuning converts weak language models to strong language models.
- Training verifiers to solve math word problems.
- BERT: Pre-training of deep bidirectional transformers for language understanding.
- KTO: Model alignment as prospect theoretic optimization.
- Specializing smaller language models towards multi-step reasoning.
- The false promise of imitating proprietary LLMs. arXiv preprint arXiv:2305.15717.
- Reinforced self-training (ReST) for language modeling.
- Teaching large language models to reason with reinforcement learning.
- GLoRe: When, where, and how to improve LLM reasoning via global and local refinements.
- Measuring mathematical problem solving with the MATH dataset.
- ORPO: Monolithic preference optimization without reference model.
- V-STaR: Training verifiers for self-taught reasoners.
- Camels in a changing climate: Enhancing LM adaptation with Tulu 2.
- Mistral 7B.
- Learning planning-based reasoning by trajectories collection and process reward synthesizing.
- CoTEVer: Chain of thought prompting annotation toolkit for explanation verification. arXiv preprint arXiv:2303.03628.
- The CoT Collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning.
- Understanding the effects of RLHF on LLM generalisation and diversity.
- Large language models are zero-shot reasoners.
- Efficient memory management for large language model serving with PagedAttention.
- Solving quantitative reasoning problems with language models.
- Common 7B language models already possess strong math capabilities.
- Explanations from large language models make small reasoners better.
- Let’s verify step by step.
- TinyGSM: Achieving >80% on GSM8K with small language models.
- Don’t throw away your value model! Making PPO even better via value-guided Monte-Carlo tree search decoding.
- The Flan Collection: Designing data and methods for effective instruction tuning.
- WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct.
- Orca 2: Teaching small language models how to reason.
- Orca-Math: Unlocking the potential of SLMs in grade school math. arXiv preprint arXiv:2402.14830.
- Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
- Learning math reasoning from self-sampled correct and partially-correct solutions.
- GPT-4 technical report.
- Smaug: Fixing failure modes of preference optimisation with DPO-Positive.
- Direct preference optimization: Your language model is secretly a reward model.
- Proximal policy optimization algorithms.
- DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.
- Does knowledge distillation really work?
- Gemini: A family of highly capable multimodal models.
- OpenMathInstruct-1: A 1.8 million math instruction tuning dataset.
- Zephyr: Direct distillation of LM alignment.
- Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations.
- Self-consistency improves chain of thought reasoning in language models.
- Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision.
- Chain-of-thought prompting elicits reasoning in large language models.
- Self-evaluation guided beam search for reasoning.
- FLASK: Fine-grained language model evaluation based on alignment skill sets. arXiv preprint arXiv:2307.10928.
- Outcome-supervised verifiers for planning in mathematical reasoning.
- MetaMath: Bootstrap your own mathematical questions for large language models.
- Self-rewarding language models.
- Scaling relationship on learning mathematical reasoning with large language models.
- STaR: Bootstrapping reasoning with reasoning.