Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards (2404.10346v4)
Abstract: Training on a large number of rationales (i.e., CoT fine-tuning) is effective at improving the reasoning capabilities of LLMs. However, acquiring human-authored rationales or augmenting rationales from proprietary models is costly and does not scale. In this paper, we study whether LLMs can self-improve their reasoning capabilities. To this end, we propose Self-Explore, in which the LLM is tasked to explore the first wrong step (i.e., the first pit) within a rationale and use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test sets, Self-Explore achieves 11.57% and 2.89% average improvements across three LLMs compared to supervised fine-tuning (SFT). Our code is available at https://github.com/hbin0701/Self-Explore.
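The sketch below illustrates the "first pit" search described in the abstract: starting from an incorrect rationale, extend the prefix one step at a time and sample continuations from the model; the first step whose prefix can no longer be completed to the gold answer is the pit. This is a minimal sketch under assumed interfaces, not the authors' implementation: `find_first_pit`, `sample_fn`, and `extract_answer` are hypothetical names, and the `####` answer marker is a GSM8K-style convention used only for illustration.

```python
from typing import Callable, List, Optional


def extract_answer(rationale: str) -> str:
    """Toy answer extractor: take the text after the last '####' marker
    (GSM8K-style); purely illustrative."""
    return rationale.rsplit("####", 1)[-1].strip()


def find_first_pit(
    question: str,
    wrong_steps: List[str],                       # steps of one incorrect rationale
    gold_answer: str,
    sample_fn: Callable[[str, int], List[str]],   # (prompt, k) -> k sampled continuations
    k: int = 4,
) -> Optional[int]:
    """Return the index of the first step after which none of k sampled
    continuations reaches the gold answer (the 'first pit'), or None if
    every prefix can still be completed correctly."""
    prefix = question
    for i, step in enumerate(wrong_steps):
        prefix = prefix + "\n" + step
        completions = sample_fn(prefix, k)
        if not any(extract_answer(c) == gold_answer for c in completions):
            return i  # step i is the first pit
    return None
```

In the paper's framing, such step-level signals serve as fine-grained rewards: pairing the prefix before the pit with a correct continuation (chosen) against the pit step (rejected) yields preference data for an objective such as DPO.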
- A general theoretical paradigm to understand learning from human preferences.
- Llemma: An open language model for mathematics.
- Self-play fine-tuning converts weak language models to strong language models.
- Training verifiers to solve math word problems.
- BERT: Pre-training of deep bidirectional transformers for language understanding.
- KTO: Model alignment as prospect theoretic optimization.
- Specializing smaller language models towards multi-step reasoning.
- The false promise of imitating proprietary LLMs. arXiv preprint arXiv:2305.15717.
- Reinforced self-training (ReST) for language modeling.
- Teaching large language models to reason with reinforcement learning.
- GLoRe: When, where, and how to improve LLM reasoning via global and local refinements.
- Measuring mathematical problem solving with the MATH dataset.
- ORPO: Monolithic preference optimization without reference model.
- V-STaR: Training verifiers for self-taught reasoners.
- Camels in a changing climate: Enhancing LM adaptation with Tulu 2.
- Mistral 7B.
- Learning planning-based reasoning by trajectories collection and process reward synthesizing.
- CoTEVer: Chain of thought prompting annotation toolkit for explanation verification. arXiv preprint arXiv:2303.03628.
- The CoT Collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning.
- Understanding the effects of RLHF on LLM generalisation and diversity.
- Large language models are zero-shot reasoners.
- Efficient memory management for large language model serving with PagedAttention.
- Solving quantitative reasoning problems with language models.
- Common 7B language models already possess strong math capabilities.
- Explanations from large language models make small reasoners better.
- Let’s verify step by step.
- TinyGSM: Achieving >80% on GSM8K with small language models.
- Don’t throw away your value model! Making PPO even better via value-guided Monte-Carlo tree search decoding.
- The Flan Collection: Designing data and methods for effective instruction tuning.
- WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct.
- Orca 2: Teaching small language models how to reason.
- Orca-Math: Unlocking the potential of SLMs in grade school math. arXiv preprint arXiv:2402.14830.
- Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
- Learning math reasoning from self-sampled correct and partially-correct solutions.
- GPT-4 technical report.
- Smaug: Fixing failure modes of preference optimisation with DPO-Positive.
- Direct preference optimization: Your language model is secretly a reward model.
- Proximal policy optimization algorithms.
- DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.
- Does knowledge distillation really work?
- Gemini: A family of highly capable multimodal models.
- OpenMathInstruct-1: A 1.8 million math instruction tuning dataset.
- Zephyr: Direct distillation of LM alignment.
- Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations.
- Self-consistency improves chain of thought reasoning in language models.
- Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision.
- Chain-of-thought prompting elicits reasoning in large language models.
- Self-evaluation guided beam search for reasoning.
- FLASK: Fine-grained language model evaluation based on alignment skill sets. arXiv preprint arXiv:2307.10928.
- Outcome-supervised verifiers for planning in mathematical reasoning.
- MetaMath: Bootstrap your own mathematical questions for large language models.
- Self-rewarding language models.
- Scaling relationship on learning mathematical reasoning with large language models.
- STaR: Bootstrapping reasoning with reasoning.