Abstract

Training on large amounts of rationales (i.e., CoT Fine-tuning) is effective at improving the reasoning capabilities of LLMs. However, acquiring human-authored rationales or augmenting rationales from proprietary models is costly and not scalable. In this paper, we study the problem of whether LLMs could self-improve their reasoning capabilities. To this end, we propose Self-Explore, where the LLM is tasked to explore the first wrong step (i.e., the first pit) within the rationale and use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test set, Self-Explore achieves 11.57% and 2.89% improvement on average across three LLMs compared to supervised fine-tuning (SFT). Our code is available at https://github.com/hbin0701/Self-Explore.

Figure: Comparison of three models on the GSM8K and MATH benchmarks; Self-Explore outperforms the baselines, with the best 4-shot results highlighted.

Overview

  • Self-Explore methodology enhances LLM reasoning through self-generated rationales and fine-grained rewards, addressing the high cost and scalability issues of traditional methods.

  • It involves a two-step process of step-level exploration and subsequent fine-tuning based on positive and negative rationales, improving reasoning capabilities without proprietary model distillation.

  • Empirical evaluations on GSM8K and MATH datasets show significant improvements over Supervised Fine-Tuning (SFT), demonstrating the effectiveness of the Self-Explore approach.

  • The approach suggests future directions for leveraging self-training mechanisms to improve LLMs in mathematical reasoning and potentially other cognitive domains.

Enhancing LLMs' Reasoning Capabilities Through Self-Training: An Insight into Self-Explore

Introduction to Self-Explore

The development of LLMs has increasingly focused on improving reasoning capabilities through various means, including Chain-of-Thought prompting and fine-tuning with human-authored rationales. Despite the effectiveness of these methods, they are often hindered by the high costs and scalability issues associated with generating and acquiring high-quality rationales. Addressing this challenge, the Self-Explore methodology presents a novel approach to enhance the reasoning faculties of LLMs through self-improvement, leveraging fine-grained rewards derived from the model's own generated rationales.

Methodological Overview

Self-Explore operates in two stages. First, the model performs step-level exploration within its own generated rationales to locate the first wrong step (the "first pit"). Second, these signals are used to build a pairwise dataset of positive and negative step-level samples, which is then trained with a preference learning objective, refining the model's reasoning path at a granular level (a sketch of this procedure is given below). Notably, Self-Explore yields consistent improvements across three distinct LLMs without relying on distillation from proprietary models.
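
A minimal sketch of how this step-level exploration and pair construction might look is given below. It assumes a hypothetical `sample_fn(prompt, n)` that returns n sampled continuations from the fine-tuned model, and a simplified answer check; neither is the authors' actual API, and details such as answer parsing are stand-ins.

```python
def answer_matches(completion: str, gold_answer: str) -> bool:
    """Simplified answer check: compare the last non-empty line of a completion
    to the gold answer (real pipelines parse a marker such as '#### <answer>')."""
    lines = [l.strip() for l in completion.strip().splitlines() if l.strip()]
    return bool(lines) and lines[-1] == gold_answer.strip()


def find_first_pit(sample_fn, question, wrong_steps, gold_answer, k=4):
    """Return the index of the first step in an incorrect rationale from which
    none of k sampled continuations recovers the correct answer."""
    for i in range(1, len(wrong_steps) + 1):
        prefix = "\n".join(wrong_steps[:i])
        continuations = sample_fn(f"{question}\n{prefix}", n=k)
        if not any(answer_matches(c, gold_answer) for c in continuations):
            return i - 1  # the step at this index is treated as the first pit
    return None  # every prefix can still be completed correctly


def make_step_level_pair(question, wrong_steps, pit_idx, correct_continuation):
    """Build one preference pair: both candidates share the prefix before the
    first pit; 'rejected' resumes at the pit, 'chosen' reaches the answer."""
    prefix = "\n".join(wrong_steps[:pit_idx])
    return {
        "prompt": f"{question}\n{prefix}".strip(),
        "chosen": correct_continuation,
        "rejected": "\n".join(wrong_steps[pit_idx:]),
    }
```

In such a setup, the chosen continuation would typically be a sampled completion from the exploration pass that does reach the gold answer, so that the two candidates differ only from the first pit onward.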

Empirical Evaluations

Self-Explore was evaluated on the GSM8K and MATH datasets, where it outperformed standard Supervised Fine-Tuning (SFT) across all models: Mistral-7B, Llemma-7B, and DeepSeek-Math-7B improved by 13.19%, 10.23%, and 11.30% on GSM8K and by 1.98%, 3.16%, and 3.54% on MATH, respectively. These results underscore the method's effectiveness, particularly compared to approaches based solely on outcome-level supervision.

Theoretical Implications and Future Directions

The advent of Self-Explore not only advances the capabilities of LLMs in processing complex reasoning tasks but also illuminates the potential of self-training mechanisms in circumventing the limitations posed by the acquisition of high-quality training data. The approach suggests a promising trajectory towards realizing more autonomous and efficient methods for improving LLMs, potentially extending beyond mathematical reasoning to broader cognitive domains.

Furthermore, the methodology demonstrates the utility of fine-grained, step-level feedback in refining the reasoning processes of LLMs. By focusing on the first incorrect step in a rationale, Self-Explore provides a more targeted learning signal than general outcome-based supervision (see the sketch below for how such step-level pairs can feed a preference learning objective). This level of detail could inspire future work to apply similar fine-grained approaches in other domains or to other types of reasoning tasks, broadening the scope of self-improvement in AI.
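
As an illustration of how step-level pairs can be consumed, the sketch below applies the standard Direct Preference Optimization (DPO) loss, one common preference learning objective, to them. The inputs are assumed to be per-example sums of token log-probabilities over the continuation tokens under the policy and a frozen reference model; this is a generic sketch under those assumptions, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over step-level preference pairs.

    Each pair shares a prompt (question + prefix before the first pit);
    'chosen' is a continuation that reaches the correct answer, 'rejected'
    is the continuation beginning at the first pit. All inputs are 1-D
    tensors of summed log-probabilities for the continuation tokens.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```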

Conclusion

Self-Explore represents a significant stride towards enhancing the reasoning capabilities of LLMs through self-improvement. By efficiently leveraging the model's own generated rationales for fine-tuning, it not only overcomes the practical challenges associated with rationale acquisition but also sets a precedent for future research in self-training methodologies. As we continue to explore these avenues, the potential for developing more nuanced and autonomous LLMs becomes increasingly tangible, promising new frontiers in the realm of artificial intelligence reasoning capabilities.
