Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs (2407.03181v2)

Published 3 Jul 2024 in cs.CL

Abstract: Requiring a LLM to generate intermediary reasoning steps, known as Chain of Thought (CoT), has been shown to be an effective way of boosting performance. Previous approaches have focused on generating multiple independent CoTs, combining them through ensembling or other post-hoc strategies to enhance reasoning. In this work, we introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step, which is fundamentally different from prior work that primarily operate on parallel CoT generations. DCoT allows LLMs to gain the ability to perform within-inference refinement of reasoning chains without requiring external feedback. Through a rigorous set of experiments spanning a wide range of tasks that require various reasoning types, we show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales (1.3B to 70B). These improvements are particularly impactful for tasks with a large result state space, such as those involving numeric answers. Our work is also significant because both quantitative analyses and manual evaluations reveal the observed gains stem from the models' ability to refine an initial reasoning chain by generating a second, improved chain within the same inference step, demonstrating previously elusive self-improvement. Our code and data are publicly available at https://github.com/UKPLab/acl2025-diverse-cot.

Citations (4)

Summary

  • The paper introduces Diverse Chains of Thought (DCoT), a fine-tuning method that enables LLMs to generate multiple reasoning paths and self-correct within a single inference for enhanced accuracy.
  • The paper demonstrates that even smaller LLMs benefit from DCoT, achieving significant performance improvements across tasks including mathematics and multi-hop reasoning.
  • The paper highlights DCoT's potential to democratize AI by enabling self-correction without external feedback, thereby broadening the applicability of LLMs.

Fine-Tuning with Diverse Chains of Thought Boosts Reasoning Through Self-Correction in LLMs

This paper introduces Diverse Chains of Thought (DCoT), a novel approach to enhancing the reasoning capabilities of LLMs. The method builds on Chain of Thought (CoT) prompting, which improves performance by generating intermediate reasoning steps, and advances it by having the model generate and compare multiple reasoning chains within a single inference step, potentially increasing the accuracy of the final answers.

Methodology

The key innovation of the DCoT framework is its ability to instruct LLMs to produce several diverse reasoning paths before arriving at a final decision. The design is inspired by the cognitive theories of Divergent and Convergent Thinking, which describe a multi-phase approach to problem-solving: generating numerous candidate ideas (the divergent phase) and then synthesizing them into a single solution (the convergent phase).

For implementation, DCoT requires fine-tuning models on datasets that contain multiple reasoning chains per question, so the model learns to generate several candidate solutions and select among them. This addresses a limitation of prior approaches, which obtain multiple reasoning chains only through separate, parallel generations combined post hoc, rather than producing and refining them within a single inference.
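To make the fine-tuning setup concrete, the following is a minimal sketch of how a DCoT-style training example could be assembled, assuming a format in which several annotated reasoning chains and one final answer are concatenated into a single target sequence. The tag strings, prompt wording, and the choice of two chains are illustrative assumptions rather than the paper's exact data format; the linked repository contains the actual implementation.

```python
# Minimal sketch of assembling a DCoT-style fine-tuning example.
# The "[CoT i]" / "[Final answer]" tags, the prompt wording, and the use of
# two chains are illustrative assumptions, not the paper's exact format;
# see https://github.com/UKPLab/acl2025-diverse-cot for the released code.

def build_dcot_example(question: str, reasoning_chains: list[str], final_answer: str) -> dict:
    """Pack several diverse reasoning chains plus one final answer into a
    single target sequence, so the model learns to produce all chains
    (divergent phase) and then commit to one answer (convergent phase)
    within a single inference step."""
    parts = [f"[CoT {i}]\n{chain}" for i, chain in enumerate(reasoning_chains, start=1)]
    parts.append(f"[Final answer]\n{final_answer}")
    return {
        "prompt": (
            f"Question: {question}\n"
            f"Give {len(reasoning_chains)} different chains of thought, then answer."
        ),
        "target": "\n\n".join(parts),
    }

# Toy example with two alternative derivations of the same numeric answer.
example = build_dcot_example(
    question="A book costs 12 dollars and a pen costs 3 dollars. "
             "How much do 2 books and 4 pens cost?",
    reasoning_chains=[
        "2 books cost 2 * 12 = 24 dollars and 4 pens cost 4 * 3 = 12 dollars, "
        "so the total is 24 + 12 = 36 dollars.",
        "One book with two pens costs 12 + 2 * 3 = 18 dollars, and two such "
        "bundles make 2 books and 4 pens, so the total is 2 * 18 = 36 dollars.",
    ],
    final_answer="36 dollars",
)
print(example["target"])
```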

Results

The experiments spanned models ranging from 1.3B to 70B parameters and demonstrated consistent improvements over the CoT baseline. Notably, the results show that even smaller, more accessible LLMs benefit from this fine-tuning approach. The performance gains held across a wide variety of tasks, indicating the method's broad applicability.

Quantitatively, the work showed performance improvements across various datasets, including mathematics, logic, and multi-hop reasoning tasks. Furthermore, DCoT allowed models to improve their accuracy without any external feedback by refining an initial reasoning chain within the same inference step, a self-correcting capability that had previously proved elusive.
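As a rough illustration of how such within-inference refinement could be consumed at test time, the snippet below parses a DCoT-style output and keeps only the answer committed to after the final chain. It assumes the tagged output format from the sketch above; both the tags and the parsing heuristics are assumptions rather than the authors' evaluation code.

```python
# Sketch of extracting the final (possibly refined) answer from a DCoT output.
# Assumes the tagged format from the previous sketch; not the authors' code.
import re

def answer_from_dcot_output(generated_text: str) -> str:
    """Return the answer the model commits to after its last reasoning chain.
    Later chains may revise earlier ones, so only the [Final answer] section
    (or, failing that, the text after the last chain tag) is kept."""
    match = re.search(r"\[Final answer\]\s*(.+)", generated_text, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fallback: keep whatever follows the last "[CoT k]" tag.
    chunks = re.split(r"\[CoT \d+\]", generated_text)
    return chunks[-1].strip()

# Usage: generated = <single decoding pass of the fine-tuned model on a prompt>
# print(answer_from_dcot_output(generated))
```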

Implications and Future Directions

The implications of this research are multifaceted. Practically, the introduction of DCoT empowers smaller models to achieve enhanced performance, making high-quality reasoning tasks more accessible without requiring extensive computational resources. This democratizes access to powerful AI and broadens the range of applications for which these LLMs can be effectively utilized.

Theoretically, the success of this method suggests that further exploration into divergent thinking strategies might unlock additional reasoning capabilities in LLMs. The framework presents a new paradigm where multi-step reasoning does not rely solely on external oversight or feedback loops.

Future research may explore the integration of DCoT within larger, more context-rich models or alternative reasoning paradigms such as code prompting or graph-based reasoning. Additionally, investigating the differential impacts of various scales of divergent reasoning (i.e., number of reasoning chains generated) could offer deeper insights into optimizing model training and inference strategies.
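One way to study that last question empirically is sketched below: run the same evaluation set while requesting a different number of chains k per question and compare accuracies. The helper callables (a generation function and a correctness check) are hypothetical placeholders, and the probe as a whole is only a sketch of the suggested analysis, not an experiment from the paper.

```python
# Hypothetical probe of how accuracy varies with the number of requested chains.
# generate_fn and is_correct are placeholder callables supplied by the caller;
# answer_from_dcot_output is the parser from the earlier sketch.
from typing import Callable

def accuracy_by_num_chains(
    dev_set: list[dict],                        # each item: {"question": ..., "answer": ...}
    generate_fn: Callable[[str, int], str],     # (question, k) -> model output text
    is_correct: Callable[[str, str], bool],     # (predicted, gold) -> bool
    max_chains: int = 4,
) -> dict[int, float]:
    """Evaluate the same dev set while asking for 1..max_chains reasoning
    chains per question, each produced in a single inference pass."""
    results = {}
    for k in range(1, max_chains + 1):
        hits = sum(
            is_correct(answer_from_dcot_output(generate_fn(item["question"], k)),
                       item["answer"])
            for item in dev_set
        )
        results[k] = hits / len(dev_set)
    return results
```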

This research underscores the value of fine-tuning with complex reasoning data and sets the stage for subsequent advancements in enhancing AI reasoning through refined model training techniques.
