GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements

Published 13 Feb 2024 in cs.CL and cs.LG | (2402.10963v2)

Abstract: State-of-the-art LLMs can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. Outcome-based Reward Models (ORMs), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (PRMs), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (SORMs) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.

Summary

The paper proposes Stepwise ORMs (SORMs) to overcome LLM reasoning limitations by detecting erroneous intermediate steps.
It employs a three-stage methodology to determine when, where, and how to refine responses through global and local corrections.
Evaluations on benchmarks like GSM8K and SVAMP show that combining global and local refinements, with the SORM locating errors and the ORM reranking candidates, outperforms either refinement on its own.

Introduction

The paper "GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements" addresses a limitation of LLMs in reasoning tasks: they struggle to decide when and where to refine a solution without explicit external feedback. Traditional Outcome-based Reward Models (ORMs) are effective at deciding when to refine a solution, but they are poorly suited to judging the intermediate steps where errors occur. This research introduces Stepwise ORMs (SORMs), which detect incorrect reasoning steps more accurately than ORMs, and pairs them with global and local refinement models that correct those errors.

Refinement Process

The proposed method decomposes reasoning refinement into three stages: deciding when to refine, identifying where to refine, and determining how to refine. ORMs are trained to predict the correctness of the final answer, so they are of limited use for step-wise refinement: they tend to be overly pessimistic when scoring intermediate steps. To overcome this, the paper proposes SORMs, which are trained solely on synthetic data to approximate the expected future reward of the optimal policy, $V^{\star}$.
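The "when" and "where" decisions can be illustrated with a minimal Python sketch. It assumes an ORM score for the whole draft and one SORM score per step are already available; the function names, the dictionary layout, and the 0.5 threshold are hypothetical choices for illustration, not values taken from the paper.

```python
from typing import Dict, List, Optional


def first_error_step(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below the threshold, or None."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


def plan_refinement(
    orm_score: float,
    step_scores: List[float],
    threshold: float = 0.5,  # assumed cutoff, not from the paper
) -> Dict[str, object]:
    """Decide when to refine (ORM) and where to refine (step-wise scores)."""
    # "When": a draft the ORM already trusts is left untouched.
    if orm_score >= threshold:
        return {"refine": False, "error_step": None}
    # "Where": the first low-scoring step becomes the critique location
    # that a local refiner would receive; a global refiner ignores it.
    return {"refine": True, "error_step": first_error_step(step_scores, threshold)}
```

If no step falls below the threshold, `error_step` stays `None`, in which case only a global refinement (which needs no location) would apply in this sketch.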

Figure 1: Diagram for three-stage refinement training pipeline.

The SORM is trained to predict, for each step of a solution, whether the final answer comes out correct when the student policy is sampled many times from that step onward (rather than only once, as in the case of ORMs). This lets the SORM give more accurate feedback on intermediate steps than an ORM. The same synthetic data is reused to train the refiners: incorrect solutions are paired with correct ones to train a global refiner, which rewrites the solution from scratch, and a local refiner, which corrects the solution starting from the first reasoning error.
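A rough sketch of how such step-level labels could be produced from rollouts is given below. It assumes a step is labeled positive when at least one of `k` sampled continuations reaches the correct final answer; that labeling rule, the default `k=8`, and the `rollout` and `is_correct` callables are assumptions made for illustration, and `toy_rollout` is a deterministic stand-in for a real student policy.

```python
from typing import Callable, List


def sorm_step_labels(
    question: str,
    steps: List[str],
    rollout: Callable[[str, List[str]], str],
    is_correct: Callable[[str], bool],
    k: int = 8,  # number of rollouts per prefix (assumed value)
) -> List[int]:
    """Label each step prefix 1 if any of k rollouts from it ends correctly."""
    labels = []
    for end in range(1, len(steps) + 1):
        prefix = steps[:end]
        # Sample the policy k times from this prefix and check the final answers.
        reachable = any(is_correct(rollout(question, prefix)) for _ in range(k))
        labels.append(1 if reachable else 0)
    return labels


def toy_rollout(question: str, prefix: List[str]) -> str:
    """Hypothetical policy: it can never recover once a bad step is present."""
    return "0" if "bad step" in prefix else "42"


labels = sorm_step_labels(
    "toy question",
    ["step a", "step b", "bad step", "step c"],
    rollout=toy_rollout,
    is_correct=lambda answer: answer == "42",
)
```

In this toy run the first two prefixes are labeled 1 and every prefix containing the bad step is labeled 0, so the first 0 marks the step a local refiner would be pointed at.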

Evaluation and Results

The evaluation tests both ORMs and SORMs on benchmarks such as GSM8K and SVAMP, using LLaMA-2 models. The SORM judged intermediate steps more accurately than the ORM, which in turn improved the accuracy of refinements, particularly on challenging datasets.

Figure 2: Example of local and global refinements on a math word problem.

The results indicate that ORMs remain the better choice for predicting final-answer correctness, while SORMs are better at assessing individual reasoning steps. Global and local refinements complement each other because they tend to fix different sets of problems. Combining them, with the SORM locating the first error and the ORM reranking the candidates, significantly outperforms either refinement alone as well as a best-of-three sampling baseline. With this strategy, the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K rises from 53% to 65% under greedy sampling.
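The reranking step can be sketched as follows: the draft, a global refinement, and a local refinement form three candidates, and the ORM picks the one it scores highest. All names here are hypothetical, and the `toy_*` functions are stand-ins for trained models used only to make the example runnable.

```python
from typing import Callable, List, Optional


def rerank(question: str, candidates: List[str], orm_score: Callable[[str, str], float]) -> str:
    """Return the candidate solution with the highest ORM score."""
    return max(candidates, key=lambda solution: orm_score(question, solution))


def refine_and_rerank(
    question: str,
    draft: str,
    global_refine: Callable[[str, str], str],
    local_refine: Callable[[str, str, int], str],
    orm_score: Callable[[str, str], float],
    error_step: Optional[int],
) -> str:
    """Build draft, global, and (if an error was located) local candidates, then rerank."""
    candidates = [draft, global_refine(question, draft)]
    if error_step is not None:
        candidates.append(local_refine(question, draft, error_step))
    return rerank(question, candidates, orm_score)


# Toy stand-ins for the trained refiners and ORM (assumptions for illustration only).
def toy_global(question: str, draft: str) -> str:
    return "global rewrite"


def toy_local(question: str, draft: str, error_step: int) -> str:
    return f"local fix at step {error_step}"


def toy_orm(question: str, solution: str) -> float:
    scores = {"draft": 0.2, "global rewrite": 0.6, "local fix at step 1": 0.8}
    return scores.get(solution, 0.0)


best = refine_and_rerank("toy question", "draft", toy_global, toy_local, toy_orm, error_step=1)
```

Keeping the original draft in the candidate list means a refinement is only adopted when the ORM prefers it, which is one simple way to avoid replacing a correct draft with a worse rewrite.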

Conclusion

This paper's contributions lie in decomposing the reasoning refinement problem into structured stages, highlighting the limitations of ORMs, and introducing SORMs to judge intermediate steps more accurately. The results show substantial accuracy improvements, demonstrating the potential of SORMs and combined refinement strategies for advancing LLM reasoning performance. Future work could improve the reliability of these models by integrating more sophisticated error-critique mechanisms and by broadening dataset diversity for better generalization. SORMs therefore appear to be a promising direction for further research on synthesizing step-wise value functions and improving LLM performance on complex reasoning tasks.
