
Self-Evaluation Guided Beam Search for Reasoning (2305.00633v3)

Published 1 May 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Breaking down a problem into intermediate steps has demonstrated impressive performance in LLM reasoning. However, the growth of the reasoning chain introduces uncertainty and error accumulation, making it challenging to elicit accurate final results. To tackle this challenge of uncertainty in multi-step reasoning, we introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of LLMs. We propose a decoding algorithm integrating the self-evaluation guidance via stochastic beam search. The self-evaluation guidance serves as a better-calibrated automatic criterion, facilitating an efficient search in the reasoning space and resulting in superior prediction quality. Stochastic beam search balances exploitation and exploration of the search space with temperature-controlled randomness. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K, AQuA, and StrategyQA benchmarks, respectively. Experiment results with Llama-2 on arithmetic reasoning demonstrate the efficiency of our method in outperforming the baseline methods with comparable computational budgets. Further analysis in multi-step reasoning finds our self-evaluation guidance pinpoints logic failures and leads to higher consistency and robustness. Our code is publicly available at https://guideddecoding.github.io/.


Summary

  • The paper introduces a novel self-evaluation guided beam search method that decomposes multi-step reasoning tasks into intermediate steps to reduce uncertainty and error buildup.
  • The methodology integrates stochastic beam search with temperature-controlled randomness, balancing exploitation and exploration to enhance prediction accuracy.
  • Empirical results demonstrate significant accuracy improvements on arithmetic, symbolic, and commonsense benchmarks despite added computational overhead.

Self-Evaluation Guided Beam Search for Reasoning

The paper "Self-Evaluation Guided Beam Search for Reasoning" (2305.00633) introduces an approach to enhance reasoning in LLMs by integrating self-evaluation guidance into stochastic beam search. The method breaks complex reasoning tasks into intermediate steps, which reduces uncertainty and error accumulation and yields more accurate predictions.

Introduction to Multi-step Reasoning Challenges

LLMs have demonstrated significant capabilities in reasoning across various tasks through techniques like few-shot prompting. However, increasing the complexity and length of reasoning chains introduces challenges such as error accumulation and uncertainty, which can hinder the final accuracy of predictions. The presented solution tackles these issues by employing self-evaluation mechanisms to guide the beam search process in multi-step reasoning tasks.

The authors propose a framework that utilizes self-evaluation guided stochastic beam search to enhance LLM reasoning. This framework decomposes reasoning into intermediate steps, allowing for a more granular evaluation of each step's correctness. The process is illustrated through a decoding algorithm that balances exploitation and exploration within the reasoning space using temperature-controlled randomness.

Figure 1: Self-Evaluation can calibrate the decoding direction in multi-step reasoning.

Figure 2: Our framework of self-evaluation guided stochastic beam search for multi-step reasoning.
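Concretely, the framework described above can be sketched as the following loop. The interfaces `generate_step` and `evaluate_step`, the mixing weight `lam`, and the Gumbel-perturbed selection are illustrative assumptions, not the authors' exact prompts or API:

```python
import math
import random

def gumbel():
    """Sample standard Gumbel noise for perturbed top-k selection."""
    return -math.log(-math.log(random.random()))

def guided_beam_search(problem, generate_step, evaluate_step,
                       beam_size=4, max_steps=10, temperature=0.5, lam=0.5):
    """Illustrative sketch of self-evaluation guided stochastic beam search.

    Assumed interfaces (placeholders, not the paper's implementation):
      generate_step(problem, chain) -> list of (step_text, logprob) proposals
      evaluate_step(problem, chain, step) -> confidence in (0, 1]
    """
    beams = [([], 0.0)]  # (reasoning chain, cumulative log-space score)
    for _ in range(max_steps):
        candidates = []
        for chain, score in beams:
            for step, logprob in generate_step(problem, chain):
                conf = evaluate_step(problem, chain, step)
                # combine generation probability with self-evaluation confidence
                new_score = score + lam * logprob + (1 - lam) * math.log(conf)
                candidates.append((chain + [step], new_score))
        if not candidates:
            break
        # stochastic top-k via Gumbel perturbation: draws beam_size chains
        # without replacement, weighted by softmax(score / temperature)
        candidates.sort(key=lambda c: c[1] / temperature + gumbel(), reverse=True)
        beams = candidates[:beam_size]
    return max(beams, key=lambda b: b[1])[0]
```

At low temperature the selection approaches deterministic beam search; higher temperatures admit lower-scoring chains and increase diversity.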

Decoding Process with Self-Evaluation

The decoding process treats each intermediate reasoning step as a sequence of tokens, allowing for the application of beam search strategies tailored to enhance reasoning accuracy. A constraint function is introduced to evaluate the LLM's confidence in the correctness of each reasoning step. This confidence is combined with the LLM's generation probability to form a new decoding objective function.
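A minimal sketch of such an objective, assuming a log-space interpolation between the two signals (the weight `lam` and the exact functional form are illustrative assumptions, not the paper's stated formula):

```python
import math

def step_score(gen_logprob: float, confidence: float, lam: float = 0.5) -> float:
    """Combine the LM's generation log-probability for a reasoning step
    with its self-evaluation confidence (a probability in (0, 1]).
    lam trades off fluency (generation) against correctness (self-eval)."""
    return lam * gen_logprob + (1.0 - lam) * math.log(confidence)

def chain_score(steps) -> float:
    """Score a partial reasoning chain as the sum of its step scores,
    i.e. the log of a product of per-step terms."""
    return sum(step_score(logprob, conf) for logprob, conf in steps)
```

Because scores combine additively in log space, a step that the model generates fluently but judges unreliable (low confidence) is penalized, steering the beam away from plausible-sounding but incorrect continuations.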

The approach incorporates controllable randomness through stochastic beam search, balancing the quality-diversity trade-off in generating reasoning chains. The combination of self-evaluation and temperature-controlled randomness effectively improves final prediction quality across reasoning tasks.
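The role of temperature in this trade-off can be illustrated with a small sketch: at low temperature the selection distribution over candidate chains concentrates on the highest-scoring one (exploitation), while higher temperatures flatten it toward uniform (exploration). The numbers below are arbitrary example scores:

```python
import math

def selection_probs(scores, temperature):
    """Softmax over candidate-chain scores at a given temperature.
    Lower temperature concentrates mass on the best-scoring chain;
    higher temperature flattens the distribution, admitting more
    diverse (lower-scoring) chains into the beam."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = [0.0, -1.0, -2.0]
print(selection_probs(scores, 0.2))  # sharply peaked on the first chain
print(selection_probs(scores, 5.0))  # nearly uniform: more diverse beams
```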

Empirical Results

The proposed method was evaluated on benchmarks across arithmetic, symbolic, and commonsense reasoning tasks. It demonstrated significant improvements in model accuracy over baseline methods:

  • Arithmetic Reasoning: On GSM8K and AQuA, absolute accuracy improved by 6.34% and 9.56%, respectively, over the Codex-backboned baselines.
  • Symbolic and Commonsense Reasoning: Consistent gains were observed, including a 5.46% improvement on StrategyQA, demonstrating the method's effectiveness in navigating complex reasoning chains.

Figure 3: PAL Prompting Methods on GSM8K.

Figure 4: Effect of beam size.

Figure 5: Examples of self-evaluation score distribution of different predictions on the GSM8K dataset.

Discussion on Computational Costs and Limitations

Despite its empirical success, the method incurs additional computational cost from the extra sampling required for self-evaluation and candidate generation. It remains efficient for longer reasoning chains, however, where stepwise calibration yields the largest accuracy gains.

Figure 6: Accuracy curves with different sampling diversity.

Conclusion

Self-evaluation guided beam search is a promising approach to enhance reasoning in LLMs by reducing error rates in multi-step tasks. It offers insights into leveraging LLMs' self-evaluation capabilities to refine logical reasoning, applicable in various domains requiring complex reasoning and decision-making. Future advancements may explore integrating external tools for improved calibration and generalization across multi-step reasoning scenarios.
