
Improve Mathematical Reasoning in Language Models by Automated Process Supervision (2406.06592v2)

Published 5 Jun 2024 in cs.CL and cs.LG

Abstract: Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced LLMs. Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both prohibitively expensive to scale, thus hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm swiftly identifies the first error in the Chain of Thought (CoT) with binary search and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train Process Reward Models (PRMs). This fully automated process supervision alongside the weighted self-consistency algorithm is able to enhance LLMs' math reasoning performances. We improved the success rates of the instruction-tuned Gemini Pro model from 51% to 69.4% on MATH500 and from 86.4% to 93.6% on GSM8K. Similarly, we boosted the success rates of Gemma2 27B from 42.3% to 58.2% on MATH500 and from 74.0% to 92.2% on GSM8K. The entire process operates without any human intervention or supervision, making our method both financially and ...


Summary

  • The paper introduces OmegaPRM, a novel MCTS-based approach that automates process supervision for intermediate reasoning steps in LLMs.
  • It employs a divide-and-conquer strategy to collect over 1.5 million process supervision annotations, yielding a roughly 36% relative improvement on the MATH500 benchmark.
  • The study demonstrates scalable, cost-effective process supervision that enhances LLM reasoning in complex mathematical tasks.

Automated Process Supervision for Enhanced Mathematical Reasoning in LLMs

Improving the mathematical reasoning capabilities of LLMs represents a significant research challenge, particularly for tasks demanding complex multi-step reasoning such as solving math problems or coding. This paper explores the application of process supervision to refine the intermediate reasoning steps of LLMs through a novel Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM.

Process Supervision and its Implementation

Chain-of-Thought (CoT) prompting has proven effective at breaking reasoning tasks into sequential steps, mimicking human problem solving. However, its performance can be degraded by greedy decoding, and existing remedies such as self-consistency prompting and supervised fine-tuning on question-solution pairs address this only partially because they neither reward nor penalize intermediate steps.

Outcome Reward Models (ORMs) verify output correctness, but they score only the final answer rather than supervising the intermediate reasoning steps. Process Reward Models (PRMs) close this gap by implementing process supervision, which provides more granular feedback by rewarding or penalizing each reasoning step and thereby strengthens the model's reasoning.
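To make the distinction concrete, the sketch below contrasts how an ORM and a PRM might score a sampled solution. The scorer callables and the min-aggregation rule are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

def orm_solution_score(question: str, solution: str,
                       orm_score: Callable[[str, str], float]) -> float:
    """ORM: a single scalar judging the complete solution, final answer included."""
    return orm_score(question, solution)

def prm_solution_score(question: str, steps: List[str],
                       prm_step_score: Callable[[str, List[str]], float]) -> float:
    """PRM: one score per reasoning step; aggregating with min means a single
    bad step penalizes the whole chain of thought (aggregation rule assumed)."""
    scores = [prm_step_score(question, steps[: i + 1]) for i in range(len(steps))]
    return min(scores) if scores else 0.0
```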

Collecting high-quality process supervision data has historically relied on human annotation or computationally intensive per-step Monte Carlo estimation, both of which are expensive to scale. To overcome these challenges, the paper presents OmegaPRM, an MCTS-based approach that automates this data collection, as sketched below.
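For context on the cost argument, here is a minimal Python sketch of naive per-step Monte Carlo estimation: every step of every solution needs its own batch of rollouts, so annotation cost grows linearly with chain length. The helper names `sample_completions` and `is_correct` are hypothetical stand-ins for the policy sampler and answer checker.

```python
from typing import Callable, List

def mc_correctness(question: str, prefix_steps: List[str], gold_answer: str,
                   sample_completions: Callable, is_correct: Callable,
                   k: int = 8) -> float:
    """Estimate P(correct final answer | solution prefix) from k policy rollouts."""
    rollouts = sample_completions(question, prefix_steps, k)
    return sum(is_correct(r, gold_answer) for r in rollouts) / k

def per_step_labels(question: str, steps: List[str], gold_answer: str,
                    sample_completions: Callable, is_correct: Callable,
                    k: int = 8) -> List[float]:
    """Naive baseline: label every prefix independently, i.e. k rollouts per step."""
    return [mc_correctness(question, steps[: i + 1], gold_answer,
                           sample_completions, is_correct, k)
            for i in range(len(steps))]
```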

Monte Carlo Tree Search (MCTS) and OmegaPRM

OmegaPRM adapts the MCTS algorithm for process supervision. It uses a divide-and-conquer strategy, binary-searching along each chain of thought to efficiently locate and annotate the first erroneous step. This method yields more than 1.5 million process annotations, which form the training data for a highly effective PRM.

Figure 1: Example tree structure built with the proposed OmegaPRM algorithm.

The algorithm constructs a tree of reasoning paths in which each node represents a partial solution (a CoT prefix) whose correctness is estimated with Monte Carlo rollouts (Figure 2).

Figure 2: Monte Carlo estimation of a prefix solution.
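The following sketch illustrates the divide-and-conquer idea in simplified form: binary-search over prefix lengths for the first step after which no rollout recovers the correct answer, reusing the `mc_correctness` estimator from the sketch above. The zero-score threshold and the exact bisection policy are simplifying assumptions, not the paper's precise procedure.

```python
from typing import Callable, List, Optional

def find_first_error(question: str, steps: List[str], gold_answer: str,
                     sample_completions: Callable, is_correct: Callable,
                     k: int = 8) -> Optional[int]:
    """Binary-search for the first step after which no rollout reaches the correct
    answer. Returns a 1-indexed step, or None if every prefix is still recoverable.
    Assumes correctness is (roughly) monotone in prefix length."""
    lo, hi = 0, len(steps)          # invariant: the prefix of length lo is still "good"
    while lo < hi:
        mid = (lo + hi + 1) // 2    # probe the midpoint prefix
        score = mc_correctness(question, steps[:mid], gold_answer,
                               sample_completions, is_correct, k)
        if score > 0.0:             # some rollout still succeeds: the error is later
            lo = mid
        else:                       # no rollout succeeds: the error is at or before mid
            hi = mid - 1
    return lo + 1 if lo < len(steps) else None
```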

This approach not only improves annotation efficiency but also offers a scalable solution readily applicable to various LLM reasoning tasks, significantly reducing reliance on extensive human supervision.

Performance Evaluation

An empirical analysis demonstrates that PRMs trained on OmegaPRM-derived annotations outperform PRMs trained on alternative datasets such as PRM800K and Math-Shepherd in reasoning accuracy. Combining the PRM with the weighted self-consistency algorithm achieved a 69.4% success rate on the MATH500 benchmark, a roughly 36% relative improvement over the 51% baseline.

Figure 3: A comparison of PRMs trained with different process supervision datasets, evaluated by PRM-weighted majority voting.
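A minimal sketch of PRM-weighted self-consistency (weighted majority voting), assuming each sampled solution carries an extracted final answer and a PRM score already aggregated over its steps:

```python
from collections import defaultdict
from typing import Iterable, Tuple

def weighted_self_consistency(solutions: Iterable[Tuple[str, float]]) -> str:
    """solutions: (final_answer, prm_score) pairs, one per sampled chain of thought.
    Returns the answer with the largest total PRM-weighted vote."""
    votes = defaultdict(float)
    for answer, prm_score in solutions:
        votes[answer] += prm_score
    return max(votes, key=votes.get)

# Usage: weighted_self_consistency([("42", 0.9), ("41", 0.4), ("42", 0.7)]) -> "42"
```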

Additionally, tree traversal is governed by a selection policy in the style of classical tree search, balancing exploration and exploitation so that rollouts concentrate on the states most informative for verifying reasoning steps.
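A hedged sketch of a PUCT-style selection rule that could drive such traversal; the constant `c_puct`, the choice of value term, and the data layout are assumptions for illustration, not the paper's exact heuristic.

```python
import math
from typing import Dict, List

def puct_score(value: float, visits: int, parent_visits: int,
               c_puct: float = 1.0) -> float:
    """Exploitation term plus a visit-count-based exploration bonus."""
    return value + c_puct * math.sqrt(parent_visits) / (1 + visits)

def select_child(children: List[Dict]) -> Dict:
    """children: dicts with 'value' and 'visits'; pick the highest-scoring node to expand."""
    parent_visits = sum(c["visits"] for c in children) + 1
    return max(children, key=lambda c: puct_score(c["value"], c["visits"], parent_visits))
```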

Implications and Future Prospects

OmegaPRM highlights the potential of automated process annotation to significantly elevate the reasoning abilities of LLMs in complex mathematical tasks. The efficiency of the OmegaPRM method marks an advancement towards economically viable and computationally efficient process supervision frameworks.

Future work could integrate human insight with automated annotation to provide more comprehensive and nuanced process supervision, and could extend OmegaPRM beyond well-structured tasks to open-ended problems.

Conclusion

OmegaPRM represents a meaningful step toward improving the reasoning proficiency of LLMs by automating process supervision. The work paves the way for future research into cost-effective, high-quality data collection methods that can scale process-supervised reasoning models and broaden their applicability across domains.
