Abstract

Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced LLMs. Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both prohibitively expensive to scale, thus hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm swiftly identifies the first error in the Chain of Thought (CoT) with binary search and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM). Utilizing this fully automated process supervision alongside the weighted self-consistency algorithm, we have enhanced the instruction-tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark, a 36% relative improvement over the 51% base model performance. Additionally, the entire process operates without any human intervention, making our method both financially and computationally cost-effective compared to existing methods.

Figure: Comparison of PRMs trained with different datasets, showing variance in solution search effectiveness.

Overview

  • The paper introduces the OmegaPRM algorithm, a novel Monte Carlo Tree Search approach designed to enhance the collection of high-quality process supervision annotations, improving the training of Process Reward Models (PRMs) for LLMs.

  • Using the OmegaPRM algorithm, the researchers automatically collected over 1.5 million process annotations, which were used to train a PRM that, combined with weighted self-consistency, significantly boosted the instruction-tuned Gemini Pro model's performance on the MATH benchmark.

  • The study demonstrates a 36% relative improvement in success rate for the Gemini Pro model on complex mathematical reasoning tasks (from 51% to 69.4% on MATH), highlighting the potential of scalable, automated methods to enhance LLM capabilities.

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

The paper, "Improve Mathematical Reasoning in Language Models by Automated Process Supervision," authored by researchers at Google DeepMind and Google, introduces a novel approach to enhance the mathematical reasoning capabilities of LLMs. The methodology leverages a divide-and-conquer Monte Carlo Tree Search (MCTS) algorithm, named OmegaPRM, to improve the efficiency and quality of process supervision data collection, subsequently training a Process Reward Model (PRM) to boost LLM performance.

Background

Complex multi-step reasoning tasks, such as mathematical problem-solving and code generation, present significant challenges for LLMs. Approaches like Chain-of-Thought (CoT) prompting and self-consistency strategies have shown improvements, yet they often fall short for tasks requiring lengthy or multi-hop reasoning. Previous methods to collect process supervision data have been either manually intensive or computationally expensive, limiting their scalability.

Contributions

The paper introduces several key advancements:

  1. OmegaPRM Algorithm: A novel MCTS algorithm tailored for generating high-quality process supervision annotations. Unlike existing techniques, OmegaPRM efficiently identifies the first error in the reasoning chain, thereby enhancing the quality of data collected.
  2. Automated and Scalable Process Supervision: The proposed method enables the collection of over 1.5 million annotations without human intervention, making the process both financially and computationally efficient.
  3. Improved Instruction-Tuned Model: The researchers integrated their process supervision data with a weighted self-consistency algorithm, leading to significant performance improvements on the MATH benchmark.
  4. Empirical Validation: They demonstrated a 36% relative improvement in success rates for the Gemini Pro model on complex mathematical reasoning tasks.

Methodology

Process Supervision

The approach involves training a PRM to predict the correctness of each step in the reasoning process. Traditional methods require human annotations or expensive per-step Monte Carlo estimation. The proposed OmegaPRM algorithm instead uses a divide-and-conquer technique to identify incorrect steps more efficiently.
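To make the baseline concrete, below is a minimal sketch of per-step Monte Carlo value estimation, the expensive procedure OmegaPRM improves on. The `sample_completions` and `is_correct` callables are hypothetical stand-ins for an LLM rollout API and a ground-truth answer checker; they are assumptions for illustration, not details from the paper.

```python
from typing import Callable, List

def mc_estimate(
    prefix: str,
    sample_completions: Callable[[str, int], List[str]],  # hypothetical LLM rollout fn
    is_correct: Callable[[str], bool],                     # hypothetical final-answer check
    k: int = 8,
) -> float:
    """Monte Carlo estimate of P(correct final answer | solution prefix).

    Samples k completions of the prefix and returns the fraction that reach
    the right answer. Naive per-step supervision runs this once per
    reasoning step, which is what makes it so costly to scale.
    """
    completions = sample_completions(prefix, k)
    return sum(is_correct(c) for c in completions) / k
```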

Monte Carlo Tree Search

The MCTS algorithm builds a tree in which each node represents a state (a question and its corresponding partial solution) and each edge represents a state transition guided by the LLM policy. The OmegaPRM algorithm involves three phases:

  1. Select: Utilizes a modified PUCT algorithm to select rollouts based on a state-rollout value function.
  2. Binary Search: Efficiently locates the first error in the reasoning chain via binary search, updating the tree structure accordingly (a sketch follows this list).
  3. Maintain: Updates tree statistics and state values, ensuring the process is optimized for training the PRM.
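The binary-search phase is where the efficiency gain comes from. Below is a minimal sketch of the divide-and-conquer idea, assuming a `value` callable like the `mc_estimate` sketch above; the tree bookkeeping, PUCT selection, and rollout reuse of the actual algorithm are omitted. It rests on the observation that if any rollout from a prefix still reaches the correct answer, the first error must lie after that prefix.

```python
from typing import Callable, List

def find_first_error(
    steps: List[str],
    value: Callable[[str], float],  # e.g. mc_estimate over a solution prefix
) -> int:
    """Locate the first erroneous step of a solution known to be wrong.

    A prefix with value > 0 can still reach a correct answer, so its steps
    are treated as correct and the first error lies strictly after it.
    Each probe costs one batch of rollouts, so the search needs O(log n)
    batches instead of the O(n) of step-by-step Monte Carlo estimation.
    """
    lo, hi = 0, len(steps) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = "\n".join(steps[: mid + 1])
        if value(prefix) > 0:  # steps[0..mid] can still succeed
            lo = mid + 1       # first error is to the right of mid
        else:                  # no sampled rollout succeeds from here
            hi = mid           # first error is at or before mid
    return lo
```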

Results and Comparisons

Combined with weighted self-consistency, the PRM trained on OmegaPRM annotations achieved a 69.4% success rate on the MATH benchmark, a notable improvement over the 51% success rate of the base Gemini Pro model. This was achieved without any human intervention, in contrast with the 800,000 human-annotated steps used in previous research. The results suggest that even with potential noise in the automated annotations, the quality and volume of the data together yield superior model performance.
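Weighted self-consistency aggregates sampled solutions by PRM score rather than by raw vote count. The sketch below assumes each candidate solution has already been reduced to a single PRM score (for example, the minimum or product of its per-step scores; the exact aggregation is an assumption here, not taken from the paper).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def weighted_self_consistency(candidates: List[Tuple[str, float]]) -> str:
    """Pick the final answer whose solutions carry the most total PRM score.

    `candidates` holds (final_answer, prm_score) pairs, one per sampled
    solution. Plain self-consistency is the special case where every score
    equals 1.0, i.e. a pure majority vote over final answers.
    """
    mass: Dict[str, float] = defaultdict(float)
    for answer, score in candidates:
        mass[answer] += score  # accumulate PRM mass per distinct answer
    return max(mass, key=mass.get)

# Example: three sampled solutions with two distinct final answers.
print(weighted_self_consistency([("42", 0.9), ("41", 0.4), ("42", 0.2)]))  # "42"
```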

Implications and Future Work

The research underscores the potential of automated and scalable methods for enhancing LLM capabilities, particularly in complex multi-step reasoning tasks. It suggests that future work could explore:

  • Reducing Annotation Noise: Enhancing the precision of automated annotations to further improve PRM training.
  • Expanding to Other Domains: Adapting the OmegaPRM algorithm for open-ended tasks and other reasoning domains.
  • Human and Machine Collaboration: Developing hybrid models that integrate human expertise with automated processes for even higher-quality supervision data.

In summary, the paper introduces a robust, scalable method for process supervision that significantly enhances the mathematical reasoning performance of LLMs. The OmegaPRM algorithm stands out for its efficiency and the quality of data it generates, presenting a promising avenue for future improvements in AI capabilities.
