- The paper introduces OmegaPRM, a novel MCTS-based approach that automates process supervision for intermediate reasoning steps in LLMs.
- It employs a divide-and-conquer strategy to collect over 1.5 million process supervision annotations, yielding a 36% relative improvement over the base model on the MATH benchmark.
- The study demonstrates scalable, cost-effective process supervision that enhances LLM reasoning in complex mathematical tasks.
Automated Process Supervision for Enhanced Mathematical Reasoning in LLMs
Improving the mathematical reasoning capabilities of LLMs represents a significant research challenge, particularly for tasks demanding complex multi-step reasoning such as solving math problems or coding. This paper explores the application of process supervision to refine the intermediate reasoning steps of LLMs through a novel Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM.
Process Supervision and its Implementation
Chain-of-Thought (CoT) prompting has proven effective at breaking reasoning tasks into sequential steps, mimicking human problem solving. However, its performance can be hindered by errors introduced under greedy decoding. Existing remedies, such as self-consistency prompting and supervised fine-tuning on question-solution pairs, offer improvements but fall short because they provide no reward signal for the intermediate steps.
Outcome Reward Models (ORMs) verify whether the final answer is correct but offer no supervision over the intermediate reasoning. Process Reward Models (PRMs) close this gap through process supervision: they reward or penalize each individual reasoning step, giving the model far more granular feedback.
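As a concrete illustration, the sketch below contrasts the two reward types: an ORM scores only the final answer, while a PRM scores every step, and those step scores are then combined into a solution-level score. The aggregation rule (`min` or product) is a common convention in the PRM literature rather than a detail taken from this paper.

```python
# Minimal sketch (not the paper's exact formulation): aggregating
# per-step PRM scores into one solution-level score. An ORM would
# instead assign a single score to the final answer only.

def orm_score(final_answer_score: float) -> float:
    """Outcome-level reward: one score for the whole solution."""
    return final_answer_score

def prm_score(step_scores: list[float], how: str = "min") -> float:
    """Process-level reward: combine per-step correctness probabilities.

    `how` is an assumed aggregation choice ('min' or 'prod'); both are
    common choices, neither is claimed to be the paper's.
    """
    if not step_scores:
        return 0.0
    if how == "prod":
        result = 1.0
        for s in step_scores:
            result *= s
        return result
    return min(step_scores)

# Example: a solution whose third step looks dubious is penalized,
# even if its final answer happens to be correct.
print(prm_score([0.95, 0.90, 0.30, 0.88]))  # -> 0.30
```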
Historically, high-quality process supervision data has relied on human annotation or on computationally expensive per-step Monte Carlo estimation. To overcome these costs, the paper presents OmegaPRM, an MCTS-based approach that automates the data collection.
Monte Carlo Tree Search (MCTS) and OmegaPRM
OmegaPRM adapts the MCTS algorithm for process supervision. It uses a divide-and-conquer strategy, essentially a binary search over solution prefixes, to efficiently locate and annotate the first error in a chain-of-thought rollout. This method yields more than 1.5 million process annotations, which form the training data for a highly effective PRM.
Figure 1: Example tree structure built with our proposed OmegaPRM algorithm.
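A hedged sketch of the divide-and-conquer idea follows: binary-search over solution prefixes to find the first incorrect step, assuming that once an error appears no later prefix recovers. The helper `mc_correctness` is hypothetical; it stands in for the Monte Carlo estimate of whether rollouts from a prefix reach the correct answer (see Figure 2 below).

```python
# Hedged sketch: binary search for the first erroneous step in a
# chain-of-thought, assuming correctness of prefixes is monotone
# (once an error occurs, longer prefixes stay incorrect).

from typing import Callable, List

def first_error_step(
    steps: List[str],
    mc_correctness: Callable[[List[str]], float],  # hypothetical helper
    threshold: float = 0.0,
) -> int:
    """Return the 0-based index of the first step after which rollouts can
    no longer reach the correct answer (estimate <= threshold), or
    len(steps) if every prefix still looks recoverable."""
    lo, hi = 0, len(steps)  # invariant: the prefix of length lo is "good"
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mc_correctness(steps[:mid]) > threshold:
            lo = mid          # prefix up to mid still reaches correct answers
        else:
            hi = mid - 1      # the first error lies at or before step mid
    return lo                 # steps[lo] is the first erroneous step, if any
```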
The algorithm constructs a tree of reasoning paths in which each node represents a partial solution within the CoT framework, evaluated with Monte Carlo rollouts (Figure 2).
Figure 2: Monte Carlo estimation of a prefix solution.
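The Monte Carlo estimate itself can be sketched as below; `sample_completion` and `extract_answer` are hypothetical helpers standing in for the policy LLM's sampler and an answer parser, not APIs from the paper.

```python
# Minimal sketch of the Monte Carlo estimate in Figure 2: sample several
# continuations of a partial solution and report the fraction that reach
# the gold answer.

from typing import Callable, List

def monte_carlo_estimate(
    question: str,
    prefix_steps: List[str],
    gold_answer: str,
    sample_completion: Callable[[str, List[str]], str],  # hypothetical
    extract_answer: Callable[[str], str],                # hypothetical
    num_rollouts: int = 8,
) -> float:
    """Estimate the probability that continuing from `prefix_steps`
    leads to the correct final answer."""
    correct = 0
    for _ in range(num_rollouts):
        completion = sample_completion(question, prefix_steps)
        if extract_answer(completion) == gold_answer:
            correct += 1
    return correct / num_rollouts
```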
This approach not only improves annotation efficiency but also scales readily to other LLM reasoning tasks, substantially reducing reliance on human supervision.
Empirically, PRMs trained on OmegaPRM-derived annotations outperform PRMs trained on alternative datasets such as PRM800K and Math-Shepherd in reasoning accuracy. Combining the resulting PRM with a weighted self-consistency algorithm achieves a 69.4% success rate on the MATH benchmark, a 36% relative improvement over the base model.
Figure 3: A comparison of PRMs trained with different process supervision datasets, evaluated by a PRM-weighted majority voting.
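A minimal sketch of PRM-weighted majority voting (weighted self-consistency): each sampled solution votes for its final answer with a weight equal to its PRM score (for example, the step-score aggregate sketched earlier), and the answer carrying the greatest total weight wins.

```python
# Sketch of PRM-weighted majority voting over sampled solutions.

from collections import defaultdict
from typing import List, Tuple

def weighted_self_consistency(candidates: List[Tuple[str, float]]) -> str:
    """candidates: (final_answer, prm_score) pairs for sampled solutions.
    Returns the answer whose solutions carry the most total PRM weight."""
    weight = defaultdict(float)
    for answer, score in candidates:
        weight[answer] += score
    return max(weight, key=weight.get)

# Example: "42" wins despite a tie in raw votes, because its solutions
# score higher under the PRM.
print(weighted_self_consistency([("42", 0.9), ("41", 0.4), ("42", 0.8), ("41", 0.5)]))
```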
Additionally, tree traversal follows a selection rule that balances exploration against exploitation, concentrating rollouts on the reasoning steps whose verification is most informative.
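For illustration only, a generic UCB-style selection rule is sketched below; the paper's exact formula may differ, but the structure is the same: each node's value estimate is augmented with an exploration bonus that shrinks as the node accumulates visits.

```python
# Generic UCB-style node selection (illustrative, not the paper's rule).

import math
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    value: float   # e.g., Monte Carlo correctness estimate of this prefix
    visits: int    # how many times this node has been selected for rollouts

def select(nodes: List[Node], c_explore: float = 1.4) -> int:
    """Return the index of the node maximizing value plus an exploration bonus."""
    total_visits = sum(n.visits for n in nodes) or 1
    def ucb(n: Node) -> float:
        bonus = c_explore * math.sqrt(math.log(total_visits + 1) / (n.visits + 1))
        return n.value + bonus
    return max(range(len(nodes)), key=lambda i: ucb(nodes[i]))
```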
Implications and Future Prospects
OmegaPRM highlights the potential of automated process annotation to substantially raise the reasoning ability of LLMs on complex mathematical tasks, and it marks a step toward process supervision frameworks that are both economically viable and computationally efficient.
Looking ahead, combining human insight with automated annotation could yield more comprehensive and nuanced process supervision, and future work may extend OmegaPRM beyond well-structured tasks to open-ended problems.
Conclusion
OmegaPRM represents a meaningful step toward stronger reasoning in LLMs by automating process supervision. The work paves the way for cost-effective, high-quality data collection methods that can scale up reasoning models and broaden their applicability across domains.