Solving math word problems with process- and outcome-based feedback

Published 25 Nov 2022 in cs.LG, cs.AI, and cs.CL | (2211.14275v1)

Abstract: Recent work has shown that asking LLMs to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% $\to$ 12.7% final-answer error and 14.0% $\to$ 3.4% reasoning error among final-answer-correct solutions.



Summary

The paper systematically compares process- and outcome-based supervision for solving math word problems, highlighting trade-offs in label use and reasoning quality.
Outcome-based methods achieve competitive final-answer error reductions, while process-based supervision ensures superior trace accuracy.
Combining supervised fine-tuning, reinforcement learning, and reward models, the approach reduces final-answer error from 16.8% to 12.7% and trace error from 14.0% to 3.4%.

Evaluating Process- and Outcome-Based Supervision for Math Word Problems

This paper presents a detailed evaluation of process-based and outcome-based supervision for training language models (LMs) to solve math word problems, using the GSM8K dataset. The primary contribution is a systematic comparison of these approaches on two distinct error metrics: final-answer error and trace error, where a trace is the sequence of reasoning steps leading to the answer. The paper sets out the trade-offs of each approach, particularly in label efficiency and in the quality of the model's reasoning process.

Summary and Methodology

The authors train and evaluate LMs using various combinations of supervised fine-tuning (SFT), reinforcement learning (RL) via expert iteration, and reward models (RMs). Process-based supervision relies on reasoning traces and on judgments of each individual step, while outcome-based supervision checks only whether the final answer is correct. The latter is label-efficient, since it needs far less supervision per question. RMs score candidate solutions, so they can be used both to rerank sampled solutions at test time and as a reward signal during RL.
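
The difference between the two kinds of labels, and the idea of reranking with a scoring function, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Solution` class, the `toy_step_checker` stand-in for a human annotator, and the rule that every step after the first error is also marked incorrect are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Solution:
    steps: List[str]    # reasoning trace, one string per step
    final_answer: str   # answer stated at the end of the trace


def outcome_label(solution: Solution, reference_answer: str) -> int:
    """Outcome-based supervision: a single label that checks only the final answer."""
    return int(solution.final_answer.strip() == reference_answer.strip())


def process_labels(solution: Solution,
                   step_is_correct: Callable[[str], bool]) -> List[int]:
    """Process-based supervision: one label per reasoning step.

    Assumption for this sketch: once a step is wrong, all later steps
    are labeled incorrect as well.
    """
    labels, ok = [], True
    for step in solution.steps:
        ok = ok and step_is_correct(step)
        labels.append(int(ok))
    return labels


def toy_step_checker(step: str) -> bool:
    """Hypothetical stand-in for a human annotator: checks 'a op b = c' arithmetic."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    result = int(a) + int(b) if op == "+" else int(a) * int(b)
    return result == int(rhs)


def rerank(candidates: List[Solution],
           score: Callable[[Solution], float]) -> Solution:
    """Return the candidate with the highest score (a reward model would supply `score`)."""
    return max(candidates, key=score)


# Toy problem: 3 apples plus 2 bags of 4 apples; the reference answer is 11.
good = Solution(steps=["2 * 4 = 8", "3 + 8 = 11"], final_answer="11")
flawed = Solution(steps=["2 + 4 = 8", "3 + 8 = 11"], final_answer="11")

print(outcome_label(flawed, "11"))               # outcome label ignores the bad step
print(process_labels(flawed, toy_step_checker))  # process labels flag it
best = rerank([flawed, good],
              score=lambda s: min(process_labels(s, toy_step_checker)))
```

The `flawed` solution reaches the right answer through an incorrect step, which is exactly the case an outcome label cannot distinguish from `good` but a per-step label can.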

Key Findings:

Final-Answer Error: Models trained with outcome-based supervision, using RL or RMs, reach final-answer error rates similar to those of process-based SFT approaches while using less label supervision. This indicates that checking only the final result is often sufficient to improve answer accuracy.
Trace Error: Models trained with process-based supervision, or with learned reward models that emulate process-based feedback, achieve lower trace error rates, meaning their reasoning steps are more often judged correct. A notable finding is that models trained with outcome-supervised reward models (ORMs) often approach the trace correctness of models trained with process-supervised reward models (PRMs).
Quantitative Performance: Combining supervised learning with reward-model-based reinforcement learning improves on the previous best results, reducing final-answer error from 16.8% to 12.7% and trace error among final-answer-correct solutions from 14.0% to 3.4%.
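
To make the two metrics concrete, the sketch below computes final-answer error, and trace error restricted to final-answer-correct solutions, on a handful of made-up annotations. It also works out the relative reductions implied by the reported figures. The record format and the `toy` data are assumptions for illustration only; they are not taken from the paper.

```python
from typing import List, Tuple

# Each record is (final_answer_correct, trace_correct): hypothetical annotations.
Record = Tuple[bool, bool]


def final_answer_error(records: List[Record]) -> float:
    """Fraction of solutions whose final answer is wrong."""
    return sum(not answer_ok for answer_ok, _ in records) / len(records)


def trace_error_given_correct_answer(records: List[Record]) -> float:
    """Fraction of final-answer-correct solutions whose reasoning trace has an error.

    Raises ZeroDivisionError if no solution has a correct final answer.
    """
    traces = [trace_ok for answer_ok, trace_ok in records if answer_ok]
    return sum(not trace_ok for trace_ok in traces) / len(traces)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# Made-up data: four solutions, one wrong answer, one right answer with a bad trace.
toy = [(True, True), (True, False), (False, False), (True, True)]
print(final_answer_error(toy))                # 0.25
print(trace_error_given_correct_answer(toy))  # one of three correct answers has a bad trace

# Relative reductions implied by the figures reported above.
print(round(100 * relative_reduction(16.8, 12.7), 1))  # 24.4
print(round(100 * relative_reduction(14.0, 3.4), 1))   # 75.7
```

In relative terms, the reported trace-error improvement (about 76%) is much larger than the final-answer improvement (about 24%).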

Implications and Future Directions

One of the key insights from this study is the nuanced role of RMs trained with outcome-based labels. Despite being exposed only to outcome information, these models appear to implicitly learn process-based cues, achieving trace performance close to that of models trained with explicit process-based supervision. This finding suggests that such models can partly bridge the gap between process and outcome information, combining the label efficiency of one with the reasoning quality of the other.

From a practical perspective, the research suggests that the choice of supervision should depend on context. Where a correct final answer is all that matters, outcome-based methods are preferable because they need fewer labels. Conversely, in domains where the correctness of the reasoning itself is critical, such as education, process-based supervision or a learned approximation of it is needed.

Theoretically, these results call for a better understanding of how supervisory signals propagate through reinforcement learning and reward modeling. That understanding could inform algorithms that combine process and outcome feedback during training.

Conclusion

This paper provides a rigorous and insightful analysis of how different supervisory approaches influence the performance of LMs on complex reasoning tasks like math problem-solving. The results challenge the simplistic dichotomy between process and outcome approaches, showing that both can usefully inform model training, depending on the goals and constraints of the problem domain. Future research may explore this dual-feedback paradigm further, investigating the broader applicability to dynamic and less structured domains beyond mathematical reasoning.
