
Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (SFT) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade-off between maj@1 and pass@96 metric performance during SFT training and how, conversely, RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.

Figure: Comparison of maj@1 scores for PLR, Backtracking, PPO, and SFT techniques.

Overview

  • Havrilla et al. evaluate the impact of multiple reinforcement learning (RL) algorithms on the reasoning abilities of LLMs, identifying Expert Iteration (EI) as the most effective in most scenarios.

  • The study positions reasoning tasks within the Markov Decision Process (MDP) framework, aiding the application of RL algorithms to improve LLM reasoning with both sparse and dense rewards.

  • Performance of the RL algorithms was assessed on four metrics (maj@1, maj@96, rerank@96, and pass@96); EI showed superior results, challenging the anticipated efficiency advantages of Proximal Policy Optimization (PPO) in complex environments.

  • Findings suggest pretraining plays a pivotal role in LLM capabilities, and future enhancements through RL may require innovative strategies for exploration beyond learned patterns.

Enhancing LLMs' Reasoning Capabilities with Reinforcement Learning

Performance of Reinforcement Learning Algorithms on LLM Reasoning Tasks

In the recent study conducted by Havrilla et al., multiple reinforcement learning (RL) algorithms were examined for their effectiveness in improving the reasoning capabilities of LLMs. The study compared Expert Iteration (EI), Proximal Policy Optimization (PPO), and Return-Conditioned Reinforcement Learning (RCRL) across various settings, involving different reward structures, model sizes, and initializations, both with and without supervised fine-tuning (SFT) data. Notably, EI consistently emerged as the superior approach in most scenarios, while matching PPO's sample efficiency, contrary to conventional expectations from traditional RL applications.
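To make the core method concrete, the loop below is a minimal sketch of Expert Iteration applied to reasoning: sample many candidate solutions per problem, keep those judged correct by a sparse reward, and fine-tune on the filtered set for a few rounds. This is an illustration under simple assumptions, not the authors' implementation; sample_solutions, is_correct, finetune, and the problem.answer field are hypothetical placeholders.

```python
# Minimal Expert Iteration (EI) sketch for reasoning tasks.
# sample_solutions, is_correct, and finetune are hypothetical helpers,
# not APIs from the paper's codebase.

def expert_iteration(model, problems, n_samples=96, n_rounds=3):
    for _ in range(n_rounds):
        filtered = []
        for problem in problems:
            # Sample K candidate solutions per problem from the current policy.
            candidates = sample_solutions(model, problem, k=n_samples)
            # Keep only candidates whose final answer matches the reference
            # (a sparse, binary reward); problem.answer is a hypothetical field.
            filtered += [(problem, sol) for sol in candidates
                         if is_correct(sol, problem.answer)]
        # Distill the filtered "expert" set back into the model via supervised fine-tuning.
        model = finetune(model, filtered)
    return model
```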

Methodological Insights

Reinforcement Learning Formulation for Reasoning

The researchers formulated reasoning as an RL problem by casting question-answer tuples in the Markov Decision Process (MDP) framework: the question serves as the initial state, each generated token (or reasoning step) is an action, and rewards are assigned either sparsely, when the final answer is correct, or densely, at intermediate steps, supplied heuristically or by a learned reward model. This framing makes standard RL algorithms directly applicable to refining the LLMs' reasoning processes.
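As a rough illustration of the sparse-reward case (not the paper's exact reward implementation), the snippet below scores a completed generation by comparing its extracted final answer to the reference; extract_final_answer and its "Answer:" convention are assumptions for the sketch.

```python
# Sparse-reward view of reasoning-as-MDP:
#   state  = question plus the tokens generated so far
#   action = the next token (or reasoning step)
#   reward = 1 only at the end of generation, if the final answer is correct.

def extract_final_answer(generation: str) -> str:
    # Hypothetical convention: the solution ends with "Answer: <value>".
    return generation.rsplit("Answer:", 1)[-1].strip()

def sparse_reward(generation: str, reference_answer: str) -> float:
    return 1.0 if extract_final_answer(generation) == reference_answer else 0.0

# A dense variant would instead score each intermediate step,
# e.g. with a learned reward model, rather than only the final answer.
```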

Algorithm Comparisons and Performance Metrics

EI, PPO, and RCRL were evaluated on four primary performance metrics: maj@1, maj@96, rerank@96, and pass@96. Despite the varying complexity and theoretical advantages of these algorithms under different conditions, EI displayed superior performance across most metrics. A crucial finding was the similar sample efficiency of EI and PPO, challenging the prevalent notion of PPO's superior efficiency in complex environments; the authors attribute this to the deterministic dynamics of the reasoning tasks and the strong prior imparted by LLM pretraining.
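For concreteness, maj@k takes a majority vote over the final answers of k sampled solutions, pass@k asks whether any of the k samples is correct, and rerank@k selects the sample scored highest by a learned reward model. A small self-contained sketch of the first two, for illustration only:

```python
from collections import Counter

def maj_at_k(sampled_answers: list[str], reference: str) -> bool:
    """Majority vote over k sampled final answers; correct if the most common one matches."""
    most_common_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return most_common_answer == reference

def pass_at_k(sampled_answers: list[str], reference: str) -> bool:
    """Correct if any of the k sampled answers matches the reference."""
    return reference in sampled_answers

# Toy stand-in for 96 sampled answers to one problem (cf. maj@96 / pass@96).
samples = ["42", "42", "41"] * 32
print(maj_at_k(samples, "42"))   # True
print(pass_at_k(samples, "43"))  # False
```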

Implications and Future Directions

Exploration Limitations and Role of Pretraining

A significant observation was that, during RL training, models explored little beyond the solutions already produced by SFT models or acquired during pretraining, suggesting a strong reliance on previously learned patterns. This underscores the critical role of pretraining in shaping LLMs' capabilities and highlights a potential bottleneck for further gains through RL, limited by the extent of exploration.

Theoretical and Practical RL Considerations

The study draws attention to the context-dependent performance of different RL algorithms, suggesting that environments with deterministic dynamics, such as reasoning tasks, may not fully benefit from the machinery of algorithms like PPO, which are designed for stochastic settings. Additionally, the findings advocate for broader exploration strategies that transcend the boundaries established by pretraining and fine-tuning, possibly through more sophisticated prompting strategies or hybrid approaches combining evolution-based methods with LLM generation.

Concluding Remarks

Havrilla et al.'s exploration into using RL for refining LLM reasoning ability delivers insightful comparisons across leading algorithms while exemplifying the critical influence of LLM pretraining. The convergence in performance between EI and PPO, despite their theoretical divergences, points to the nuanced interplay between algorithmic efficiency and the foundational role of pretraining in LLM task performance. As we look toward future developments, the quest for enhanced reasoning capabilities in AI may well depend on innovative strategies that promote genuine exploration and learning beyond the confines of existing knowledge, potentially reshaping our approach to AI reasoning.
