
From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function (2404.12358v2)

Published 18 Apr 2024 in cs.LG

Abstract: Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. We theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.


Summary

  • The paper introduces a theoretical framework where DPO acts as an inverse Q-learning algorithm, linking token-level MDPs with reinforcement learning principles.
  • It demonstrates empirically that DPO-trained models achieve effective token-level credit assignment, improving performance metrics in beam search experiments.
  • The paper discusses practical implications for enhancing multi-turn dialogues, end-to-end generative systems, and autonomous agent behaviors in AI.

From $r$ to $Q^*$: Your LLM is Secretly a Q-Function

Introduction to RLHF and DPO

Reinforcement Learning from Human Feedback (RLHF) plays an essential role in aligning LLMs with human intent. Traditional RLHF methods use RL algorithms such as PPO to fine-tune models against a reward model learned from human feedback. Direct Preference Optimization (DPO) has emerged as an alternative that simplifies this pipeline by optimizing the model directly on preference data, without an explicit reward model. This paper connects DPO to the token-level Markov Decision Process (MDP) underlying language generation, showing that DPO can be derived as a general inverse Q-learning algorithm that satisfies the Bellman equation.
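
For reference, the two objectives being compared can be stated compactly; the following is a standard restatement in the DPO paper's notation rather than a new result. The RLHF objective maximizes reward under a KL penalty toward a reference policy, and DPO rewrites it as a classification loss over preference pairs $(x, y_w, y_l)$ with $y_w$ preferred over $y_l$:

```latex
% KL-regularized RLHF objective over prompts x and responses y
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]

% DPO's equivalent loss on preference pairs (y_w preferred over y_l)
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```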

DPO and Token-Level MDP

In traditional RLHF, LLMs model token sequences as trajectories in an MDP whose states are token prefixes, whose actions are vocabulary entries, and whose rewards are derived from human feedback. Classical RLHF applies these rewards sparsely, at terminal states, and drives optimization with policy gradient techniques. In contrast, DPO frames the problem as a contextual bandit, treating an entire response as a single decision rather than a sequence of token-level steps. The paper's derivation connects DPO's bandit formulation to the token-level MDP, implying that DPO implicitly learns a per-token reward and therefore a Q-function over tokens (Figure 1).

Figure 1: Credit assignment in DPO based on answer-level feedback. Each token is colored according to the DPO implicit reward.
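
In paraphrased notation (the precise statement and assumptions are in the paper), the token-level reading rests on the fact that the optimal policy of the KL-regularized token-level MDP is a reweighting of the reference policy by an exponentiated advantage, so the per-token log-ratio encodes $Q^* - V^*$ and its sum along a response telescopes into the sequence-level return:

```latex
\pi^*(a_t \mid s_t) = \pi_{\mathrm{ref}}(a_t \mid s_t)\,
\exp\!\big((Q^*(s_t, a_t) - V^*(s_t))/\beta\big)
\;\;\Longleftrightarrow\;\;
\beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
= Q^*(s_t, a_t) - V^*(s_t)

% With the (soft) Bellman relation Q^*(s_t, a_t) = r(s_t, a_t) + V^*(s_{t+1})
% and V^* = 0 at terminal states, the sum over a response telescopes:
\sum_{t} \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
= \sum_{t} r(s_t, a_t) \;-\; V^*(s_0)
```

Since $V^*(s_0)$ depends only on the prompt, it cancels when two responses to the same prompt are compared under a Bradley-Terry model, which is why the bandit-level DPO loss and the token-level view coincide.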

Empirical Insights and Theoretical Validation

The authors empirically demonstrate that DPO-trained models perform token-level credit assignment akin to what RLHF could achieve with dense rewards, which underpins the token-level interpretation of DPO's learning dynamics. The paper further shows that search-based algorithms such as MCTS, recently applied to language generation, are equivalent to likelihood-based search over a DPO policy; empirically, a simple beam search over the DPO policy yields meaningful improvements in win rate (Figure 2).

Figure 2: Model performance under beam search, showing win rates and increasing verbosity beyond five beams.
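
To make the likelihood-based search concrete, here is a minimal beam search sketch over a trained policy; the `next_token_logprobs` helper and the token IDs are placeholders for illustration, not the authors' implementation:

```python
import heapq
from typing import Callable, List, Tuple

def beam_search(
    prompt_ids: List[int],
    next_token_logprobs: Callable[[List[int]], List[Tuple[int, float]]],
    eos_id: int,
    num_beams: int = 5,
    max_new_tokens: int = 128,
) -> List[int]:
    """Rank hypotheses by cumulative log-likelihood under the DPO-trained policy.

    Because the DPO policy's log-ratio against the reference acts as an implicit
    per-token reward, searching for high-likelihood continuations is (up to the
    reference term) a search for high implicit reward.
    """
    # Each beam is (cumulative log-prob, token ids, finished flag).
    beams = [(0.0, list(prompt_ids), False)]
    for _ in range(max_new_tokens):
        candidates = []
        for logp, ids, done in beams:
            if done:
                candidates.append((logp, ids, True))
                continue
            # Placeholder helper: returns (token_id, log_prob) pairs, highest first.
            for tok, tok_logp in next_token_logprobs(ids)[:num_beams]:
                candidates.append((logp + tok_logp, ids + [tok], tok == eos_id))
        # Keep the best `num_beams` partial hypotheses.
        beams = heapq.nlargest(num_beams, candidates, key=lambda b: b[0])
        if all(done for _, _, done in beams):
            break
    return max(beams, key=lambda b: b[0])[1]
```

As Figure 2 suggests, a handful of beams is enough to improve win rate, while much wider beams mostly add verbosity rather than quality.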

Performance Degradation and Implicit Rewards

The counterintuitive phenomenon of decreasing likelihoods during DPO training is explained through the lens of maximum entropy RL. The paper shows that the implicit rewards modeled by DPO decline over training when SFT precedes DPO, an expected consequence of DPO's entropy-regularized objective. Figure 3 tracks the evolution of implicit rewards during training, confirming this behavior under different initialization conditions.

Figure 3: Evolution of implicit rewards for DPO and CPL during training, indicating reward dynamics under various starting conditions.
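
For intuition about what Figures 1 and 3 plot, the per-token implicit reward can be computed directly from per-token log-probabilities of a generated response under the policy and the reference; `policy_logprobs` and `ref_logprobs` below are hypothetical helpers standing in for whatever scoring code one already has:

```python
from typing import List

def per_token_implicit_rewards(
    policy_logprobs: List[float],  # log pi_theta(a_t | s_t) for each response token
    ref_logprobs: List[float],     # log pi_ref(a_t | s_t) for the same tokens
    beta: float = 0.1,
) -> List[float]:
    """DPO's implicit per-token reward: beta times the policy/reference log-ratio.

    Coloring response tokens by these values gives a Figure 1-style credit map;
    averaging them over chosen responses across training checkpoints gives
    Figure 3-style curves.
    """
    return [beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]

# At initialization from the SFT/reference checkpoint the policy equals the
# reference, so every implicit reward starts at exactly zero; the paper's
# observation is that, when SFT precedes DPO, these rewards then decline
# over the course of training.
```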

Practical Implications and Future Directions

The findings suggest several practical applications and future research avenues, including:

  • Reasoning and Multi-turn Dialogue: Given DPO's capacity for per-token reward modeling, the approach could be extended to multi-turn dialogue systems, where credit assignment across turns offers better conversational alignment than current single-response bandit formulations.
  • End-to-End Generative Systems: DPO provides a cohesive framework for jointly training prompt generators and conditioning models, optimizing whole multi-model systems from direct feedback (Figure 4).
  • Autonomous Agent Behavior: The ability to learn token-level implicit rewards opens the possibility of applying DPO to agentic LLMs, promoting behaviors optimized for task-specific objectives derived from preferences.

Figure 4: End-to-end generative AI workflow, highlighting interactions between user prompts, refined descriptions, and image generation models.

Conclusion

This paper bridges the conceptual gap between DPO as a bandit-based method and the reinforcement learning algorithms traditionally employed in RLHF pipelines. By framing DPO as a solution in the token-level MDP, the work shows that LLMs can, and do, embed optimal Q-functions through preference-driven learning. These theoretical advances extend DPO's applicability to more complex systems, from multi-turn dialogue to multi-model pipelines integrating speech and vision, suggesting broader impact across AI domains.
