
From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function (2404.12358v2)

Published 18 Apr 2024 in cs.LG

Abstract: Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. We theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.


Summary

  • The paper introduces a theoretical framework where DPO acts as an inverse Q-learning algorithm, linking token-level MDPs with reinforcement learning principles.
  • It demonstrates empirically that DPO-trained models achieve effective token-level credit assignment, improving performance metrics in beam search experiments.
  • The paper discusses practical implications for enhancing multi-turn dialogues, end-to-end generative systems, and autonomous agent behaviors in AI.

From $r$ to $Q^*$: Your LLM is Secretly a Q-Function

Introduction to RLHF and DPO

Reinforcement Learning from Human Feedback (RLHF) plays an essential role in aligning LLMs with human intent. Traditional RLHF methods use RL algorithms such as PPO to fine-tune models against a reward model learned from human feedback. Direct Preference Optimization (DPO) has emerged as an alternative that simplifies this pipeline by optimizing the model directly on preference data, without an explicit reward model. This paper connects DPO to the token-level Markov Decision Process (MDP) underlying language generation, showing that DPO can be derived as a general inverse Q-learning algorithm that satisfies the Bellman equation.
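
For reference, the two objectives being compared can be stated compactly; the following is a standard restatement in the DPO paper's notation rather than a new result. The RLHF objective maximizes reward under a KL penalty toward a reference policy, and DPO rewrites it as a classification loss over preference pairs $(x, y_w, y_l)$ with $y_w$ preferred over $y_l$:

```latex
% KL-regularized RLHF objective over prompts x and responses y
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]

% DPO's equivalent loss on preference pairs (y_w preferred over y_l)
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```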

DPO and Token-Level MDP

In traditional RLHF, LLMs model token sequences as trajectories in an MDP whose states are token prefixes, whose actions are vocabulary entries, and whose rewards are derived from human feedback. Classical RLHF applies these rewards sparsely, at terminal states, and drives optimization with policy gradient techniques. In contrast, DPO frames the problem as a contextual bandit, treating an entire response as a single decision rather than a sequence of token-level steps. The paper's derivation connects DPO's bandit formulation to the token-level MDP, implying that DPO implicitly learns a per-token reward and therefore a Q-function over tokens (Figure 1).

Figure 1: Credit assignment in DPO based on answer-level feedback. Each token is colored according to the DPO implicit reward.
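
In paraphrased notation (the precise statement and assumptions are in the paper), the token-level reading rests on the fact that the optimal policy of the KL-regularized token-level MDP is a reweighting of the reference policy by an exponentiated advantage, so the per-token log-ratio encodes $Q^* - V^*$ and its sum along a response telescopes into the sequence-level return:

```latex
\pi^*(a_t \mid s_t) = \pi_{\mathrm{ref}}(a_t \mid s_t)\,
\exp\!\big((Q^*(s_t, a_t) - V^*(s_t))/\beta\big)
\;\;\Longleftrightarrow\;\;
\beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
= Q^*(s_t, a_t) - V^*(s_t)

% With the (soft) Bellman relation Q^*(s_t, a_t) = r(s_t, a_t) + V^*(s_{t+1})
% and V^* = 0 at terminal states, the sum over a response telescopes:
\sum_{t} \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
= \sum_{t} r(s_t, a_t) \;-\; V^*(s_0)
```

Since $V^*(s_0)$ depends only on the prompt, it cancels when two responses to the same prompt are compared under a Bradley-Terry model, which is why the bandit-level DPO loss and the token-level view coincide.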

Empirical Insights and Theoretical Validation

The authors empirically demonstrate that DPO-trained models perform token-level credit assignment akin to what RLHF could achieve with dense rewards, which underpins the token-level interpretation of DPO's learning dynamics. The paper further shows that search-based algorithms such as MCTS, recently applied to language generation, are equivalent to likelihood-based search over a DPO policy; empirically, a simple beam search over the DPO policy yields meaningful improvements in win rate (Figure 2).

Figure 2: Model performance under beam search, showing win rates and increasing verbosity beyond five beams.
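
To make the likelihood-based search concrete, here is a minimal beam search sketch over a trained policy; the `next_token_logprobs` helper and the token IDs are placeholders for illustration, not the authors' implementation:

```python
import heapq
from typing import Callable, List, Tuple

def beam_search(
    prompt_ids: List[int],
    next_token_logprobs: Callable[[List[int]], List[Tuple[int, float]]],
    eos_id: int,
    num_beams: int = 5,
    max_new_tokens: int = 128,
) -> List[int]:
    """Rank hypotheses by cumulative log-likelihood under the DPO-trained policy.

    Because the DPO policy's log-ratio against the reference acts as an implicit
    per-token reward, searching for high-likelihood continuations is (up to the
    reference term) a search for high implicit reward.
    """
    # Each beam is (cumulative log-prob, token ids, finished flag).
    beams = [(0.0, list(prompt_ids), False)]
    for _ in range(max_new_tokens):
        candidates = []
        for logp, ids, done in beams:
            if done:
                candidates.append((logp, ids, True))
                continue
            # Placeholder helper: returns (token_id, log_prob) pairs, highest first.
            for tok, tok_logp in next_token_logprobs(ids)[:num_beams]:
                candidates.append((logp + tok_logp, ids + [tok], tok == eos_id))
        # Keep the best `num_beams` partial hypotheses.
        beams = heapq.nlargest(num_beams, candidates, key=lambda b: b[0])
        if all(done for _, _, done in beams):
            break
    return max(beams, key=lambda b: b[0])[1]
```

As Figure 2 suggests, a handful of beams is enough to improve win rate, while much wider beams mostly add verbosity rather than quality.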

Performance Degradation and Implicit Rewards

The counterintuitive phenomenon of decreasing likelihoods during DPO training is explained through the lens of maximum entropy RL. The paper shows that the implicit rewards modeled by DPO decline over training when SFT precedes DPO, an expected consequence of DPO's entropy-regularized objective. Figure 3 tracks the evolution of implicit rewards during training, confirming this behavior under different initialization conditions.

Figure 3: Evolution of implicit rewards for DPO and CPL during training, indicating reward dynamics under various starting conditions.
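
For intuition about what Figures 1 and 3 plot, the per-token implicit reward can be computed directly from per-token log-probabilities of a generated response under the policy and the reference; `policy_logprobs` and `ref_logprobs` below are hypothetical helpers standing in for whatever scoring code one already has:

```python
from typing import List

def per_token_implicit_rewards(
    policy_logprobs: List[float],  # log pi_theta(a_t | s_t) for each response token
    ref_logprobs: List[float],     # log pi_ref(a_t | s_t) for the same tokens
    beta: float = 0.1,
) -> List[float]:
    """DPO's implicit per-token reward: beta times the policy/reference log-ratio.

    Coloring response tokens by these values gives a Figure 1-style credit map;
    averaging them over chosen responses across training checkpoints gives
    Figure 3-style curves.
    """
    return [beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]

# At initialization from the SFT/reference checkpoint the policy equals the
# reference, so every implicit reward starts at exactly zero; the paper's
# observation is that, when SFT precedes DPO, these rewards then decline
# over the course of training.
```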

Practical Implications and Future Directions

The findings suggest several practical applications and future research avenues, including:

  • Reasoning and Multi-turn Dialogue: Given DPO's capacity for per-token reward modeling, the approach could be extended to multi-turn dialogue systems, where credit assignment across turns offers better conversational alignment than current single-response bandit formulations.
  • End-to-End Generative Systems: DPO provides a cohesive framework for jointly training prompt generators and conditioning models, optimizing whole multi-model systems from direct feedback (Figure 4).
  • Autonomous Agent Behavior: The ability to learn token-level implicit rewards opens the possibility of applying DPO to agentic LLMs, promoting behaviors optimized for task-specific objectives derived from preferences.

Figure 4: End-to-end generative AI workflow, highlighting interactions between user prompts, refined descriptions, and image generation models.

Conclusion

This paper bridges the conceptual gap between DPO as a bandit-based method and the reinforcement learning algorithms traditionally employed in RLHF pipelines. By framing DPO as a solution in the token-level MDP, the work shows that LLMs can, and do, embed optimal Q-functions through preference-driven learning. These theoretical advances extend DPO's applicability to more complex systems, from multi-turn dialogue to multi-model pipelines integrating speech and vision, suggesting broader impact across AI domains.
