
Unified Preference Optimization: Language Model Alignment Beyond the Preference Frontier (2405.17956v4)

Published 28 May 2024 in cs.AI

Abstract: For aligning LLMs, prior work has leveraged reinforcement learning via human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood estimation, it compromises on the ability to easily tune LLMs to maximize auxiliary, non-preferential objectives according to the LLM designer's preferences (e.g., tuning lexical style or minimizing specific kinds of harmful content). Critically, these designer objectives may not be amply human-labeled or represented in available data, align with user preferences, or even be able to be captured tractably by binary preference pairs. To leverage the simplicity and performance of DPO with the generality of RL, we propose a unified approach. Based on a simple decomposition of preference and auxiliary objectives, we allow for tuning LLMs to optimize user and designer preferences without any additional specialized or preference data, computational cost, stability "tweaks", or training instability. The proposed method, Unified Preference Optimization, shows the ability to effectively generalize to user preferences and auxiliary objectives, while preserving or surpassing alignment performance on challenging benchmarks across a range of model sizes.
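The abstract describes combining a DPO-style preference objective with designer-specified auxiliary objectives. As a rough illustration of that general idea only, the sketch below pairs the standard DPO loss (Rafailov et al., 2023) with a weighted auxiliary term; the additive combination, `aux_weight`, and `aux_objective` are assumptions made for this sketch, not the paper's actual Unified Preference Optimization formulation.

```python
# Illustrative sketch, not the paper's method: a DPO preference term plus a
# generic designer-specified auxiliary term (e.g., a lexical-style or safety
# penalty). The combination scheme here is an assumption for illustration.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on per-sequence log-probabilities."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

def unified_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 aux_objective, aux_weight=0.1, beta=0.1):
    """Preference term plus a weighted auxiliary objective.

    `aux_objective` stands in for any differentiable scalar computed on the
    policy's outputs; `aux_weight` trades it off against the preference term.
    Both names are hypothetical, used only to illustrate the decomposition
    of user-preference and designer objectives described in the abstract.
    """
    pref = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta)
    return pref + aux_weight * aux_objective
```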
