Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards (2402.18571v3)
Abstract: Fine-grained control over LLMs remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on a scalar reward often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directional Preference Alignment (DPA) framework. Unlike scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles, and it models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. Our method trains a multi-objective reward model and then fine-tunes the LLM with a preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method adopted by Llama 2; this yields a better performance trade-off across the reward objectives. Compared with scalar-reward RLHF, DPA gives users intuitive control over LLM generation: they can arithmetically specify their desired trade-off (e.g., more helpfulness with less verbosity). We validate the effectiveness of DPA with real-world alignment experiments on Mistral-7B. Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while remaining competitive with strong baselines such as Direct Preference Optimization (DPO).
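To make the mechanism concrete, here is a minimal sketch (not the paper's implementation) of how a user-specified preference direction can scalarize a multi-objective reward and drive preference-conditioned rejection sampling. The `reward_model` and `generate` callables and the two-objective (helpfulness, verbosity) setup are illustrative assumptions, not an exact reproduction of the paper's training pipeline.

```python
import numpy as np

def directional_reward(reward_vector, preference_direction):
    """Scalarize a multi-objective reward with a user preference direction.

    reward_vector: per-objective scores for one response,
        e.g. [helpfulness, verbosity] (objective names are illustrative).
    preference_direction: user preference in reward space; normalized to a
        unit vector so only its direction matters.
    """
    v = np.asarray(preference_direction, dtype=float)
    v = v / np.linalg.norm(v)
    return float(np.dot(v, np.asarray(reward_vector, dtype=float)))

def rejection_sampling_round(prompts, generate, reward_model,
                             preference_direction, k=8):
    """One preference-conditioned rejection-sampling round (sketch).

    For each prompt, draw k candidate responses conditioned on the
    preference direction, score them with the directional reward, and
    keep the best one as a supervised fine-tuning target.
    """
    selected = []
    for prompt in prompts:
        candidates = [generate(prompt, preference_direction) for _ in range(k)]
        scores = [
            directional_reward(reward_model(prompt, c), preference_direction)
            for c in candidates
        ]
        selected.append((prompt, candidates[int(np.argmax(scores))]))
    return selected  # (prompt, best response) pairs for fine-tuning

# Example direction: favor helpfulness, penalize verbosity.
# preference = [0.9, -0.45]  # normalized inside directional_reward
```

Under this view, asking for "more helpfulness with less verbosity" simply means choosing a direction with a large helpfulness component and a negative verbosity component; the rest of the loop (sample, score, keep the argmax response for supervised fine-tuning) is unchanged.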