Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging (2310.11564v1)
Abstract: While Reinforcement Learning from Human Feedback (RLHF) aligns LLMs with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. In this work, we study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem. Compared to strong single-objective baselines, we show that we can achieve personalized alignment by decomposing preferences into multiple dimensions, defined based on personalizations that the user declares as desirable. We show that these dimensions can be trained independently and efficiently in a distributed manner, and then combined effectively post-hoc through parameter merging. The code is available at https://github.com/joeljang/RLPHF.
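To make the post-hoc parameter-merging step concrete, below is a minimal sketch (not the authors' exact recipe): it assumes each preference dimension yields its own independently fine-tuned parameter set (e.g., LoRA deltas) stored as a state dict with matching keys, and the `merge_parameters` helper and the equal mixture weights are illustrative assumptions.

```python
# Minimal sketch of post-hoc parameter merging across preference dimensions.
# Assumption: each preference dimension has its own fine-tuned parameter set
# (e.g., LoRA deltas) with identical keys and shapes.
import torch

def merge_parameters(state_dicts, weights):
    """Return a weighted average of several parameter sets (one per dimension)."""
    assert len(state_dicts) == len(weights)
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize mixture weights
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# Usage example: combine two preference-specific experts with equal weight.
expert_a = {"lora_A": torch.randn(8, 16), "lora_B": torch.randn(16, 8)}
expert_b = {"lora_A": torch.randn(8, 16), "lora_B": torch.randn(16, 8)}
personalized = merge_parameters([expert_a, expert_b], weights=[0.5, 0.5])
```

Because the experts are trained independently, the merge itself requires no further gradient updates; adjusting the mixture weights is one simple way to trade off conflicting preference dimensions at inference time.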