Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment (2405.17931v1)
Abstract: Effectively aligning LLMs with human-centric values while preventing the degradation of abilities acquired through pre-training and Supervised Fine-tuning (SFT) poses a central challenge in Reinforcement Learning from Human Feedback (RLHF). In this paper, we first find that interpolating RLHF and SFT model parameters adjusts the trade-off between human preference and basic capabilities, reducing the alignment tax at the cost of alignment reward. Inspired by this, we propose merging the RL policy and SFT models at each optimization step in RLHF to continuously regulate the training direction, and introduce the Online Merging Optimizer. Specifically, we merge gradients with the parameter differences between the SFT and pretrained models, steering the gradient towards maximizing rewards along the direction of SFT optimization. We demonstrate that our optimizer works well with different LLM families such as Qwen and LLaMA, model sizes ranging from 1.8B to 8B, RLHF algorithms such as DPO and KTO, and existing model merging methods. It significantly enhances alignment reward while mitigating alignment tax, achieving higher overall performance across 14 benchmarks.
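The abstract describes the optimizer only at a high level, so the snippet below is a minimal PyTorch sketch of the idea rather than the paper's exact update rule. It treats the SFT-minus-pretrained parameter delta as the "SFT direction" and merges each RLHF gradient step with that delta before applying it. The sign-agreement filter and the blending weight `lam` are illustrative assumptions, as is the `interpolate` helper that probes the reward/tax trade-off by linearly mixing RLHF and SFT weights.

```python
# Illustrative sketch only; the exact merging rule is not specified in the abstract.
import torch


@torch.no_grad()
def interpolate(rlhf_params, sft_params, alpha=0.5):
    """Offline interpolation between RLHF and SFT weights (the trade-off probe
    from the abstract): alpha=1 recovers the RLHF policy, alpha=0 the SFT model."""
    return [alpha * p_rl + (1.0 - alpha) * p_sft
            for p_rl, p_sft in zip(rlhf_params, sft_params)]


@torch.no_grad()
def online_merge_step(policy_params, sft_params, pretrained_params,
                      lr=1e-6, lam=0.1):
    """One hypothetical optimizer step that merges the RLHF gradient update
    with the SFT-minus-pretrained parameter delta before applying it."""
    for p, p_sft, p_pre in zip(policy_params, sft_params, pretrained_params):
        if p.grad is None:
            continue
        delta_sft = p_sft - p_pre           # direction learned during SFT
        update = -lr * p.grad               # plain gradient-descent RLHF update
        # Keep only the update components whose sign agrees with the SFT direction
        # (assumed merging heuristic, in the spirit of sign-based model merging).
        agree = torch.sign(update) == torch.sign(delta_sft)
        merged = torch.where(agree, update, torch.zeros_like(update))
        # Blend in a small step along the SFT delta itself (lam is a made-up knob).
        p.add_(merged + lam * lr * delta_sft)
```

In practice such a step would replace the plain parameter update of an off-the-shelf optimizer at every RLHF iteration, which is what "online" merging means here, as opposed to merging checkpoints once after training.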
- Better fine-tuning by reducing representational collapse. In International Conference on Learning Representations.
- Direct preference optimization with an offset. arXiv preprint arXiv:2402.10571.
- Program synthesis with large language models.
- A general theoretical paradigm to understand learning from human preferences.
- Qwen technical report.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- LoRA learns less and forgets less.
- Evaluating large language models trained on code.
- Training verifiers to solve math word problems.
- UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
- How abilities in large language models are affected by supervised fine-tuning data composition.
- RAFT: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767.
- DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs.
- Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.
- AlpacaFarm: A simulation framework for methods that learn from human feedback.
- The role of permutation invariance in linear mode connectivity of neural networks. arXiv preprint arXiv:2110.06296.
- KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
- Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR.
- CodeApex: A bilingual programming evaluation benchmark for large language models.
- Loss surfaces, mode connectivity, and fast ensembling of DNNs. Advances in Neural Information Processing Systems, 31.
- An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
- Direct language model alignment from online ai feedback.
- Measuring massive multitask language understanding.
- ORPO: Monolithic preference optimization without reference model.
- LoRA: Low-rank adaptation of large language models.
- J. Stuart Hunter. 1986. The exponentially weighted moving average. Journal of Quality Technology, 18(4):203–210.
- Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations.
- AI alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852.
- SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
- ROSE: Robust selective fine-tuning for pre-trained language models.
- Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526.
- DS-1000: A natural and reliable benchmark for data science code generation.
- Mixout: Effective regularization to finetune large-scale pretrained language models.
- Zhizhong Li and Derek Hoiem. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947.
- R-Drop: Regularized dropout for neural networks.
- Mitigating the alignment tax of RLHF.
- Spurious feature diversification improves out-of-distribution generalization.
- Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in Adam.
- #InsTag: Instruction tagging for analyzing supervised fine-tuning of large language models. In The Twelfth International Conference on Learning Representations.
- Michael Matena and Colin Raffel. 2022. Merging models with Fisher-weighted averaging.
- Language model alignment with elastic reset.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Disentangling length from quality in direct preference optimization.
- Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
- iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010.
- CoQA: A conversational question answering challenge.
- Proximal policy optimization algorithms.
- Preference ranking optimization for human alignment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18990–18998.
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 36.
- Merging by matching models in task parameter subspaces. Transactions on Machine Learning Research.
- Zephyr: Direct distillation of LM alignment.
- A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854, Copenhagen, Denmark. Association for Computational Linguistics.
- Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966.
- Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.
- Self-evolved diverse data sampling for efficient instruction tuning.
- Self-play preference optimization for language model alignment.
- Raise a child in large language model: Towards effective and generalizable fine-tuning.
- Is DPO superior to PPO for LLM alignment? A comprehensive study. arXiv preprint arXiv:2404.10719.
- TIES-merging: Resolving interference when merging models. In Thirty-seventh Conference on Neural Information Processing Systems.
- Language models are Super Mario: Absorbing abilities from homologous models as a free lunch.
- HyPe: Better pre-trained language model fine-tuning with hidden representation perturbation.
- How well do large language models perform in arithmetic tasks?
- RRHF: Rank responses to align language models with human feedback without tears.
- GaLore: Memory-efficient LLM training by gradient low-rank projection.
- A survey of large language models. arXiv preprint arXiv:2303.18223.
- Weak-to-strong extrapolation expedites alignment. arXiv preprint arXiv:2404.16792.
- Secrets of RLHF in large language models part I: PPO. arXiv preprint arXiv:2307.04964.
- Instruction-following evaluation for large language models.