RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models (2402.10038v2)
Abstract: Reinforcement learning from human feedback (RLHF) has been extensively employed to align LLMs with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requires significant hyperparameter tuning, and is computationally expensive when maximizing the estimated reward during alignment. Direct preference optimization (DPO) was recently proposed to address these challenges. However, DPO relies on contrastive responses generated by human annotators and an alternative LLM rather than by the policy model, which limits the effectiveness of RLHF. In this paper, we address both challenges by systematically combining rejection sampling (RS) and DPO. Our proposed method, RS-DPO, starts by training a supervised fine-tuned (SFT) policy model. A varied set of k responses per prompt is then sampled directly from the SFT model. RS-DPO identifies pairs of contrastive samples based on their reward distribution, and finally applies DPO to these contrastive pairs to align the model with human preference. Our experiments indicate that the proposed method effectively fine-tunes LLMs in limited-resource environments, leading to improved alignment with user intent, and that it outperforms existing methods, including RS, PPO, and DPO.
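To make the pair-construction step concrete, below is a minimal Python sketch of how contrastive (chosen, rejected) pairs could be assembled from SFT samples. It assumes hypothetical `sft_generate` and `reward_model` callables and a simple reward-gap threshold `gamma`; the paper's exact selection criterion over the reward distribution may differ, so treat this as an illustration rather than the reference implementation.

```python
import itertools


def build_preference_pairs(prompts, sft_generate, reward_model, k=8, gamma=0.5):
    """Construct contrastive (chosen, rejected) pairs for DPO from SFT samples.

    `sft_generate(prompt, k)` and `reward_model(prompt, response)` are assumed
    interfaces; `gamma` is a hypothetical reward-gap threshold.
    """
    pairs = []
    for prompt in prompts:
        # Step 1: sample k diverse responses directly from the SFT policy model.
        responses = sft_generate(prompt, k)
        # Step 2: score each sampled response with the reward model.
        rewards = [reward_model(prompt, r) for r in responses]
        # Step 3: keep pairs whose reward gap exceeds the threshold; the
        # higher-reward response is "chosen", the lower one "rejected".
        for (r_i, s_i), (r_j, s_j) in itertools.combinations(zip(responses, rewards), 2):
            if abs(s_i - s_j) >= gamma:
                chosen, rejected = (r_i, r_j) if s_i > s_j else (r_j, r_i)
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting preference pairs can then be passed to a standard DPO training loop in place of pairs labeled by human annotators or an external LLM.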