Self-Exploring Language Models: Active Preference Elicitation for Online Alignment (2405.19332v3)
Abstract: Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning LLMs with human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate reward model and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces the indiscriminate favoring of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when used to fine-tune the Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as on various standard academic benchmarks across different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.
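To make the abstract's core idea concrete, below is a minimal PyTorch sketch of an optimism-augmented DPO-style loss. This is an illustration under stated assumptions, not the paper's exact derivation: the helper name `selm_style_loss`, the use of the chosen response's log-probability as the optimism bonus, and the default `alpha` value are placeholders chosen for exposition.

```python
import torch
import torch.nn.functional as F

def selm_style_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, alpha=1e-3):
    """Optimism-augmented DPO-style loss (illustrative sketch).

    Each argument is a (batch,)-shaped tensor of summed token
    log-probabilities for full responses. `beta` is the DPO KL
    coefficient; `alpha` scales the exploration bonus (both defaults
    here are placeholders, not tuned settings from the paper).
    """
    # DPO implicit rewards: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO loss: negative log-sigmoid of the reward margin.
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Optimism term: additionally raise the probability of responses that
    # the implicit reward currently scores highly, biasing the next round
    # of generation toward potentially high-reward, unexplored regions.
    optimism = -alpha * policy_chosen_logps

    return (dpo_loss + optimism).mean()
```

With `alpha = 0` this reduces to plain DPO; the extra term is what distinguishes the self-exploring update, which in the paper emerges from solving the inner level of the bilevel objective with the reparameterized reward.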
- Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
- A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
- Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Human alignment of large language models through online preference optimisation. arXiv preprint arXiv:2403.08635, 2024.
- Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
- Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems, 31, 2018.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023.
- Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
- RLHF workflow: From reward modeling to online RLHF. arXiv e-prints, 2024.
- Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
- Efficient exploration for LLMs. arXiv preprint arXiv:2402.00396, 2024.
- The rating of chessplayers: Past and present. Ishi Press International, 1978.
- KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
- Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
- Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
- Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
- Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024.
- Hoang Tran, Chris Glaze, and Braden Hancock. Snorkel-Mistral-PairRM-DPO. 2024.
- OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework, 2024.
- Camels in a changing climate: Enhancing LM adaptation with Tulu 2. arXiv preprint arXiv:2311.10702, 2023.
- LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.
- Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- Exploiting asymmetry for synthetic training data generation: SynthIE and the case of information extraction. arXiv preprint arXiv:2303.04132, 2023.
- sDPO: Don't use your data all at once. arXiv preprint arXiv:2403.19270, 2024.
- OpenAssistant Conversations: Democratizing large language model alignment. Advances in Neural Information Processing Systems, 36, 2024.
- Multi-modal preference alignment remedies regression of visual instruction tuning on language model. arXiv preprint arXiv:2402.10884, 2024.
- Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259, 2023.
- Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023.
- TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Best practices and lessons learned on synthetic data for language models, 2024.
- Maximize to explore: One objective function fusing estimation, planning, and exploration. Advances in Neural Information Processing Systems, 36, 2024.
- Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer, 2024.
- Ensemble sampling. Advances in Neural Information Processing Systems, 30, 2017.
- Meta. Introducing Meta Llama 3: The most capable openly available LLM to date. 2024.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- Language model alignment with elastic reset. Advances in Neural Information Processing Systems, 36, 2024.
- (More) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems, 26, 2013.
- Approximate thompson sampling via epistemic neural networks. In Uncertainty in Artificial Intelligence, pages 1586–1595. PMLR, 2023.
- Epistemic neural networks. Advances in Neural Information Processing Systems, 36, 2024.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Samuel J Paech. EQ-Bench: An emotional intelligence benchmark for large language models. arXiv preprint arXiv:2312.06281, 2023.
- Smaug: Fixing failure modes of preference optimisation with DPO-Positive. arXiv preprint arXiv:2402.13228, 2024.
- Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023.
- From $r$ to $Q^*$: Your language model is secretly a Q-function. arXiv preprint arXiv:2404.12358, 2024.
- Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- Direct nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715, 2024.
- Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013.
- Malcolm Strens. A bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943–950, 2000.
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 36, 2024.
- Understanding the performance gap between online and offline alignment algorithms. arXiv preprint arXiv:2405.08448, 2024.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- The alignment handbook. https://github.com/huggingface/alignment-handbook, 2023.
- Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944, 2023.
- Enhancing visual-language modality alignment in large vision language models via self-improvement, 2024.
- How far can camels go? Exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems, 36, 2024.
- Self-evolved diverse data sampling for efficient instruction tuning. arXiv preprint arXiv:2311.08182, 2023.
- Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675, 2024.
- Gibbs sampling from human feedback: A provable KL-constrained framework for RLHF. arXiv preprint arXiv:2312.11456, 2023.
- Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682, 2023.
- Is DPO superior to PPO for LLM alignment? A comprehensive study. arXiv preprint arXiv:2404.10719, 2024.
- Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.
- Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
- Iterative reasoning preference optimization. arXiv e-prints, 2024.
- HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- Shenao Zhang. Conservative dual policy optimization for efficient model-based reinforcement learning. Advances in Neural Information Processing Systems, 35:25450–25463, 2022.
- SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
- Starling-7B: Improving LLM helpfulness and harmlessness with RLAIF, November 2023.