Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning (2407.02119v2)

Published 2 Jul 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Reinforcement learning with human feedback (RLHF), a widely adopted approach in current LLM pipelines, is bottlenecked by the size of human preference data. While traditional methods rely on offline preference dataset construction, recent approaches have shifted towards online settings, where a learner uses a small amount of labeled seed data and a large pool of unlabeled prompts to iteratively construct new preference data through self-generated responses and high-quality reward/preference feedback. However, most current online algorithms still focus on preference labeling during policy model updates with given feedback oracles, which incurs significant expert query costs. We are the first to explore cost-effective proxy reward oracle construction strategies for further labeling preferences or rewards with extremely limited labeled data and expert query budgets. Our approach introduces two key innovations: (1) on-policy querying to avoid OOD and imbalance issues in the seed data, and (2) active learning to select the most informative data for preference queries. Using these methods, we train an evaluation model with minimal expert-labeled data, which then effectively labels nine times more preference pairs for further RLHF training. For instance, our model using Direct Preference Optimization (DPO) gains over 1% average improvement on AlpacaEval2, MMLU-5shot, and MMLU-0shot, with a query cost of only 1.7K. Our methodology is orthogonal to other direct expert query-based strategies and can therefore be integrated with them to further reduce query costs.
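The abstract describes a pipeline rather than an implementation, so the following is a minimal Python sketch of how the described steps could fit together: sample on-policy response pairs, spend the limited expert budget on the pairs an acquisition rule deems most informative, and let the trained proxy reward model label the remaining pool for DPO-style training. The margin-based acquisition rule, the `Pair`, `select_for_expert_query`, and `label_with_proxy` names, and the 100-versus-900 split are illustrative assumptions, not details taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pair:
    prompt: str
    response_a: str  # sampled on-policy from the current policy model
    response_b: str  # a second on-policy sample for the same prompt


def acquisition_score(pair: Pair, proxy_score: Callable[[str, str], float]) -> float:
    """Pairs with the smallest current reward margin are treated as the most
    informative ones to send to the expert labeler (uncertainty sampling)."""
    margin = abs(
        proxy_score(pair.prompt, pair.response_a)
        - proxy_score(pair.prompt, pair.response_b)
    )
    return -margin  # higher score = more uncertain = more worth querying


def select_for_expert_query(
    pool: List[Pair], proxy_score: Callable[[str, str], float], budget: int
) -> List[Pair]:
    """Spend the limited expert budget on the highest-uncertainty pairs."""
    ranked = sorted(pool, key=lambda p: acquisition_score(p, proxy_score), reverse=True)
    return ranked[:budget]


def label_with_proxy(
    pool: List[Pair], proxy_score: Callable[[str, str], float]
) -> List[Tuple[str, str, str]]:
    """After the proxy reward model is trained on the expert-labeled subset,
    use it to produce (prompt, chosen, rejected) triples for DPO training."""
    triples = []
    for p in pool:
        if proxy_score(p.prompt, p.response_a) >= proxy_score(p.prompt, p.response_b):
            triples.append((p.prompt, p.response_a, p.response_b))
        else:
            triples.append((p.prompt, p.response_b, p.response_a))
    return triples


if __name__ == "__main__":
    # Stand-in scorer; in practice this would be the trained evaluation model.
    scorer = lambda prompt, response: random.random()
    pool = [Pair(f"prompt {i}", f"answer A{i}", f"answer B{i}") for i in range(1000)]

    expert_batch = select_for_expert_query(pool, scorer, budget=100)
    remaining = [p for p in pool if p not in expert_batch]
    dpo_data = label_with_proxy(remaining, scorer)

    print(f"{len(expert_batch)} pairs sent to the expert labeler; "
          f"{len(dpo_data)} pairs labeled by the proxy reward model.")
```

In a real run, `scorer` would be replaced by the trained evaluation model from the abstract and `dpo_data` would feed a DPO training loop; the 100-versus-900 split simply mirrors the roughly nine-to-one ratio of proxy-labeled to expert-labeled pairs mentioned above.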

References (33)
  1. Gone fishing: Neural active learning with Fisher embeddings. Advances in Neural Information Processing Systems, 34:8927–8939.
  2. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671.
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  4. An experimental design framework for label-efficient supervised finetuning of large language models. arXiv preprint arXiv:2401.06692.
  5. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.
  6. Batch active learning at scale. Advances in Neural Information Processing Systems, 34:11933–11944.
  7. Combinatorial optimisation. Wiley-Interscience Series in Discrete Mathematics and Optimization, USA, 1:998.
  8. UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
  9. RLHF workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863.
  10. REBEL: Reinforcement learning via regressing relative rewards.
  11. Yonatan Geifman and Ran El-Yaniv. 2017. Deep active learning over the long tail. arXiv preprint arXiv:1711.00941.
  12. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
  13. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178.
  14. OpenAssistant Conversations: Democratizing large language model alignment. Advances in Neural Information Processing Systems, 36.
  15. RewardBench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787.
  16. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
  17. Confronting reward model overoptimization with constrained RLHF. arXiv preprint arXiv:2310.04373.
  18. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  19. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
  20. Direct Nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715.
  21. Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations.
  22. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051.
  23. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  24. Iterative DPO alignment. Technical report, Snorkel AI.
  25. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508.
  26. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022.
  27. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675.
  28. Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF.
  29. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.
  30. Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682.
  31. Self-rewarding language models. arXiv preprint arXiv:2401.10020.
  32. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.
  33. Starling-7B: Improving LLM helpfulness & harmlessness with RLAIF.
