Self-Rewarding Language Models

(2401.10020)
Published Jan 18, 2024 in cs.CL and cs.AI

Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may be bottlenecked by human performance level; moreover, these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but so does the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.

Figure: Diagram depicting how the language model generates rewards for its own improvement during training.

Overview

  • The study introduces Self-Rewarding Language Models as a framework where LLMs generate responses and also evaluate their quality for self-improvement.

  • Through Iterative Direct Preference Optimization (DPO), the LLMs use self-instruction and self-judgment to enhance their performance beyond what static, human-derived reward models allow.

  • In experiments with the Llama 2 70B model, these self-rewarding models exhibited improved instruction-following performance and reward-evaluation ability.

  • Self-Rewarding Language Models demonstrate the potential to surpass existing LLMs trained on extensive proprietary datasets.

  • The approach may revolutionize LLM training, though further research is needed to explore long-term impacts and safety implications.

Introduction

Aligning LLMs with human values and preferences is critical for their effective and safe deployment. LLM training typically incorporates human preference data to tune models for better task compliance, using approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). However, these methods are limited by the finite scope of available human feedback and by the static nature of externally built reward models. This paper examines Self-Rewarding Language Models, in which the LLM acts as both respondent to tasks and judge of its own responses, establishing a framework for self-improving, dynamic reward modeling.
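For context, DPO dispenses with an explicitly trained reward model and optimizes the policy directly on preference pairs. A standard statement of the DPO objective, included here for reference, is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\ \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

Here \(\pi_\theta\) is the model being trained, \(\pi_{\mathrm{ref}}\) a frozen reference model, \((x, y_w, y_l)\) a prompt with its chosen and rejected responses, \(\sigma\) the logistic function, and \(\beta\) a temperature-like hyperparameter. In the self-rewarding setup, the preference pairs themselves come from the model's own judgments rather than from human annotators.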

Training Self-Rewarding Language Models

The study posits that endowing a language model with dual capabilities, generating responses to tasks and appraising the quality of those responses, enables self-alignment. The approach uses Iterative DPO training, beginning with a base pretrained LLM fine-tuned on a limited set of human-annotated data. Each subsequent model iterates through a cycle of creating self-instruction examples, generating candidate responses, and rewarding those responses according to its own judgments. The evaluations are not arbitrary: they follow an explicit rubric covering the relevance, completeness, perspective, and overall quality of each response.
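To make the loop concrete, below is a minimal Python sketch of one self-rewarding iteration, under stated assumptions: `generate`, `judge`, and `dpo_update` are hypothetical callables standing in for the model's generation, its LLM-as-a-Judge scoring, and a DPO training step, and pairing the highest-scoring against the lowest-scoring candidate mirrors the paper's preference-pair construction without reproducing its exact implementation.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interfaces standing in for the real model and trainer:
#   generate(prompt)        -> a sampled completion from the current model
#   judge(prompt, response) -> the model's own LLM-as-a-Judge score (0-5)
#   dpo_update(pairs)       -> one round of DPO training on preference pairs
Generate = Callable[[str], str]
Judge = Callable[[str, str], float]
DpoUpdate = Callable[[List[Tuple[str, str, str]]], None]


def self_rewarding_iteration(
    generate: Generate,
    judge: Judge,
    dpo_update: DpoUpdate,
    seed_prompts: List[str],
    num_new_prompts: int = 8,
    candidates_per_prompt: int = 4,
) -> List[Tuple[str, str, str]]:
    """One iteration: create new prompts, sample candidate responses,
    score them with the model's own judgments, build (chosen, rejected)
    pairs, and run a DPO update on those pairs."""
    # 1. Self-instruction creation: few-shot prompt the model to write new tasks.
    new_prompts = []
    for _ in range(num_new_prompts):
        shots = "\n".join(random.sample(seed_prompts, k=min(3, len(seed_prompts))))
        new_prompts.append(generate(f"Example prompts:\n{shots}\nWrite a new prompt:"))

    # 2. Candidate generation and self-judging.
    preference_pairs: List[Tuple[str, str, str]] = []
    for prompt in new_prompts:
        candidates = [generate(prompt) for _ in range(candidates_per_prompt)]
        scores = [judge(prompt, r) for r in candidates]
        ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0])
        (low, worst), (high, best) = ranked[0], ranked[-1]
        # 3. Preference pair: highest-scoring response is "chosen",
        #    lowest-scoring is "rejected"; skip prompts where all scores tie.
        if high > low:
            preference_pairs.append((prompt, best, worst))

    # 4. DPO training on the self-generated preference data.
    dpo_update(preference_pairs)
    return preference_pairs
```

In the paper, each iteration trains a new model starting from the previous iteration's weights, which is what allows both the responses and the self-assigned rewards to improve over successive rounds.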

Methodology Insights

In a series of experiments using Llama 2 70B as the base model, the researchers demonstrate gains in instruction-following performance as well as in the model's ability to evaluate and reward its own responses. Through self-generated feedback and Iterative DPO, each model surpassed its predecessor's capabilities, yielding increasingly capable LLMs. Notably, on AlpacaEval 2.0 the self-rewarded models outperform several existing systems trained on larger, proprietary datasets, including Claude 2, Gemini Pro, and GPT-4 0613.
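A practical detail in this setup is turning the judge's free-form output into a scalar reward. The paper's LLM-as-a-Judge prompt uses an additive 5-point rubric and asks the model to conclude with a line of the form "Score: <total points>"; the snippet below is an assumed, minimal parser for that convention (the regex and the averaging over several sampled judgments are illustrative choices, not the authors' code).

```python
import re
import statistics
from typing import Iterable, Optional

# The judge is asked to report a total score between 0 and 5.
SCORE_RE = re.compile(r"Score:\s*([0-5](?:\.\d+)?)")


def parse_judge_score(judge_output: str) -> Optional[float]:
    """Extract the final 'Score: <total points>' value from a judge generation;
    return None if the output does not follow the requested format."""
    matches = SCORE_RE.findall(judge_output)
    return float(matches[-1]) if matches else None


def reward_for_response(judge_outputs: Iterable[str]) -> Optional[float]:
    """Average the parsed scores from several sampled judge generations,
    discarding malformed ones, to reduce variance in the self-assigned reward."""
    scores = [s for s in (parse_judge_score(o) for o in judge_outputs) if s is not None]
    return statistics.mean(scores) if scores else None


# Example: two well-formed judgments and one malformed one.
outputs = [
    "The response is relevant and complete. Score: 4",
    "Helpful but misses one sub-question. Score: 3",
    "I cannot evaluate this.",
]
print(reward_for_response(outputs))  # 3.5
```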

Implications and Future Exploration

Early findings suggest that Self-Rewarding Language Models could redefine how LLMs are trained. By facilitating self-improvement, models may bypass the limits imposed by human-derived reward systems, and the iterative process could enable continued quality gains beyond the ceiling set by human feedback. However, when self-rewarding saturates, what its safety implications are, and how it fares under broader evaluation have yet to be assessed, making these findings preliminary yet promising avenues for future research.
