We posit that to achieve superhuman agents, future models require superhuman feedback to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is still much left to explore, this work opens the door to the possibility of models that can continually improve along both axes.
The study introduces Self-Rewarding Language Models as a framework where LLMs generate responses and also evaluate their quality for self-improvement.
Through Iterative Direct Preference Optimization, the LLMs use self-instruction and self-judgment to enhance their performance beyond static, human-derived reward models.
In experiments with the Llama 2 70B model, these self-rewarding models exhibited improved instruction-following performance and reward-evaluating ability.
Self-Rewarding Language Models demonstrate the potential to surpass current LLMs trained on extensive proprietary datasets.
The approach may revolutionize LLM training, though further research is needed to explore long-term impacts and safety implications.
Aligning LLMs with human values and preferences is critical for their effective and safe deployment. Typically, LLM training has relied on human preference data to tune models for better task compliance, using approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). However, these methods are limited by the finite scope of available human feedback and the static nature of externally built reward models. A recent study examines Self-Rewarding Language Models, in which an LLM acts as both responder to tasks and judge of its own responses, establishing a framework for self-improving, dynamic reward modeling.
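The judging step can be illustrated with a minimal sketch. This is not the paper's exact prompt or parsing logic; `judge_score` and the `JUDGE_TEMPLATE` wording are hypothetical, and `model` stands in for any callable that maps a prompt string to the model's reply. The sketch assumes the judge is instructed to end its reply with "Score: N".

```python
import re

# Hypothetical judge prompt; the paper uses a more detailed additive rubric.
JUDGE_TEMPLATE = (
    "Review the user's question and the corresponding response, then rate the "
    "response on a 0-5 scale for relevance, completeness, perspective, and "
    "quality.\n\nQuestion: {prompt}\nResponse: {response}\n\n"
    "Conclude with your rating in the form 'Score: <0-5>'."
)

def judge_score(model, prompt: str, response: str) -> float:
    """Ask the model to grade a response to its own prompt; parse the score."""
    reply = model(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", reply)
    # Fall back to 0 when the judge's reply contains no parsable score.
    return float(match.group(1)) if match else 0.0
```

The same model weights thus serve double duty: the reward signal is just another generation task, so it improves whenever the model does.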
The study posits that by endowing a language model with dual capabilities—generating responses to tasks and appraising the quality of those responses—self-alignment can be achieved. The approach involves Iterative DPO training, beginning with a base pretrained LLM supplemented by a limited set of human-annotated data. Each subsequent model iterates through a cycle of creating self-instruction examples and then rewarding them based on the model's own judgments. These evaluations are not arbitrary but follow explicit criteria covering a response's relevance, completeness, perspective, and quality.
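One iteration of this cycle can be sketched as follows. The sketch is illustrative, not the paper's implementation: `generate`, `judge_score`, and `dpo_train` are hypothetical methods standing in for candidate sampling, LLM-as-a-Judge scoring, and the DPO update that produces the next model in the sequence.

```python
def self_rewarding_iteration(model, prompts, n_candidates=4):
    """One self-rewarding cycle: self-judge candidates, then DPO-train (sketch)."""
    preference_pairs = []
    for prompt in prompts:
        # Sample several candidate responses from the current model.
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        # Score each candidate with the same model acting as its own judge.
        scored = [(model.judge_score(prompt, c), c) for c in candidates]
        (low, worst), (high, best) = min(scored), max(scored)
        # Keep only pairs where the judge expresses a clear preference.
        if high > low:
            preference_pairs.append((prompt, best, worst))
    # Run a DPO update on the self-generated preference pairs
    # to produce the next iteration's model.
    return model.dpo_train(preference_pairs)
```

Each pass yields both a better responder and, because judging is itself a generation task, a better judge, which is what lets the reward signal improve across iterations rather than staying frozen.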
In a series of experiments using the Llama 2 70B model as a base, the researchers demonstrate gains in instruction-following performance as well as in the model's reward-evaluating ability. Through self-generated feedback and Iterative DPO, each subsequent model surpassed its predecessor's capabilities, yielding increasingly capable LLMs. Notably, the performance of these self-rewarding models on AlpacaEval 2.0 surpasses that of existing LLMs trained on larger, proprietary datasets.
Early findings suggest that Self-Rewarding Language Models could redefine how LLMs are trained. By facilitating self-improvement, models may bypass the limits set by human-derived reward systems, and the iterative process potentially enables continuous quality gains beyond the ceiling imposed by human feedback quality. However, whether self-rewarding gains saturate over many iterations, along with safety implications and broader evaluative measures, has yet to be fully assessed, rendering these findings preliminary yet promising avenues for future research.
The CRINGE loss: Learning what language not to model. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8854–8874, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.493. https://aclanthology.org/2023.acl-long.493.
Anthropic. Claude 2. https://www.anthropic.com/index/claude-2.
Benchmarking foundation models with language-model-as-an-examiner. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. https://openreview.net/forum?id=IiRHQ7gvnq.
Unnatural instructions: Tuning language models with (almost) no human labor. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.806. https://aclanthology.org/2023.acl-long.806.
AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b.
Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. https://openreview.net/forum?id=HPuSIXJaa9.
Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
RRHF: Rank responses to align language models with human feedback. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. https://openreview.net/forum?id=EdIGMCHk4l.
Click: Controllable text generation with sequence likelihood contrastive learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 1022–1040, Toronto, Canada, July 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.65. https://aclanthology.org/2023.findings-acl.65.
Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023b. https://openreview.net/forum?id=uccHPGDlao.