Self-Rewarding Language Models

Published Jan 18, 2024 in cs.CL and cs.AI


We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences. These may be bottlenecked by human performance level, and because they are separate and frozen, they cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the model's ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.

Diagram depicting how language models generate rewards for self-improvement in learning processes.


  • The study introduces Self-Rewarding Language Models as a framework where LLMs generate responses and also evaluate their quality for self-improvement.

  • Through Iterative Direct Preference Optimization, the LLMs use self-instruction and self-judgment to improve beyond what static, human-derived reward models allow.

  • In experiments with the Llama 2 70B model, these self-rewarding models showed improved instruction-following performance and improved reward-evaluating ability.

  • Self-Rewarding Language Models show the potential to outperform existing LLMs that were trained on extensive proprietary datasets.

  • The approach could change how LLMs are trained, though further research is needed on its long-term effects and safety implications.


Aligning LLMs with human values and preferences is critical for their effective and safe deployment. Typically, LLM training has used human preference data to tune models for better task compliance, with approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). However, these methods are limited by the finite amount of available human feedback and by the static nature of externally built reward models. This study examines Self-Rewarding Language Models, where the LLM acts both as the responder to tasks and as the judge of its own responses, establishing a framework for self-improving, dynamic reward modeling.

Training Self-Rewarding Language Models

The study posits that giving a language model two capabilities, generating responses to tasks and appraising the quality of those responses, lets it align itself. Training uses Iterative DPO, beginning with a pretrained base LLM together with a limited set of human-annotated data. In each subsequent iteration, the current model creates self-instruction examples and then rewards them based on its own judgments. The evaluations are not arbitrary: they follow explicit criteria covering each response's relevance, completeness, perspective, and quality.
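The cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `generate` and `judge` are hypothetical callables standing in for the same model acting as responder and as LLM-as-a-Judge, and the defaults for `num_candidates` and `num_judgments`, as well as the choice to pair the highest-scored response with the lowest-scored one, are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: model as responder
    judge: Callable[[str, str], float],  # hypothetical: model as judge
    num_candidates: int = 4,             # assumed value, for illustration
    num_judgments: int = 3,              # assumed value, for illustration
) -> Optional[PreferencePair]:
    """Sample candidates, self-score them, and keep the best and worst."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    # Average several judge calls per candidate to reduce scoring noise.
    scores = [
        mean(judge(prompt, c) for _ in range(num_judgments))
        for c in candidates
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    worst = min(range(len(candidates)), key=scores.__getitem__)
    # Equal scores carry no preference signal, so no pair is produced.
    if scores[best] == scores[worst]:
        return None
    return PreferencePair(prompt, candidates[best], candidates[worst])
```

Pairs collected this way over many self-generated prompts would form the preference dataset for the next round of DPO training, after which the newly trained model takes over both roles.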

Methodology Insights

In a series of experiments using Llama 2 70B as the base model, the researchers demonstrate improvements in both instruction-following performance and the model's own reward-evaluating ability. Through self-generated feedback and Iterative DPO, each successive model surpassed its predecessor. Notably, after three iterations the self-rewarded model outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613, which were trained using larger, proprietary data sets.
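For readers unfamiliar with the training objective, the standard DPO loss for a single preference pair can be written as a short function. This is a sketch of the generic DPO formulation rather than code from the study: the argument names are hypothetical, the inputs are assumed to be summed token log-probabilities of each response, `beta=0.1` is an illustrative default, and treating the previous iteration's model as the reference model is an assumption about the iterative setup.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy prefers each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy assigns relatively more probability to the chosen response than to the rejected one.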

Implications and Future Exploration

Early findings suggest that Self-Rewarding Language Models could change how LLMs are trained. By enabling self-improvement, models may bypass the limits imposed by human-derived reward systems, and the iterative process could keep improving quality beyond what human feedback alone supports. However, whether the gains saturate over many iterations, what the safety implications are, and how the models fare under broader evaluation have yet to be fully assessed. The findings are therefore preliminary but promising, and they point to avenues for future research.

