
RLHF Workflow: From Reward Modeling to Online RLHF (2405.07863v3)

Published 13 May 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent LLM literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.

Citations (52)

Summary

  • The paper presents an online iterative RLHF pipeline that integrates human feedback via a Bradley-Terry based reward model for real-time LLM alignment.
  • It introduces a preference modeling strategy that contrasts pairwise responses to capture nuanced human insights while mitigating verbosity bias.
  • The approach employs iterative policy optimization and exploration techniques, achieving superior performance on both conversational and academic benchmarks.

RLHF Workflow: From Reward Modeling to Online RLHF

This essay outlines the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), aimed at enhancing LLMs by integrating human preference signals. The paper provides a practical, reproducible recipe for implementing online iterative RLHF, addressing a significant gap in open-source RLHF projects, which remain predominantly confined to offline learning settings.

Introduction to RLHF

Reinforcement Learning from Human Feedback (RLHF) constitutes a pivotal methodology for aligning LLMs with human values and preferences. Unlike supervised fine-tuning, RLHF incorporates human feedback, offering a more dynamic and iterative approach to training models. RLHF workflows generally involve a policy model π, a preference oracle, and reward maximization strategies using methods like the Bradley-Terry model (Figure 1).

Figure 2: A simplified illustration of reward modeling and online iterative RLHF.
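
For context, the standard formulation behind this pipeline (written here in generic notation that may differ slightly from the paper's) models a human preference between two responses a^1 and a^2 to a prompt x with the Bradley-Terry model, and optimizes the policy against the learned reward subject to a KL penalty toward the initial policy π_0 with coefficient η:

```latex
% Bradley-Terry probability that response a^1 is preferred over a^2 for prompt x
P(a^1 \succ a^2 \mid x)
  = \frac{\exp\big(r(x, a^1)\big)}{\exp\big(r(x, a^1)\big) + \exp\big(r(x, a^2)\big)}
  = \sigma\big(r(x, a^1) - r(x, a^2)\big)

% KL-regularized objective: maximize reward while staying close to the
% initial policy \pi_0; \eta controls the strength of the KL penalty
J(\pi) = \mathbb{E}_{x \sim d_0,\; a \sim \pi(\cdot \mid x)}\big[ r(x, a) \big]
         - \eta\, \mathbb{E}_{x \sim d_0}\big[ \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x)\big) \big]
```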

Reward Modeling Approach

Preference Datasets

A diverse set of open-source datasets serves as the foundation for reward and preference modeling, including HH-RLHF, SHP, HelpSteer, UltraFeedback, among others. These datasets provide varied contexts and human-annotated preferences essential for constructing robust models.

Bradley-Terry Reward Model

The reward model is implemented as the maximum likelihood estimator (MLE) of the Bradley-Terry (BT) model. Trained on preference data, the BT reward model approximates human feedback signals effectively, albeit with limitations such as verbosity bias (Figure 3).

Figure 1: Illustration of the Bradley-Terry (BT) model and preference model.
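
Concretely, the BT MLE reduces to a negative log-sigmoid loss on the score gap between the chosen and rejected responses. The sketch below illustrates that loss; it is not the authors' training code, and the tensor shapes are assumed:

```python
import torch
import torch.nn.functional as F

def bt_reward_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry MLE loss for a scalar reward model.

    score_chosen / score_rejected hold the reward head outputs r(x, a_chosen)
    and r(x, a_rejected) for a batch of preference pairs. Minimizing this loss
    maximizes the log-likelihood sigma(r_chosen - r_rejected) of the observed
    preferences.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Illustrative usage with a batch of four preference pairs:
loss = bt_reward_loss(torch.tensor([1.2, 0.3, 2.0, -0.5]),
                      torch.tensor([0.7, 0.9, 1.5, -1.0]))
```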

Preference Model Construction

Unlike the scalar reward model, the preference model evaluates pairwise preferences between responses, providing more nuanced insight into human feedback. It is trained by contrasting pairs of alternative outputs and predicting which one a human would prefer.

Figure 4: The training record of reward modeling. From the left to right, we present the records of training loss, gradient norm, and the learning rate, respectively.
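
To make the contrast with the scalar BT model concrete, the sketch below shows one hypothetical way a pairwise preference model can be set up: it conditions on the prompt and both candidate responses jointly and predicts the probability that the first is preferred. The backbone, input format, and loss are illustrative assumptions, not the released architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwisePreferenceModel(nn.Module):
    """Toy pairwise preference model. Unlike a BT reward model, which scores each
    response independently, it encodes (prompt, response_a, response_b) as one
    sequence and predicts P(response_a preferred). `backbone` stands in for a
    transformer encoder that returns a pooled hidden state per sequence."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, pair_input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(pair_input_ids)          # [batch, hidden_size]
        return torch.sigmoid(self.classifier(hidden))   # P(A preferred), in (0, 1)

def preference_loss(p_a_preferred: torch.Tensor, label_a_preferred: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy against the annotated preference label (1 if A is preferred).
    return F.binary_cross_entropy(p_a_preferred.squeeze(-1), label_a_preferred.float())
```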

Online Iterative RLHF Framework

Iterative Policy Optimization

Iterative policy optimization leverages data collected during training iterations to refine the model progressively (Figure 5). Unlike static datasets used in offline RLHF, this method dynamically integrates new data, improving out-of-distribution generalization and reducing over-optimization risks.
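
Figure 5 describes the update step as iterative direct preference learning. As a minimal sketch of what one such update could look like, the standard DPO loss is shown below, assuming a reference policy frozen at the start of each iteration; the beta value is illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor, policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimization loss for one batch of preference pairs.

    Each *_logp_* tensor holds the summed log-probability of a response under the
    current policy or the frozen reference model. beta scales the implicit KL
    regularization toward the reference; 0.1 is an illustrative value.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```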

Exploration Strategies

The exploration strategy involves generating diverse model responses and leveraging techniques like rejection sampling and temperature adjustments to explore various potential outputs. This method, combined with strategic policy optimization, facilitates efficient navigation of the model's action space (Figure 5).

Figure 5: Illustration of our implementation of iterative direct preference learning. In iteration t=1, the historical dataset is empty, and the resulting policy model π₁ is the same as its initialization.
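
One common way to instantiate this exploration step is best-of-n / worst-of-n pair construction under temperature sampling, as sketched below; the function names, sampling parameters, and pair-selection rule are illustrative assumptions rather than the released implementation:

```python
def collect_preference_pairs(prompts, generate, reward_fn, n_samples=8, temperature=1.0):
    """For each prompt, sample n candidate responses and keep the highest- and
    lowest-reward ones as a (chosen, rejected) pair for the next training iteration.

    generate(prompt, temperature) -> str and reward_fn(prompt, response) -> float
    are stand-ins for the current policy's sampler and the trained reward or
    preference model; both names are hypothetical.
    """
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt, temperature=temperature) for _ in range(n_samples)]
        scored = sorted(candidates, key=lambda resp: reward_fn(prompt, resp))
        pairs.append({"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]})
    return pairs
```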

Evaluation

Conversational and Academic Benchmarks

The resulting model demonstrates superior performance on conversational benchmarks like AlpacaEval-2, MT-Bench, and Chat-Arena-Hard. Additionally, academic benchmarks reveal no significant regression in reasoning capabilities, suggesting iterative RLHF does not adversely affect LLM performance in intellectual tasks (Figure 6).

Figure 6: Evaluation of our models and LLaMA-3-8B-inst.

Length Bias Mitigation

A length penalty incorporated in reward modeling effectively mitigates verbosity bias, yielding concise responses without compromising alignment quality. This adjustment is validated through comparative analysis with other models (Figure 3).

Figure 3: The heatmap of the Pearson correlation coefficients between reward and response length. For each prompt, we use the SFT model to generate 16 responses and compute the coefficient.
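
The length penalty can be read as subtracting a term proportional to response length from the learned reward score, as in the sketch below; the character-level length measure and the coefficient are illustrative assumptions, not the paper's tuned setting:

```python
def length_penalized_reward(reward: float, response: str, penalty_coef: float = 0.001) -> float:
    """Subtract a penalty proportional to response length from the raw reward
    to counteract verbosity bias. Length is measured in characters here and
    penalty_coef is an illustrative value."""
    return reward - penalty_coef * len(response)
```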

Conclusion

The workflow of Online Iterative RLHF provides a comprehensive methodology for LLM alignment, combining theoretical insights with practical implementation strategies. Further work on reward modeling nuances and exploration techniques can improve model efficiency and performance, contributing to more robust and human-aligned AI systems. Future directions include refining preference signal modeling and developing more effective exploration strategies beyond rejection sampling.
