Reinforcement Learning from Human Feedback with Active Queries

(arXiv:2402.09401)
Published Feb 14, 2024 in cs.LG, cs.AI, cs.CL, math.OC, and stat.ML

Abstract

Aligning large language models (LLMs) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of feature space and $\Delta$ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while only making about half of the queries for human preference, matches the performance of the state-of-the-art DPO method.

Figure: graph comparing DPO, DPO-AQ, and ADPO, showing faster convergence and higher performance for ADPO.

Overview

  • The paper introduces Active Proximal Policy Optimization (APPO) and Active Direct Preference Optimization (ADPO), which leverage active learning to reduce the number of human feedback queries in Reinforcement Learning from Human Feedback (RLHF).

  • APPO addresses RLHF as a contextual dueling bandit problem, using an uncertainty-aware querying mechanism for efficient feedback gathering.

  • ADPO is shown to achieve comparable or superior model performance while requiring about half the human preference queries of standard DPO.

  • The research establishes a regret bound for APPO independent of action space size and demonstrates ADPO's efficiency and applicability across various datasets.

Advancing Query-Efficient Reinforcement Learning from Human Feedback

Introduction to the Challenge

Recent advances in LLMs have underscored the importance of aligning these models with human preferences to achieve strong performance across a broad spectrum of tasks. This alignment is traditionally achieved through Reinforcement Learning from Human Feedback (RLHF). While RLHF has demonstrated promising results, its practical implementation requires collecting large amounts of human-labeled preference data, which is costly and hard to scale. To address this gap, the paper introduces an approach based on active learning that significantly reduces the number of human feedback queries while maintaining, or even enhancing, model performance.

Active-Query-Based Reinforcement Learning

The paper proposes an Active Proximal Policy Optimization (APPO) algorithm that incorporates principles of active learning into the RLHF paradigm. APPO formulates RLHF as a contextual dueling bandit, a framework in which a learner repeatedly selects pairs of actions and receives preference feedback between them. The key innovation of APPO is its uncertainty-aware mechanism for querying human preferences, which selects only those pairs likely to provide the most informative feedback for the model. The mechanism combines an optimistic estimator of the reward gap between actions with a threshold on observation uncertainty, allowing a significant reduction in the amount of human feedback required.
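
To make the querying rule concrete, the following is a minimal sketch of how such an uncertainty-aware query decision could look for a linear contextual dueling bandit. It is not the authors' implementation: the class and parameter names, the exact confidence-width form, and the logistic refit are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's implementation) of an uncertainty-aware query
# rule in a linear contextual dueling bandit.  Names and the exact form of the
# confidence width / threshold are illustrative assumptions.

class ActiveDuelingLearner:
    def __init__(self, dim, beta=1.0, gamma=0.1, lam=1.0):
        self.beta = beta                       # confidence-width multiplier
        self.gamma = gamma                     # uncertainty threshold for querying
        self.Sigma = lam * np.eye(dim)         # regularized design matrix of queried pairs
        self.theta = np.zeros(dim)             # current reward-parameter estimate
        self.pairs, self.labels = [], []       # feature diffs and labels of queried pairs

    def uncertainty(self, phi_diff):
        """Confidence width of the estimated reward gap for a response pair."""
        return self.beta * np.sqrt(phi_diff @ np.linalg.solve(self.Sigma, phi_diff))

    def step(self, phi_diff, ask_human):
        """Decide whether to query a human label for phi(x, a1) - phi(x, a2)."""
        if self.uncertainty(phi_diff) > self.gamma:
            # Model cannot confidently rank the pair: spend a human query.
            label = ask_human()                # 1 if a1 preferred, else 0
            self.pairs.append(phi_diff)
            self.labels.append(label)
            self.Sigma += np.outer(phi_diff, phi_diff)
            self._refit()
            return label, True
        # Model is confident: fall back on its own predicted preference.
        return int(self.theta @ phi_diff > 0), False

    def _refit(self, lr=0.5, steps=200):
        """Logistic-regression refit of theta on all queried pairs (Bradley-Terry model)."""
        X, y = np.asarray(self.pairs), np.asarray(self.labels)
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-X @ self.theta))
            self.theta += lr * X.T @ (y - p) / len(y)
```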

The authors further extend APPO into a practical variant based on direct preference optimization (DPO), called Active Direct Preference Optimization (ADPO). ADPO fine-tunes LLMs by querying human preferences only when the model is uncertain. Notably, the paper reports that ADPO achieves comparable or superior performance to state-of-the-art methods while requiring about half the number of human queries.
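
Likewise, the sketch below illustrates how a DPO-style update could be combined with active querying: human labels are requested only for pairs where the policy's implicit reward margin is small, and the model's own ranking is used otherwise. The uncertainty measure, threshold, and function names are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of an actively-queried DPO-style update.  The implicit-reward
# margin as the uncertainty measure and the pseudo-label fallback are
# illustrative assumptions.

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss from summed log-probs of chosen (w) and rejected (l) responses."""
    return -F.logsigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l))).mean()

def active_dpo_step(pol_a, pol_b, ref_a, ref_b, ask_human, beta=0.1, tau=1.0):
    """One preference step with active querying.

    pol_a/pol_b, ref_a/ref_b: summed log-probs of the two candidate responses
    under the policy and the frozen reference model (shape: [batch]).
    ask_human: callable taking a boolean mask and returning 0/1 preferences
    (1 = response a preferred) for the masked examples only.
    tau: threshold on the implicit reward margin below which we query a human.
    """
    # Implicit reward margin between the two responses under the current policy.
    margin = beta * ((pol_a - ref_a) - (pol_b - ref_b))
    uncertain = margin.abs() < tau                 # small margin -> model is unsure

    # Human labels for the uncertain pairs, model's own ranking for the rest.
    labels = (margin > 0).float()
    if uncertain.any():
        labels[uncertain] = ask_human(uncertain).float()

    a_chosen = labels.bool()
    pol_w = torch.where(a_chosen, pol_a, pol_b)
    pol_l = torch.where(a_chosen, pol_b, pol_a)
    ref_w = torch.where(a_chosen, ref_a, ref_b)
    ref_l = torch.where(a_chosen, ref_b, ref_a)

    loss = dpo_loss(pol_w, pol_l, ref_w, ref_l, beta)
    return loss, uncertain.float().mean()          # loss and fraction of pairs queried
```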

Theoretical Insights and Empirical Validation

On the theoretical side, the paper establishes a regret bound for APPO that is independent of the size of the action space, meaning the algorithm's performance does not degrade as the number of candidate actions grows. Combined with the reduced query complexity, this result underlines the algorithm's efficiency and practical applicability.
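
Concretely, the abstract states a regret bound of $\tilde{O}(d^2/\Delta)$ and a query complexity of $\tilde{O}(d^2/\Delta^2)$ for APPO, where $d$ is the feature dimension and $\Delta$ is the sub-optimality gap over all contexts; notably, neither bound depends on the number of candidate actions.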

Empirical results reinforce the theoretical findings, showing that ADPO matches or outperforms standard DPO in aligning LLMs with human preferences while roughly halving the demand for human-labeled data. Experiments across several datasets demonstrate the versatility and robustness of ADPO, suggesting applicability to a wide range of LLM fine-tuning scenarios.

Implications and Future Directions

This research opens up new avenues for efficiently and effectively aligning LLMs with human preferences. By introducing an active-querying mechanism, the paper sets the stage for more scalable and cost-effective methods in the field of RLHF. The approach not only mitigates the existing challenges associated with collecting vast amounts of human-labeled data but also enhances the feasibility of deploying advanced LLMs in real-world applications.

Looking forward, while this study provides a solid foundation, the authors identify further theoretical analysis of ADPO as a direction for future work. Such investigations could yield more refined active-learning strategies for RLHF and further ease the development and deployment of LLMs aligned with human preferences.

In summary, this paper presents a significant leap forward in the quest for efficient reinforcement learning from human feedback by leveraging the power of active queries. Its contributions lay the groundwork for future advancements in the field, promising a new era of more accessible and efficient LLMs tuned to human preferences.
