Reinforcement Learning from Human Feedback with Active Queries

(arXiv:2402.09401)
Published Feb 14, 2024 in cs.LG, cs.AI, cs.CL, math.OC, and stat.ML

Abstract

Aligning large language models (LLMs) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of feature space and $\Delta$ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while only making about half of the queries for human preference, matches the performance of the state-of-the-art DPO method.

Figure: graph comparing DPO, DPO-AQ, and ADPO, showing faster convergence and higher performance for ADPO.

Overview

  • The paper introduces Active Proximal Policy Optimization (APPO) and Active Direct Preference Optimization (ADPO), which leverage active learning to reduce the number of human feedback queries in Reinforcement Learning from Human Feedback (RLHF).

  • APPO addresses RLHF as a contextual dueling bandit problem, using an uncertainty-aware querying mechanism for efficient feedback gathering.

  • ADPO is shown to achieve comparable or superior model performance while requiring about half the human preference queries of standard DPO.

  • The research establishes a regret bound for APPO independent of action space size and demonstrates ADPO's efficiency and applicability across various datasets.

Advancing Query-Efficient Reinforcement Learning from Human Feedback

Introduction to the Challenge

Recent advances in LLMs have underscored the importance of aligning these models with human preferences to achieve strong performance across a broad spectrum of tasks. This alignment is traditionally achieved through Reinforcement Learning from Human Feedback (RLHF). While RLHF has demonstrated promising results, its practical implementation requires collecting large amounts of human-labeled preference data, which is costly and hard to scale. To address this gap, the paper introduces an approach based on active learning that significantly reduces the number of human feedback queries while maintaining, or even enhancing, model performance.

Active-Query-Based Reinforcement Learning

The paper proposes an Active Proximal Policy Optimization (APPO) algorithm that incorporates principles of active learning into the RLHF paradigm. APPO formulates RLHF as a contextual dueling bandit, a framework in which a learner repeatedly selects pairs of actions and receives preference feedback between them. The key innovation of APPO is its uncertainty-aware mechanism for querying human preferences, which selects only those pairs likely to provide the most informative feedback for the model. The mechanism combines an optimistic estimator of the reward gap between actions with a threshold on observation uncertainty, allowing a significant reduction in the amount of human feedback required.
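
To make the querying rule concrete, the following is a minimal sketch of how such an uncertainty-aware query decision could look for a linear contextual dueling bandit. It is not the authors' implementation: the class and parameter names, the exact confidence-width form, and the logistic refit are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's implementation) of an uncertainty-aware query
# rule in a linear contextual dueling bandit.  Names and the exact form of the
# confidence width / threshold are illustrative assumptions.

class ActiveDuelingLearner:
    def __init__(self, dim, beta=1.0, gamma=0.1, lam=1.0):
        self.beta = beta                       # confidence-width multiplier
        self.gamma = gamma                     # uncertainty threshold for querying
        self.Sigma = lam * np.eye(dim)         # regularized design matrix of queried pairs
        self.theta = np.zeros(dim)             # current reward-parameter estimate
        self.pairs, self.labels = [], []       # feature diffs and labels of queried pairs

    def uncertainty(self, phi_diff):
        """Confidence width of the estimated reward gap for a response pair."""
        return self.beta * np.sqrt(phi_diff @ np.linalg.solve(self.Sigma, phi_diff))

    def step(self, phi_diff, ask_human):
        """Decide whether to query a human label for phi(x, a1) - phi(x, a2)."""
        if self.uncertainty(phi_diff) > self.gamma:
            # Model cannot confidently rank the pair: spend a human query.
            label = ask_human()                # 1 if a1 preferred, else 0
            self.pairs.append(phi_diff)
            self.labels.append(label)
            self.Sigma += np.outer(phi_diff, phi_diff)
            self._refit()
            return label, True
        # Model is confident: fall back on its own predicted preference.
        return int(self.theta @ phi_diff > 0), False

    def _refit(self, lr=0.5, steps=200):
        """Logistic-regression refit of theta on all queried pairs (Bradley-Terry model)."""
        X, y = np.asarray(self.pairs), np.asarray(self.labels)
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-X @ self.theta))
            self.theta += lr * X.T @ (y - p) / len(y)
```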

The authors further extend APPO into a practical variant based on direct preference optimization (DPO), called Active Direct Preference Optimization (ADPO). ADPO fine-tunes LLMs by querying human preferences only when the model is uncertain. Notably, the paper reports that ADPO achieves comparable or superior performance to state-of-the-art methods while requiring about half the number of human queries.
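
Likewise, the sketch below illustrates how a DPO-style update could be combined with active querying: human labels are requested only for pairs where the policy's implicit reward margin is small, and the model's own ranking is used otherwise. The uncertainty measure, threshold, and function names are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of an actively-queried DPO-style update.  The implicit-reward
# margin as the uncertainty measure and the pseudo-label fallback are
# illustrative assumptions.

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss from summed log-probs of chosen (w) and rejected (l) responses."""
    return -F.logsigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l))).mean()

def active_dpo_step(pol_a, pol_b, ref_a, ref_b, ask_human, beta=0.1, tau=1.0):
    """One preference step with active querying.

    pol_a/pol_b, ref_a/ref_b: summed log-probs of the two candidate responses
    under the policy and the frozen reference model (shape: [batch]).
    ask_human: callable taking a boolean mask and returning 0/1 preferences
    (1 = response a preferred) for the masked examples only.
    tau: threshold on the implicit reward margin below which we query a human.
    """
    # Implicit reward margin between the two responses under the current policy.
    margin = beta * ((pol_a - ref_a) - (pol_b - ref_b))
    uncertain = margin.abs() < tau                 # small margin -> model is unsure

    # Human labels for the uncertain pairs, model's own ranking for the rest.
    labels = (margin > 0).float()
    if uncertain.any():
        labels[uncertain] = ask_human(uncertain).float()

    a_chosen = labels.bool()
    pol_w = torch.where(a_chosen, pol_a, pol_b)
    pol_l = torch.where(a_chosen, pol_b, pol_a)
    ref_w = torch.where(a_chosen, ref_a, ref_b)
    ref_l = torch.where(a_chosen, ref_b, ref_a)

    loss = dpo_loss(pol_w, pol_l, ref_w, ref_l, beta)
    return loss, uncertain.float().mean()          # loss and fraction of pairs queried
```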

Theoretical Insights and Empirical Validation

On the theoretical side, the paper establishes a regret bound for APPO that is independent of the size of the action space, meaning the algorithm's performance does not degrade as the number of candidate actions grows. Combined with the reduced query complexity, this result underlines the algorithm's efficiency and practical applicability.
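
Concretely, the abstract states a regret bound of $\tilde{O}(d^2/\Delta)$ and a query complexity of $\tilde{O}(d^2/\Delta^2)$ for APPO, where $d$ is the feature dimension and $\Delta$ is the sub-optimality gap over all contexts; notably, neither bound depends on the number of candidate actions.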

Empirical results reinforce the theoretical findings, showing that ADPO matches or outperforms standard DPO in aligning LLMs with human preferences while roughly halving the demand for human-labeled data. Experiments across several datasets demonstrate the versatility and robustness of ADPO, suggesting applicability to a wide range of LLM fine-tuning scenarios.

Implications and Future Directions

This research opens up new avenues for efficiently and effectively aligning LLMs with human preferences. By introducing an active-querying mechanism, the paper sets the stage for more scalable and cost-effective methods in the field of RLHF. The approach not only mitigates the existing challenges associated with collecting vast amounts of human-labeled data but also enhances the feasibility of deploying advanced LLMs in real-world applications.

Looking forward, while this study provides a solid foundation, the authors identify further theoretical analysis of ADPO as a direction for future work. Such investigations could yield more refined active-learning strategies for RLHF and further ease the development and deployment of LLMs aligned with human preferences.

In summary, this paper presents a significant leap forward in the quest for efficient reinforcement learning from human feedback by leveraging the power of active queries. Its contributions lay the groundwork for future advancements in the field, promising a new era of more accessible and efficient LLMs tuned to human preferences.
