Robust Reinforcement Learning from Corrupted Human Feedback

(arXiv:2406.15568)
Published Jun 21, 2024 in cs.LG

Abstract

Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons, e.g., personal bias, context ambiguity, lack of training, etc., human annotators may give incorrect or inconsistent preference labels. To tackle this challenge, we propose a robust RLHF approach -- $R^3M$, which models potentially corrupted preference labels as sparse outliers. Accordingly, we formulate the robust reward learning as an $\ell_1$-regularized maximum likelihood estimation problem. Computationally, we develop an efficient alternating optimization algorithm, which only incurs negligible computational overhead compared with the standard RLHF approach. Theoretically, we prove that under proper regularity conditions, $R^3M$ can consistently learn the underlying reward and identify outliers, provided that the number of outlier labels scales sublinearly with the preference sample size. Furthermore, we remark that $R^3M$ is versatile and can be extended to various preference optimization methods, including direct preference optimization (DPO). Our experiments on robotic control and natural language generation with LLMs show that $R^3M$ improves robustness of the reward against several types of perturbations to the preference data.

Overview

  • The paper addresses the challenge of corrupted human feedback in Reinforcement Learning from Human Feedback (RLHF) by proposing a method called R³M, which models corrupted preference labels as sparse outliers and adapts the Bradley-Terry model.

  • R³M utilizes an $\ell_1$-regularized maximum likelihood estimation framework to jointly learn the reward and identify outliers, ensuring computational efficiency and robustness against corrupted data.

  • Extensive experiments on robotic control and natural language generation tasks demonstrate that R³M outperforms standard RLHF methods, even under high noise conditions, by effectively handling corrupted preferences.

Robust Reinforcement Learning from Corrupted Human Feedback

In this paper, Bukharin et al. address a significant challenge in Reinforcement Learning from Human Feedback (RLHF): human evaluators may provide incorrect or inconsistent preference labels. This problem arises from factors such as personal bias, context ambiguity, and lack of training, among others.

RLHF aligns AI systems with human preferences, typically by learning a reward model from human feedback. An inherent issue with this approach is that preference labels can be corrupted by erroneous or malicious annotator behavior. For example, in a robotic control task, an untrained annotator might prefer aggressive actions that yield higher rewards but compromise safety. Addressing such corruption is pivotal for ensuring the robustness and reliability of AI systems trained using RLHF.

Key Contributions

The authors propose a novel method, R³M (Robust Reward Modeling for RLHF), designed to withstand corrupted human feedback. The method models corrupted preference labels as sparse outliers and adapts the Bradley-Terry (BT) model to incorporate instance-specific perturbations. The core of R³M is an $\ell_1$-regularized maximum likelihood estimation framework that jointly learns the reward and identifies potential outliers.
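
Concretely, the objective described above can be written schematically as follows; the exact parametrization and scaling of the perturbation term follow the paper, so this should be read as a sketch rather than the authors' precise formulation:

$$\min_{\theta,\,\delta}\; -\sum_{i=1}^{n} \log \sigma\big(r_\theta(x_i, y_i^{w}) - r_\theta(x_i, y_i^{l}) + \delta_i\big) \;+\; \lambda \lVert \delta \rVert_1$$

Here $(y_i^{w}, y_i^{l})$ are the preferred and rejected responses for prompt $x_i$, $\sigma$ is the logistic function, and a nonzero $\delta_i$ flags the $i$-th preference label as a likely outlier; setting all $\delta_i = 0$ recovers the standard Bradley-Terry reward-learning loss.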

Core Theoretical Insights

  1. Consistency Guarantees: The authors provide theoretical proof that R³M can consistently recover the underlying true reward function while identifying the outliers. This is conditional on the number of outliers scaling sublinearly with the sample size, underpinning the robustness of R³M.
  2. Sparse Perturbations: The $\ell_1$-regularization encourages sparsity in the perturbation factors, effectively identifying and marginalizing the impact of corrupted preference labels.
  3. Computational Efficiency: Using an alternating optimization algorithm, R³M iteratively updates the reward parameters and the perturbation factors (see the sketch below). The approach incurs negligible computational overhead compared to standard RLHF.
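
To make the alternating scheme concrete, the following PyTorch sketch alternates a closed-form update of the per-sample perturbations with a gradient step on the reward parameters. It is a minimal illustration written against the schematic objective above; the function and variable names (delta_closed_form, robust_reward_step, lam) are illustrative, not taken from the authors' code.

```python
import math
import torch
import torch.nn.functional as F

def delta_closed_form(margins: torch.Tensor, lam: float) -> torch.Tensor:
    """Per-sample minimizer of -log sigmoid(m + d) + lam * |d| for 0 < lam < 1.

    Differentiation gives d* = max(0, log((1 - lam) / lam) - m): samples whose
    reward margin m already exceeds the threshold need no correction (d* = 0),
    while strongly contradicted samples receive a large positive perturbation.
    """
    threshold = math.log((1.0 - lam) / lam)
    return torch.clamp(threshold - margins, min=0.0)

def robust_reward_step(reward_model, optimizer, chosen, rejected, lam=0.1):
    # Reward margin of the preferred response over the rejected one.
    margins = reward_model(chosen) - reward_model(rejected)

    # Step 1: update the perturbations with the reward fixed (detached, so no
    # gradient flows through the closed-form delta).
    delta = delta_closed_form(margins.detach(), lam)

    # Step 2: gradient step on the reward parameters with delta fixed. Samples
    # with a large delta sit in the saturated region of the sigmoid, so their
    # gradient contribution is small -- suspected outliers are down-weighted.
    loss = -F.logsigmoid(margins + delta).mean() + lam * delta.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), delta
```

In this sketch, delta doubles as an outlier score: preference pairs whose delta stays near zero are consistent with the learned reward, while persistently large values mark candidate corrupted labels.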

Experimental Results

The robustness and efficacy of R³M are substantiated through extensive experiments across robotic control and natural language generation tasks.

Robotic Control

Experiments conducted on PyBullet environments such as HalfCheetah, Ant, and Hopper demonstrate that R³M outperforms the standard RLHF method under several noise models, including stochastic, myopic, and irrational noise. Notably, even at higher noise intensities, R³M maintains markedly higher normalized returns.
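
For intuition, the stochastic variant of such corruption can be simulated by flipping a random fraction of preference labels before reward learning. The snippet below is a generic illustration of that setup, not the paper's exact noise models, which also include myopic and irrational annotator behaviors that differ in how the preferences are generated:

```python
import random

def corrupt_preferences(pairs, flip_rate=0.2, seed=0):
    """pairs: list of (chosen, rejected) trajectory segments.

    Returns a copy in which a random fraction of labels is flipped,
    mimicking stochastic annotator errors."""
    rng = random.Random(seed)
    return [
        (rejected, chosen) if rng.random() < flip_rate else (chosen, rejected)
        for chosen, rejected in pairs
    ]
```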

Natural Language Generation

In natural language tasks, R³M is extended to Direct Preference Optimization (DPO). Evaluation on dialogue and summarization tasks shows that R³M-DPO consistently outperforms DPO and other baselines, including SLiC-HF, IPO, and KTO, improving both win rates and winning scores against the instruction-tuned reference models.
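
Because DPO's implicit reward is the scaled log-ratio between the trained policy and a frozen reference model, the same perturb-and-penalize idea carries over directly. The sketch below shows one plausible form of such a loss; the exact R³M-DPO objective follows the paper, and the names here (r3m_style_dpo_loss, delta, lam) are illustrative assumptions:

```python
import torch.nn.functional as F

def r3m_style_dpo_loss(policy_logratios, ref_logratios, delta, beta=0.1, lam=0.1):
    """policy_logratios / ref_logratios: per-sample log p(chosen) - log p(rejected)
    under the trained policy and the frozen reference model, respectively."""
    margin = beta * (policy_logratios - ref_logratios)  # implicit DPO reward margin
    # Perturb the margin for suspected corrupted labels and penalize the
    # perturbations with an l1 term, mirroring the reward-learning objective.
    return -F.logsigmoid(margin + delta).mean() + lam * delta.abs().mean()
```

The perturbations delta can be updated with the same alternating scheme sketched earlier, since the subproblem in delta has the identical form.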

An intriguing finding is that popular RLHF datasets like those used for summarization and dialogue may contain noisy human preferences. This is evidenced by the improved performance of R³M-DPO, suggesting that R³M's corruption-robust approach is beneficial even for standard datasets.

Implications and Future Directions

The proposed approach has significant implications for the future development and deployment of RLHF systems:

  1. Enhanced Policy Learning: By effectively handling corrupted preferences, R³M can lead to more reliable policy learning, critical for applications such as robotic control, automated moderation, and AI-assisted dialogue systems.
  2. Adopting Robust Techniques in RLHF: The success of R³M suggests an increased need for integrating robust statistical techniques into RLHF to account for potential human biases and errors.
  3. Extension to Broader AI Applications: While this work focuses on specific tasks, the underlying principles of modeling and mitigating outliers can be extended to other AI domains where human feedback is pivotal.

Future research could explore the following avenues:

  • Scalability: Extend the approach to handle even larger datasets and more complex AI systems.
  • Non-sparse Corruption Models: Investigate methods to handle non-sparse or densely corrupted preference data.
  • User-friendly Implementations: Develop user-friendly tools and libraries that make it easier for AI practitioners to apply robust RLHF techniques.

In conclusion, this paper offers a comprehensive and practical solution to a pressing issue in RLHF, with the potential to significantly impact the robustness and reliability of AI systems oriented towards human preferences.
