Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs (2406.10216v2)

Published 14 Jun 2024 in cs.CL and cs.AI

Abstract: Reward models trained on human preference data have been proven to effectively align LLMs with human intent within the framework of reinforcement learning from human feedback (RLHF). However, current reward models have limited generalization capabilities to unseen prompts and responses, which can lead to an unexpected phenomenon known as reward over-optimization, resulting in a decline in actual performance due to excessive optimization of rewards. While previous research has advocated for constraining policy optimization, our study introduces a novel approach to enhance the reward model's generalization ability against distribution shifts by regularizing the hidden states. Specifically, we retain the base model's LLM head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities, while concurrently learning a reward head behind the same hidden states. Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and effectively alleviates the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm.

Summary

The paper introduces a regularization strategy that improves the generalization of LLM reward models by regularizing their hidden states.
It trains a reward head and the retained language-model head on the same hidden states, combining the reward loss with text-generation losses based on DPO and SFT principles.
Experiments show higher accuracy on out-of-distribution (OOD) tasks and less reward over-optimization under best-of-n (BoN) sampling and PPO.

An Overview of Regularization Techniques in Hidden States for Generalizable Reward Models in LLMs

The paper "Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs" addresses a significant challenge in reinforcement learning from human feedback (RLHF): reward over-optimization. The problem arises when a reward model, trained to align LLMs with human intent, fails to generalize to unseen prompts and responses. The policy then maximizes the learned reward without actually satisfying human preferences.

The authors propose regularizing the reward model's hidden states to improve its generalization under distribution shift. The method uses a combined loss that preserves the text-generation capability of the hidden states while a reward head is trained on those same hidden states.

Key Technical Contributions

Regularization Methodology: The authors introduce the Generalizable Reward Model (GRM), which retains the base model's LLM head and applies a suite of text-generation losses to it. A reward head is learned on the same hidden states, so text generation and preference learning are trained simultaneously.
Formulation: The regularization terms build on DPO and SFT principles. They are motivated by an adversarial formulation of reward learning and use log-sigmoid transformations, which lets preference learning and text-generation regularization share one training objective.
Experimental Evidence: The regularized reward models outperform conventional reward models on multiple out-of-distribution (OOD) tasks and are less prone to reward over-optimization.
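The contributions above can be illustrated with a minimal sketch of a combined objective. This is not the authors' implementation: the function names, the weighting coefficient `alpha`, the scale `beta`, and the weighted-sum form `(1 - alpha) * reward + alpha * regularizer` are assumptions chosen for illustration. The reward term is the standard Bradley-Terry preference loss, and the two regularizers are plausible log-sigmoid forms of an SFT-style and a DPO-style text-generation loss. Inputs are plain floats standing in for reward-head outputs and sequence log-probabilities from the LM head.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = min(z, 0) - log(1 + exp(-|z|)).
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def reward_loss(r_chosen, r_rejected):
    # Bradley-Terry preference loss on the reward head's scalar outputs.
    return -log_sigmoid(r_chosen - r_rejected)


def sft_regularizer(logp_chosen, beta=1.0):
    # SFT-style term (assumed form): push up the LM head's log-likelihood
    # of the chosen response, passed through a log-sigmoid.
    return -log_sigmoid(beta * logp_chosen)


def dpo_regularizer(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    # Standard DPO loss on LM-head log-probabilities, relative to a
    # reference model's log-probabilities for the same responses.
    margin = (logp_c - ref_logp_c) - (logp_r - ref_logp_r)
    return -log_sigmoid(beta * margin)


def combined_loss(r_chosen, r_rejected, regularizer_value, alpha=0.01):
    # Assumed weighted sum of the reward loss and a text-generation
    # regularizer; alpha is a hypothetical trade-off coefficient.
    return (1.0 - alpha) * reward_loss(r_chosen, r_rejected) + alpha * regularizer_value
```

Because both heads read the same hidden states, gradients from the regularizer constrain the representation the reward head depends on. In a real model the floats here would be tensors produced by a shared transformer backbone.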

Results and Implications

The method substantially alleviates the over-optimization problem in RLHF. In the experiments, GRM achieves higher accuracy on OOD tasks than baseline reward models, especially when the training dataset is small. This suggests that regularizing the hidden states improves the reward model's generalization.
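For concreteness, reward-model accuracy on a preference benchmark is commonly computed as the fraction of pairs in which the chosen response receives a higher reward than the rejected one. The sketch below assumes that definition. The function name and the example scores are made up for illustration and are not results from the paper.

```python
def pairwise_accuracy(scored_pairs):
    """Fraction of (reward_chosen, reward_rejected) pairs ranked correctly."""
    scored_pairs = list(scored_pairs)
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    # A pair counts as correct only if the chosen response scores strictly higher.
    correct = sum(1 for r_chosen, r_rejected in scored_pairs if r_chosen > r_rejected)
    return correct / len(scored_pairs)


# Toy scores (hypothetical): three of the four pairs are ranked correctly.
toy_pairs = [(1.0, 0.0), (0.2, 0.5), (2.0, 1.0), (0.3, 0.1)]
toy_accuracy = pairwise_accuracy(toy_pairs)
```

Evaluating this metric on prompts drawn from a different distribution than the training data gives the OOD accuracy discussed above.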

Furthermore, when tested with best-of-n (BoN) sampling and PPO, two common policy optimization techniques, GRM was more robust to over-optimization than baseline reward models. This indicates that GRM can serve as a more reliable proxy for human preferences in LLM applications.
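Best-of-n sampling can be sketched in a few lines: draw n candidate responses from the policy and keep the one the proxy reward model scores highest. The names `sample_fn` and `reward_fn` are hypothetical stand-ins for a policy's sampler and a learned reward model, and the length-based reward is a toy that only makes the example runnable.

```python
def best_of_n(prompt, sample_fn, reward_fn, n=4):
    # Draw n candidates from the policy, then return the candidate
    # with the highest proxy reward.
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def toy_length_reward(prompt, response):
    # Toy proxy reward (assumption for illustration): longer is "better".
    return len(response)
```

As n grows, the selected response is increasingly one that exploits errors in the proxy reward rather than one humans actually prefer. This is the over-optimization effect that a more generalizable reward model is meant to reduce.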

Limitations and Future Prospects

While the results are promising, the authors acknowledge limitations, in particular that computational constraints prevented experiments on models larger than 10B parameters. Future work could scale the approach to larger models and examine whether actual human-labeled data further improves robustness.

Overall, this paper contributes to a growing body of research on mitigating reward model over-optimization, and it does so by regularizing hidden states rather than constraining policy optimization. The approach improves reward models' generalization and has wider implications for building aligned, robust AI systems, where the reliability of the reward model becomes increasingly critical.