
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision

(2403.09472)
Published Mar 14, 2024 in cs.LG, cs.CL, and cs.AI

Abstract

Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems are therefore upper-bounded by human capabilities. This raises a challenging research question: How can we keep improving the systems once their capabilities surpass those of humans? This paper answers this question in the context of tackling hard reasoning tasks (e.g., level 4-5 MATH problems) by learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), which we term easy-to-hard generalization. Our key insight is that an evaluator (reward model) trained on supervision for easier tasks can be effectively used to score candidate solutions to harder tasks, thereby facilitating easy-to-hard generalization across task difficulties. Based on this insight, we propose a novel approach to scalable alignment, which first trains process-supervised reward models on easy problems (e.g., level 1-3) and then uses them to evaluate policy models on hard problems. We show that such easy-to-hard generalization from evaluators can enable easy-to-hard generalization in generators, either through re-ranking or reinforcement learning (RL). Notably, our process-supervised 7b RL model achieves an accuracy of 34.0% on MATH500, despite only using human supervision on easy problems. Our approach suggests a promising path toward AI systems that advance beyond the frontier of human supervision.

An evaluator trained with process supervision on easier tasks can score solutions to harder ones, improving generation through re-ranking or RL.

Overview

  • This paper investigates the concept of easy-to-hard generalization in AI, proposing a novel strategy for scaling AI's problem-solving capabilities beyond human expertise by using human annotations on simpler tasks to tackle more complex challenges.

  • It highlights the difference in generalization capabilities between generators (policy models) and evaluators, with evaluators, especially process-supervised reward models (PRMs), showing superior performance in guiding generators to solve harder tasks.

  • The study demonstrates that reinforcement learning (RL) techniques, when used to optimize generators against evaluators trained on easier tasks, significantly improve the AI's ability to perform complex reasoning tasks.

  • The paper suggests a future direction for AI that involves refining and extending these models and methods, enabling AI systems to independently navigate and solve problems beyond human-level supervision.

Easy-to-Hard Generalization: Advancing AI Beyond Human-Level Supervision

Introduction to Easy-to-Hard Generalization

AI alignment methodologies currently leverage human-generated demonstrations or judgments, inherently bounding the capabilities of AI systems to human-level expertise. A pivotal question emerges: How can AI systems continue to evolve once they surpass human capabilities? This paper explores the concept of easy-to-hard generalization, focusing on scaling AI's ability to tackle complex reasoning tasks (e.g., level 4-5 MATH problems) with only human annotations on simpler tasks (e.g., level 1-3 MATH problems). Through an innovative approach that employs process-supervised reward models trained on simpler problems to evaluate and guide the solution of more complex tasks, the paper introduces a scalable alignment strategy that shows promise for developing AI systems capable of navigating challenges beyond current human expertise.

Generators and Evaluators: Bridging the Gap

Generators' Easy-to-Hard Generalization

Generators, or policy models, trained solely on simpler tasks exhibit varied performance when confronted with more complex tasks. The study finds that supervised fine-tuning (SFT) consistently outperforms in-context learning (ICL) in generalizing from easy to hard tasks. Data quality also plays a crucial role: high-quality, well-aligned data from simpler tasks enables better generalization. Despite these improvements, a clear performance gap remains between generators trained on the full spectrum of tasks and those limited to easier tasks, highlighting the difficulty of easy-to-hard generalization for generators. A minimal sketch of the implied data split is shown below.
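To make the "train on easy, evaluate on hard" setup concrete, here is a minimal sketch of splitting MATH-style records by difficulty. It assumes records carry a "level" field such as "Level 3"; the field names and parsing are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code): keep level 1-3 problems as the
# human-supervised "easy" split and hold out level 4-5 problems as "hard".

def split_by_difficulty(records, easy_levels=(1, 2, 3)):
    """Partition MATH-style records into easy (supervised) and hard (held-out) sets."""
    easy, hard = [], []
    for r in records:
        level = int(str(r["level"]).split()[-1])  # e.g. "Level 4" -> 4
        (easy if level in easy_levels else hard).append(r)
    return easy, hard

# Toy usage:
records = [
    {"problem": "1+1=?", "solution": "2", "level": "Level 1"},
    {"problem": "Evaluate the sum ...", "solution": "...", "level": "Level 5"},
]
easy_sft_data, hard_eval_data = split_by_difficulty(records)
```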

Evaluators' Superior Easy-to-Hard Generalization

Evaluators, particularly process-supervised reward models (PRMs), demonstrate remarkable easy-to-hard generalization. Used to re-rank sampled solutions (e.g., via weighted voting) or to provide rewards for reinforcement learning (RL), they effectively enhance generator performance on complex tasks. The study also presents a novel Outcome & Process Reward Model (OPRM) that combines the merits of PRMs and traditional outcome reward models, delivering superior performance across tasks. These findings suggest that evaluators can serve as a significant catalyst for generators' easy-to-hard generalization. A sketch of reward-weighted voting follows.
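As a concrete illustration of the re-ranking idea, here is a hedged sketch of reward-weighted voting: each sampled solution adds its evaluator score to the tally of its final answer, and the highest-scoring answer is returned. The `score_fn` argument is a stand-in for a trained reward model (e.g., a PRM whose per-step scores have been aggregated into one solution score); it is an assumption for illustration, not the paper's implementation.

```python
from collections import defaultdict

def weighted_vote(candidates, score_fn):
    """candidates: list of (final_answer, full_solution_text) pairs.

    Returns the answer whose candidate solutions accumulate the highest
    total reward-model score (reward-weighted majority voting).
    """
    totals = defaultdict(float)
    for answer, solution in candidates:
        totals[answer] += score_fn(solution)   # evaluator score for this solution
    return max(totals, key=totals.get)         # answer with the largest summed score

# Toy usage with a placeholder scorer:
candidates = [
    ("42", "step 1 ... step 2 ... so the answer is 42"),
    ("41", "step 1 ... so the answer is 41"),
    ("42", "alternative derivation ... answer 42"),
]
best_answer = weighted_vote(candidates, score_fn=lambda sol: 0.5)  # placeholder reward model
```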

Reinforcement Learning: Harnessing Evaluators for Enhancement

The research moves beyond re-ranking to explore how evaluators can further improve generators through reinforcement learning. By optimizing generators against evaluators trained only on easier tasks, the study shows that RL training with easy-to-hard evaluators achieves notable performance gains. In particular, when process reward models are used as the RL reward signal, generators can surpass models trained on the full data spectrum, including harder tasks. A sketch of turning per-step scores into an RL reward appears below.
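The sketch below shows one plausible way a PRM's per-step scores could be collapsed into the scalar reward an RL loop needs. The min-aggregation and the placeholder `prm_score` function are assumptions made for illustration, not the paper's exact recipe.

```python
from typing import Callable, List

# Minimal sketch (assumed aggregation, not the authors' code): a PRM returns one
# score per reasoning step; taking the minimum penalizes any single bad step.
# Other aggregations (product, mean, last-step score) are also common choices.

def solution_reward(steps: List[str], prm_score: Callable[[List[str]], List[float]]) -> float:
    step_scores = prm_score(steps)          # one score in [0, 1] per reasoning step
    return min(step_scores) if step_scores else 0.0

# In a PPO-style loop, this reward would typically be assigned at the end of the
# sampled solution, alongside the usual KL penalty against the SFT policy.
steps = ["Let x = 3.", "Then x^2 = 9.", "So the answer is 9."]
reward = solution_reward(steps, prm_score=lambda s: [0.9, 0.8, 0.95])  # placeholder PRM
```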

Conclusion and Future Directions

This paper presents a compelling approach to scalable alignment in AI systems, demonstrating the potential for easy-to-hard generalization through the strategic use of process-supervised reward models. By leveraging evaluators trained on simpler tasks, the research outlines a path for AI systems to solve problems beyond the reach of human-level supervision. These advancements hint at a future where AI can independently push the boundaries of knowledge and problem-solving across domains. Future work may refine the models and methods introduced here and extend the approach to a broader range of complex tasks, laying the foundation for AI systems that transcend the current limits of human expertise and supervision.
