Advancing LLM Reasoning Generalists with Preference Trees (2404.02078v1)
Abstract: We introduce Eurus, a suite of LLMs optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning on a comprehensive benchmark of 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins of more than 13.3%. The strong performance of Eurus can be attributed primarily to UltraInteract, our newly curated large-scale, high-quality alignment dataset designed specifically for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and critiques, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks than for general conversations. Inspired by this, we derive a novel reward modeling objective that, together with UltraInteract, leads to a strong reward model.
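The abstract describes the preference-tree structure only at a high level, so a concrete sketch may help. The Python snippet below is a minimal, hypothetical rendering of such a tree and of how pairwise preference data could be read off it; the class names, fields, and pairing rule are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One model action at one turn of the interaction (hypothetical schema)."""
    action: str                       # a reasoning chain or code action
    correct: bool                     # verdict from the environment, e.g. unit tests
    observation: str = ""             # environment feedback / critique on the action
    children: List["Node"] = field(default_factory=list)  # follow-up attempts

@dataclass
class PreferenceTree:
    instruction: str                  # the root instruction of the tree
    turns: List[Node]                 # first-turn candidate actions

def pairwise_data(tree: PreferenceTree):
    """Emit (context, chosen, rejected) triples: at every turn, each correct
    action is paired against each incorrect sibling, conditioned on the
    interaction history accumulated so far."""
    pairs = []

    def walk(siblings: List[Node], history: str):
        good = [n for n in siblings if n.correct]
        bad = [n for n in siblings if not n.correct]
        for g in good:
            for b in bad:
                pairs.append((history, g.action, b.action))
        for n in siblings:
            walk(n.children, f"{history}{n.action}\n{n.observation}\n")

    walk(tree.turns, tree.instruction + "\n")
    return pairs
```

The abstract likewise mentions a derived reward modeling objective without stating its form. A plausible reading, sketched below under stated assumptions, is a standard Bradley-Terry pairwise term augmented with absolute terms that push chosen rewards up and rejected rewards down, rather than only widening their gap; the exact combination here is an assumption made for illustration.

```python
import torch.nn.functional as F

def reward_loss(r_chosen, r_rejected):
    # r_chosen / r_rejected: tensors of scalar rewards for preferred and
    # dispreferred responses in a batch of preference pairs.
    # Bradley-Terry pairwise term: prefer chosen over rejected (standard).
    l_bt = -F.logsigmoid(r_chosen - r_rejected)
    # Absolute-value terms (assumed form): raise r_chosen above zero and
    # push r_rejected below zero, not just their difference.
    l_abs = -F.logsigmoid(r_chosen) - F.logsigmoid(-r_rejected)
    return (l_bt + l_abs).mean()
```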