Advancing LLM Reasoning Generalists with Preference Trees

(2404.02078)
Published Apr 2, 2024 in cs.AI, cs.CL, and cs.LG

Abstract

We introduce Eurus, a suite of LLMs optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.

Eurus models rival larger baselines on the LeetCode and TheoremQA benchmarks, matching GPT-3.5 Turbo's performance.

Overview

  • The paper introduces Eurus, a suite of LLMs, and UltraInteract, a novel dataset designed to improve complex reasoning in LLMs through fine-tuning and preference learning.

  • Eurus models have demonstrated superior reasoning performance over both open-source and some proprietary models, excelling in benchmarks like LeetCode and TheoremQA.

  • UltraInteract employs preference trees to offer diverse reasoning strategies and multi-turn interactions, significantly enhancing the problem-solving abilities of LLMs.

  • The research highlights the development of a new reward modeling objective in preference learning, leading to improved reasoning proficiency in the Eurus models.

Introduction to Eurus and UltraInteract

Recent advances in machine learning have significantly expanded the capabilities of LLMs across diverse tasks. A persistent challenge, however, is improving their performance on complex reasoning tasks. This paper introduces Eurus, a suite of LLMs that achieves strong results across a variety of benchmarks in mathematics, code generation, and logical reasoning, owing to the novel dataset UltraInteract. UltraInteract offers high-quality, large-scale alignment data curated specifically for complex reasoning, enabling both supervised fine-tuning and advanced preference learning strategies.

Eurus Models: Achievements in Reasoning

Eurus demonstrates exceptional capabilities over existing open-source models and even rivals proprietary models like GPT-3.5 Turbo in reasoning tasks. Notable results include 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, outperforming existing open-source models by margins of more than 13.3%. These milestones underscore the efficacy of UltraInteract in sharpening the reasoning skills of LLMs and establish Eurus as a leading open reasoning generalist.
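
As context for these numbers: pass@1 on code benchmarks such as LeetCode is typically computed with the standard unbiased pass@k estimator of Chen et al. (2021). The sketch below shows that estimator; its use here is an assumption about the evaluation harness, not a detail the paper states.

```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021),
# commonly used for code benchmarks like LeetCode. Given n generated samples
# per problem of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples passes."""
    if n - c < k:          # fewer failures than draws: a pass is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the raw pass rate:
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```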

UltraInteract: Constructing Preference Trees for Complex Reasoning

UltraInteract stands out for its construction of preference trees. For each instruction, the tree gathers (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and a critique, and (3) paired correct and incorrect actions for preference learning. Each preference tree thus enriches the dataset with diverse reasoning trajectories, promoting flexibility and depth in problem-solving approaches; this breadth of reasoning chains and interaction patterns is instrumental in the performance leap observed with Eurus models.
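
To make this structure concrete, below is a minimal sketch of how such a preference tree might be represented and how (chosen, rejected) pairs could be read off it for preference learning. The class and field names (PreferenceTreeNode, action, observation, is_correct) are illustrative assumptions, not UltraInteract's actual schema.

```python
# Hypothetical representation of an UltraInteract-style preference tree.
# Field names are assumptions for illustration, not the dataset's schema.
from dataclasses import dataclass, field

@dataclass
class PreferenceTreeNode:
    """One turn in a multi-turn reasoning trajectory."""
    action: str                # model response (reasoning chain or code)
    observation: str = ""      # environment execution result or critique
    is_correct: bool = False   # whether this action is judged correct
    children: list["PreferenceTreeNode"] = field(default_factory=list)

def extract_preference_pairs(root: PreferenceTreeNode) -> list[tuple[str, str]]:
    """Pair each correct action against an incorrect sibling, yielding the
    (chosen, rejected) data used for preference learning."""
    pairs: list[tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        correct = [c.action for c in node.children if c.is_correct]
        wrong = [c.action for c in node.children if not c.is_correct]
        pairs.extend((c, w) for c in correct for w in wrong)
        stack.extend(node.children)
    return pairs
```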

Insights from Preference Learning Exploration

A deep dive into preference learning within Eurus reveals intriguing findings. Contrary to their effectiveness in general conversational settings, algorithms like DPO prove less suitable for reasoning tasks, hinting at requirements unique to reasoning. This observation led to the development of a novel reward modeling objective that significantly amplified Eurus's reasoning proficiency, showcasing the importance of tailored preference learning approaches for reasoning capabilities.
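
As a rough illustration of the distinction, the sketch below contrasts the standard Bradley-Terry reward-modeling loss, which constrains only the margin between chosen and rejected rewards, with an augmented objective of the kind the paper motivates, which additionally pushes chosen rewards to be positive and rejected rewards to be negative in absolute terms. The exact formulation and weighting in the paper may differ; treat this as an assumption-laden sketch rather than the paper's objective.

```python
# Sketch: margin-only Bradley-Terry loss vs. an augmented reward-modeling
# loss that also constrains absolute reward values. The augmented form is an
# illustrative assumption, not the paper's exact objective.
import torch
import torch.nn.functional as F

def bt_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry objective: only the reward margin matters."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def augmented_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Margin term plus absolute terms encouraging r_chosen > 0 and
    r_rejected < 0, not merely r_chosen > r_rejected."""
    margin = -F.logsigmoid(r_chosen - r_rejected)
    absolute = -F.logsigmoid(r_chosen) - F.logsigmoid(-r_rejected)
    return (margin + absolute).mean()

# Example on a small batch of reward pairs:
rc = torch.tensor([1.2, 0.3, -0.1])
rr = torch.tensor([0.8, -0.5, -0.4])
print(bt_loss(rc, rr).item(), augmented_rm_loss(rc, rr).item())
```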

Theoretical and Practical Implications

The introduction of Eurus and UltraInteract not only sets new benchmarks for reasoning in LLMs but also opens avenues for future exploration. The detailed analysis of preference learning algorithms provides foundational insights into what constitutes effective learning paradigms for complex reasoning. Furthermore, the public availability of the Eurus models and the UltraInteract dataset equips the research community with powerful tools to continue advancing the frontiers of LLM reasoning.

Concluding Remarks

In sum, Eurus represents a significant stride forward in cultivating LLMs' reasoning capacities. Through UltraInteract's meticulously designed preference trees and the exploration of tailored preference learning techniques, Eurus achieves state-of-the-art results, challenging existing paradigms and setting the stage for future innovations in LLM reasoning generalists. The findings from this research not only elevate the capabilities of open-source models but also furnish valuable strategies for enhancing LLMs' reasoning through specialized alignment and learning methodologies.
