
Advancing LLM Reasoning Generalists with Preference Trees (2404.02078v1)

Published 2 Apr 2024 in cs.AI, cs.CL, and cs.LG

Abstract: We introduce Eurus, a suite of LLMs optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.


Summary

  • The paper introduces Eurus, a suite of LLMs that leverages UltraInteract’s preference trees to outperform existing open-source models by margins of more than 13.3% on challenging benchmarks such as LeetCode and TheoremQA.
  • It details a novel reward modeling approach that overcomes the limitations of traditional preference learning algorithms such as DPO in complex reasoning tasks.
  • The research offers publicly accessible Eurus models and the UltraInteract dataset, enabling further advancements in LLM reasoning strategies.

Advancing LLM Reasoning Generalists with Preference Trees

Introduction to Eurus and UltraInteract

Recent advances in machine learning have significantly propelled the capabilities of LLMs across diverse tasks, yet complex reasoning remains a persistent challenge. This paper introduces Eurus, a suite of LLMs that achieves strong results across benchmarks in mathematics, code generation, and logical reasoning, owing to the newly curated UltraInteract dataset. UltraInteract provides high-quality, large-scale alignment data designed specifically for complex reasoning, supporting both supervised fine-tuning and preference learning.
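As a rough illustration of these two uses, the snippet below turns a single hypothetical UltraInteract-style record into supervised fine-tuning examples (verified-correct solutions only) and preference pairs (correct versus incorrect solutions). The record layout and field names are assumptions made for illustration, not the dataset's actual schema.

```python
# Hypothetical record layout; UltraInteract's real schema may differ.
record = {
    "instruction": "Write a function that returns the n-th Fibonacci number.",
    "correct_solutions": [
        "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"
    ],
    "incorrect_solutions": ["def fib(n):\n    return n"],  # failed verification
}

# Supervised fine-tuning: instruction paired with a verified-correct solution.
sft_examples = [
    {"prompt": record["instruction"], "completion": sol}
    for sol in record["correct_solutions"]
]

# Preference learning: (prompt, chosen, rejected) triples from the same record.
preference_pairs = [
    {"prompt": record["instruction"], "chosen": good, "rejected": bad}
    for good in record["correct_solutions"]
    for bad in record["incorrect_solutions"]
]
```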

Eurus Models: Achievements in Reasoning

Eurus demonstrates capabilities beyond those of existing open-source models, with Eurus-70B surpassing GPT-3.5 Turbo in reasoning across a comprehensive suite of benchmarks. Its noteworthy results include strong performance on stringent benchmarks such as LeetCode and TheoremQA, where it outperforms existing open-source models by margins of more than 13.3%. These milestones underscore the efficacy of UltraInteract in sharpening the reasoning skills of LLMs, making Eurus a leading force among open-source reasoning generalists.
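For reference, the LeetCode and TheoremQA figures quoted in the abstract are pass@1 scores. The sketch below is the standard unbiased pass@k estimator popularized by the HumanEval evaluation (Chen et al., 2021); with a single sample per problem it reduces to the plain success rate.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem, given n sampled
    solutions of which c passed verification (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples per problem, 3 of them correct -> estimated pass@1 of 0.3.
print(pass_at_k(n=10, c=3, k=1))
```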

UltraInteract: Constructing Preference Trees for Complex Reasoning

UltraInteract distinguishes itself by constructing preference trees that encapsulate diverse reasoning strategies, multi-turn interactions with the environment and a critique model, and pairwise data for preference learning. Each preference tree enriches the dataset with diverse reasoning trajectories, promoting flexibility and depth in problem-solving approaches. This broad spectrum of reasoning chains and interaction patterns is instrumental in the performance leap observed with the Eurus models.
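To make the structure concrete, the following dataclasses sketch one plausible in-memory representation of such a preference tree: each node is an attempted action annotated with feedback, and its children are revised attempts in the next turn. The field names are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActionNode:
    """One attempted action (a reasoning step or code) in a trajectory."""
    action: str                     # the model's output at this turn
    is_correct: bool                # whether it passed verification (tests, checker, ...)
    critique: Optional[str] = None  # textual feedback from the environment or critic
    children: List["ActionNode"] = field(default_factory=list)  # next-turn retries


@dataclass
class PreferenceTree:
    """All trajectories explored for a single instruction."""
    instruction: str
    attempts: List[ActionNode] = field(default_factory=list)


# A tiny two-turn example: a wrong first attempt is critiqued, then corrected.
tree = PreferenceTree(
    instruction="Return the sum of all even numbers in a list.",
    attempts=[
        ActionNode(
            action="def f(xs): return sum(xs)",
            is_correct=False,
            critique="Sums every number instead of only the even ones.",
            children=[
                ActionNode(
                    action="def f(xs): return sum(x for x in xs if x % 2 == 0)",
                    is_correct=True,
                ),
            ],
        ),
    ],
)
```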

Insights from Preference Learning Exploration

An in-depth exploration of preference learning with UltraInteract reveals an intriguing finding: contrary to their effectiveness in general conversation, well-established algorithms such as DPO appear less suitable for reasoning tasks, hinting at requirements unique to reasoning. This observation motivated a novel reward modeling objective which, together with UltraInteract, yields a strong reward model and further improves Eurus's reasoning proficiency, underscoring the value of tailored approaches to preference learning for reasoning capabilities.
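To make the contrast concrete, the first function below implements the standard DPO loss (Rafailov et al., 2023). The second is only a hedged sketch of the style of reward modeling objective the paper alludes to, assuming it augments the Bradley-Terry ranking term with terms that push chosen rewards up and rejected rewards down in absolute value; the exact objective used for Eurus should be taken from the paper itself.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: maximize the margin between the policy-vs-reference
    log-ratios of chosen and rejected responses (Rafailov et al., 2023)."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()


def reward_model_loss(r_chosen, r_rejected):
    """Assumed sketch of a reasoning-oriented reward objective: the usual
    Bradley-Terry ranking term plus absolute terms that push chosen rewards
    up and rejected rewards down, rather than only their difference."""
    ranking = -F.logsigmoid(r_chosen - r_rejected)                  # relative ordering
    absolute = -F.logsigmoid(r_chosen) - F.logsigmoid(-r_rejected)  # absolute values
    return (ranking + absolute).mean()


# Toy usage with scalar rewards for one (chosen, rejected) pair.
r_c, r_r = torch.tensor([2.0]), torch.tensor([-1.0])
print(reward_model_loss(r_c, r_r).item())
```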

Theoretical and Practical Implications

The introduction of Eurus and UltraInteract not only sets new benchmarks for LLM reasoning but also opens avenues for future exploration. The detailed analysis of preference learning algorithms provides foundational insights into what constitutes effective learning paradigms for complex reasoning. Furthermore, the public availability of the Eurus models and the UltraInteract dataset equips the research community with powerful tools to continue advancing the frontiers of LLM reasoning.

Concluding Remarks

In sum, Eurus represents a significant stride forward in cultivating LLMs' reasoning capacities. Through UltraInteract's meticulously designed preference trees and the exploration of tailored preference learning techniques, Eurus achieves state-of-the-art results, challenging existing paradigms and setting the stage for future innovations in LLM reasoning generalists. The findings from this research not only elevate the capabilities of open-source models but also furnish valuable strategies for enhancing LLMs' reasoning through specialized alignment and learning methodologies.

HackerNews

  1. Can LLMs Every Reason? (1 point, 1 comment)