Advancing LLM Reasoning Generalists with Preference Trees (2404.02078v1)
Abstract: We introduce Eurus, a suite of LLMs optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning on a comprehensive benchmark of 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins of more than 13.3%. The strong performance of Eurus can be attributed primarily to UltraInteract, our newly curated large-scale, high-quality alignment dataset designed specifically for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and critiques, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks than for general conversations. Inspired by this, we derive a novel reward modeling objective that, together with UltraInteract, leads to a strong reward model.
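The abstract describes the preference-tree structure only at a high level, so a concrete sketch may help. The Python snippet below is a minimal, hypothetical rendering of such a tree and of how pairwise preference data could be read off it; the class names, fields, and pairing rule are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One model action at one turn of the interaction (hypothetical schema)."""
    action: str                       # a reasoning chain or code action
    correct: bool                     # verdict from the environment, e.g. unit tests
    observation: str = ""             # environment feedback / critique on the action
    children: List["Node"] = field(default_factory=list)  # follow-up attempts

@dataclass
class PreferenceTree:
    instruction: str                  # the root instruction of the tree
    turns: List[Node]                 # first-turn candidate actions

def pairwise_data(tree: PreferenceTree):
    """Emit (context, chosen, rejected) triples: at every turn, each correct
    action is paired against each incorrect sibling, conditioned on the
    interaction history accumulated so far."""
    pairs = []

    def walk(siblings: List[Node], history: str):
        good = [n for n in siblings if n.correct]
        bad = [n for n in siblings if not n.correct]
        for g in good:
            for b in bad:
                pairs.append((history, g.action, b.action))
        for n in siblings:
            walk(n.children, f"{history}{n.action}\n{n.observation}\n")

    walk(tree.turns, tree.instruction + "\n")
    return pairs
```

The abstract likewise mentions a derived reward modeling objective without stating its form. A plausible reading, sketched below under stated assumptions, is a standard Bradley-Terry pairwise term augmented with absolute terms that push chosen rewards up and rejected rewards down, rather than only widening their gap; the exact combination here is an assumption made for illustration.

```python
import torch.nn.functional as F

def reward_loss(r_chosen, r_rejected):
    # r_chosen / r_rejected: tensors of scalar rewards for preferred and
    # dispreferred responses in a batch of preference pairs.
    # Bradley-Terry pairwise term: prefer chosen over rejected (standard).
    l_bt = -F.logsigmoid(r_chosen - r_rejected)
    # Absolute-value terms (assumed form): raise r_chosen above zero and
    # push r_rejected below zero, not just their difference.
    l_abs = -F.logsigmoid(r_chosen) - F.logsigmoid(-r_rejected)
    return (l_bt + l_abs).mean()
```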