Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training (2309.17179v2)

Published 29 Sep 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Recent works like Tree-of-Thought (ToT) and Reasoning via Planning (RAP) aim to augment the reasoning capabilities of LLMs by using tree-search algorithms to guide multi-step reasoning. These methods rely on prompting a pre-trained model to serve as a value function and focus on problems with low search depth. As a result, these methods will not work in domains where the pre-trained LLM does not have enough knowledge to serve as an effective value function or in domains that require long-horizon planning. To address these limitations, we present an AlphaZero-like tree-search learning framework for LLMs (termed TS-LLM), systematically illustrating how tree-search with a learned value function can guide LLM decoding. TS-LLM distinguishes itself in two key ways. (1) Leveraging a learned value function and AlphaZero-like algorithms, our approach can be generally adaptable to a wide range of tasks, LLMs of any size, and tasks of varying search depths. (2) Our approach can guide LLMs during both inference and training, iteratively improving the LLM. Empirical results across reasoning, planning, alignment, and decision-making tasks show that TS-LLM outperforms existing approaches and can handle trees with a depth of 64.

References (52)
  1. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  2. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  3. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  4. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  5. Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pp. 72–83. Springer, 2006.
  6. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712, 2022.
  7. Dahoas. Synthetic-instruct-gptj-pairwise. https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise.
  8. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
  9. Chessgpt: Bridging policy learning and language modeling. arXiv preprint arXiv:2306.09200, 2023.
  10. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023.
  11. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  12. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
  13. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
  14. Learning and planning in complex action spaces. In International Conference on Machine Learning, pp. 4476–4486. PMLR, 2021.
  15. Maieutic prompting: Logically consistent reasoning with recursive explanations. arXiv preprint arXiv:2205.11822, 2022.
  16. Bandit based monte-carlo planning. In European conference on machine learning, pp.  282–293. Springer, 2006.
  17. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  18. Machine translation decoding beyond beam search. arXiv preprint arXiv:2104.05336, 2021.
  19. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
  20. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  21. Making ppo even better: Value-guided monte-carlo tree search decoding, 2023.
  22. Jieyi Long. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023.
  23. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
  24. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
  25. Nadia Matulewicz. Inductive program synthesis through using monte carlo tree search guided by a heuristic-based loss function. 2022.
  26. OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
  27. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  28. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  29. Christopher D Rosin. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.
  30. Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240, 2022.
  31. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
  32. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  33. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023.
  34. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017a.
  35. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017b.
  36. Reinforcement learning: An introduction. MIT press, 2018.
  37. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
  38. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  39. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  40. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
  41. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
  42. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  43. Decomposition enhances reasoning via self-evaluation guided decoding. arXiv preprint arXiv:2305.00633, 2023.
  44. Haotian Xu. No train still gain. unleash mathematical reasoning of large language models with monte carlo tree search guided by energy function. arXiv preprint arXiv:2309.03224, 2023.
  45. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
  46. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023a.
  47. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023b.
  48. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
  49. Planning with large language models for code generation. arXiv preprint arXiv:2303.05510, 2023.
  50. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
  51. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
  52. Solving math word problem via cooperative reasoning induced language models. arXiv preprint arXiv:2210.16257, 2022.

Summary

  • The paper presents TS-LLM, an AlphaZero-inspired tree-search framework that integrates a learned value function to guide LLM decoding and training.
  • It serves a dual purpose, guiding both inference and training: policy distillation and value-function learning iteratively improve the LLM's reasoning and output accuracy.
  • Empirical results show substantial gains on deep-planning tasks, outperforming chain-of-thought (CoT) baselines on tasks such as the chess endgame and RLHF alignment.

Alphazero-like Tree-Search can Guide LLM Decoding and Training

Introduction

Recent advances in guiding LLMs with tree-search algorithms highlight the potential for stronger reasoning. Prior approaches such as Tree-of-Thought (ToT) and Reasoning via Planning (RAP) boost performance using tree-search methods like BFS/DFS and MCTS, but they rely on prompting a pre-trained model to act as the value function and are constrained to shallow searches, typically 10 or fewer steps, which limits their effectiveness on tasks that require deeper planning.

This study introduces an AlphaZero-like framework (TS-LLM) that leverages a learned value function to expand the applicability of tree-search algorithms in LLM decoding and training across various problem domains with greater search depths.

Key Innovations

  1. Learned Value Function: TS-LLM uses a value function adapted from an LLM, which provides more reliable evaluations than prompt-based self-assessment (a minimal sketch of such a value head follows Figure 1 below).
  2. Dual Purpose - Training and Inference: Unlike methods that focus solely on inference, TS-LLM also integrates tree-search into LLM training, enabling iterative improvement through policy distillation and value-function learning (Figure 1).

    Figure 1: Overview of TS-LLM showing sentence-level and token-level node expansion paradigms for tree-search integration.
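
To make the first innovation concrete, the following is a minimal PyTorch-style sketch of a value function adapted from an LLM: a scalar value head attached on top of a decoder-only language model. The class and argument names (LLMWithValueHead, base_lm, hidden_size) are illustrative assumptions rather than the paper's implementation, and the base model is assumed to follow a HuggingFace-style interface.

```python
import torch
import torch.nn as nn

class LLMWithValueHead(nn.Module):
    """A decoder-only LM with an extra scalar head that scores partial generations.

    `base_lm` is assumed to expose HuggingFace-style outputs (`logits`,
    `hidden_states`); the names here are illustrative, not from the paper.
    """

    def __init__(self, base_lm: nn.Module, hidden_size: int):
        super().__init__()
        self.base_lm = base_lm
        # Scalar value head: maps the final hidden state to an estimated return.
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        outputs = self.base_lm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = outputs.hidden_states[-1]            # (batch, seq_len, hidden)
        # Locate the last non-padding token in each sequence.
        last_idx = attention_mask.sum(dim=1) - 1      # (batch,)
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        value = self.value_head(last_hidden).squeeze(-1)   # (batch,)
        return outputs.logits, value
```

The sketch covers only the forward pass; in TS-LLM the value head would be trained on reward-labeled trajectories, e.g., by regressing toward the task reward, with the exact objective depending on the task.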

Methodology

Tree-Search Algorithm

TS-LLM adopts AlphaZero-like tree-search algorithms to guide LLM decision-making during both inference and training. Nodes are expanded at either the sentence level or the token level, depending on the task, which enables deep searches.

  • Search Algorithm Variants:
    • BFS-V and DFS-V: These variants apply value-based pruning during breadth-first and depth-first traversal, respectively.
    • MCTS and MCTS-α: These variants combine Monte Carlo tree search with the learned value function to search robustly over candidate outputs and optimize cumulative reward (a minimal sketch of the value-guided search loop follows this list).
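
Below is a minimal sketch of how an AlphaZero-style, value-guided search loop (selection, expansion, evaluation, backup) could be organized for LLM decoding. The callables expand_fn (the LLM proposing sentence- or token-level continuations with priors) and value_fn (the learned value function) are placeholders; this is a simplified illustration, not the paper's implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                       # prompt plus the text generated so far
    prior: float = 1.0               # policy prior for the action leading here
    value_sum: float = 0.0
    visits: int = 0
    children: dict = field(default_factory=dict)

    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node: Node, c_puct: float = 1.0) -> Node:
    """PUCT rule: argmax_a Q(s,a) + c_puct * P(a) * sqrt(N(s)) / (1 + N(s,a))."""
    total_visits = sum(ch.visits for ch in node.children.values())
    return max(
        node.children.values(),
        key=lambda ch: ch.q()
        + c_puct * ch.prior * math.sqrt(total_visits) / (1 + ch.visits),
    )

def mcts_search(root: Node, expand_fn, value_fn, num_simulations: int = 50) -> Node:
    """expand_fn(state) -> {continuation_text: prior}; value_fn(state) -> float.
    Both stand in for the LLM policy proposal and the learned value head."""
    for _ in range(num_simulations):
        node, path = root, [root]
        # Selection: descend with PUCT until reaching a leaf node.
        while node.children:
            node = select_child(node)
            path.append(node)
        # Expansion: let the LLM propose sentence- or token-level continuations.
        for action, prior in expand_fn(node.state).items():
            node.children[action] = Node(state=node.state + action, prior=prior)
        # Evaluation: score the leaf with the learned value function (no rollout).
        leaf_value = value_fn(node.state)
        # Backup: propagate the value estimate along the visited path.
        for visited in path:
            visited.visits += 1
            visited.value_sum += leaf_value
    # The most-visited child of the root is the chosen next step.
    return max(root.children.values(), key=lambda ch: ch.visits)
```

A caller would typically wrap this in a loop that commits one step at a time, re-rooting the tree at the selected child, until a complete answer is produced.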

Evaluation and Training

The learned value function and the reward model are trained on datasets whose rewards are labeled from task-specific outcomes. TS-LLM's training paradigm then iteratively refines the LLM through three steps (sketched after the list below):

  • Policy Improvement: Tree-search is used to generate an improved dataset of outputs.
  • Policy Distillation: The LLM is fine-tuned with supervised learning on the search-augmented data.
  • Policy Evaluation: The value function is re-fitted on the newly generated, reward-labeled samples.
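
A minimal sketch of this iterative loop is shown below. The callables run_tree_search, score_outcome, distill, and fit_value are hypothetical placeholders injected by the caller, not an API from the paper; the sketch only fixes the ordering of the three steps.

```python
from typing import Callable, List

def ts_llm_training_loop(
    policy_lm,
    value_fn,
    prompts: List[str],
    run_tree_search: Callable,  # (policy_lm, value_fn, prompt) -> trajectory
    score_outcome: Callable,    # trajectory -> float reward from the task outcome
    distill: Callable,          # (policy_lm, good_trajectories) -> updated policy_lm
    fit_value: Callable,        # (value_fn, labeled_trajectories) -> updated value_fn
    num_iterations: int = 3,
):
    """Illustrative outer loop only; the injected callables are placeholders."""
    for _ in range(num_iterations):
        # Policy improvement: generate candidates with value-guided tree-search.
        trajectories = [run_tree_search(policy_lm, value_fn, p) for p in prompts]
        labeled = [(t, score_outcome(t)) for t in trajectories]
        # Policy distillation: supervised fine-tuning on positively rewarded outputs.
        good = [t for t, r in labeled if r > 0]
        policy_lm = distill(policy_lm, good)
        # Policy evaluation: re-fit the value function on the newly labeled data.
        value_fn = fit_value(value_fn, labeled)
    return policy_lm, value_fn
```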

Empirical Analysis

Performance Metrics

Results show TS-LLM's advantage on deep-planning tasks over prompting-based baselines. Across diverse tasks, including reasoning, planning, and alignment, the evaluations report consistent gains in accuracy and achieved reward.

  • Comparison of Path@1: TS-LLM consistently outperforms CoT baselines, particularly on complex tasks such as the chess endgame and RLHF alignment, and handles search trees up to a depth of 64 (Figure 2).

    Figure 2: Aggregated task results showing the progression in performance with increasing tree-search depth.

Scalability and Efficiency

Integrating tree-search into both training and inference lets the framework scale across tasks of varying complexity. However, the computational overhead of search, particularly during node expansion, remains substantial and motivates further optimization.

Conclusion

TS-LLM represents a significant step forward in integrating advanced tree-search methodologies with LLMs, supporting improvements in both performance and training efficiency. Future exploration may focus on addressing computational burdens and expanding the framework's applicability to broader domains, potentially transforming practices in LLM-based decision-making and reasoning tasks.

This work lays the foundation for continued advancements in AI-driven LLM optimization, fostering robust and adaptive AI systems for complex problem-solving scenarios.
