Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing (2404.12253v2)

Published 18 Apr 2024 in cs.CL and cs.LG

Abstract: Despite the impressive capabilities of LLMs on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work has proposed advanced prompting techniques and fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet the efficacy of LLMs in self-refining their responses, particularly for complex reasoning and planning tasks, remains dubious. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLMs for self-improvement, including data scarcity, the vast search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models for precise feedback. Our experimental results on mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.


Summary

  • The paper introduces AlphaLLM, a framework that integrates Monte Carlo Tree Search with critic models to enable iterative self-improvement of LLMs.
  • It employs synthetic prompt generation and efficient search strategies to address data scarcity and navigate the vast space of token combinations.
  • Empirical results on GSM8K and MATH datasets demonstrate near-GPT-4 accuracy, highlighting significant improvements in reasoning and planning.

Overview of "Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing"

This overview examines the mechanics and application of "Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing" (2404.12253). The paper introduces AlphaLLM, a framework that fosters self-improvement in LLMs by integrating Monte Carlo Tree Search (MCTS) with an LLM policy. The process combines imagination for data synthesis, efficient search strategies, and critic models for evaluation, drawing on the principles behind AlphaGo's success.

Introduction

LLMs are highly capable across a variety of NLP tasks but face significant challenges in complex reasoning and planning scenarios. Standard methods such as advanced prompting and supervised fine-tuning rely heavily on high-quality datasets, which can be scarce. To address these challenges, self-improvement strategies use feedback on past responses and self-assessed rewards, yet concerns remain about the efficacy of LLMs' self-correction capabilities, especially in tasks requiring complex reasoning.

Inspired by AlphaGo's success, AlphaLLM integrates MCTS with LLMs to improve exploration and learning in language tasks. This integration poses challenges such as data scarcity, the enormous space of token combinations, and subjective feedback in natural language tasks. AlphaLLM's framework includes prompt synthesis for data generation, efficient search strategies for exploration, and critic models for feedback (Figure 1).

Figure 1: The imagination-searching-criticizing self-improvement loop: the imagination component synthesizes prompts as new learning examples, while MCTS searches for better trajectories guided by critic signals to improve the policy.

AlphaLLM builds on existing research in search strategies and LLM self-improvement. Beam search techniques and MCTS variants have been studied for complex reasoning tasks such as math problem solving. AlphaLLM keeps the definition of a search step flexible and explores integrating reinforcement learning with LLM self-correction.

Advanced prompt-synthesis methods, such as Self-instruct and Evol-instruct, help create diverse data for LLM training. Self-improvement frameworks have evolved from early heuristic, rule-based refinement to leveraging LLMs themselves for self-assessment, for example by generating critique data or using external tools for better trajectory evaluation.

AlphaLLM Framework

Data Synthesizing

The data synthesizing component of AlphaLLM mitigates data scarcity by generating synthetic prompts from initial datasets or existing tasks. This synthesis uses transformation functions that may include LLM-generated or heuristic-based instructions, thereby enhancing the diversity and robustness of the training data.
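
As a rough illustration of this idea, the snippet below sketches one possible LLM-based transformation function in the spirit of Self-instruct/Evol-instruct style rewriting. The prompt template and the `llm_complete` callable are assumptions introduced for this example, not the paper's implementation.

```python
# Hedged sketch of synthetic prompt generation via an LLM-based
# transformation function. `llm_complete` is a placeholder for any
# text-completion call; the rewrite template is illustrative only.
import random

REWRITE_TEMPLATE = (
    "Here is a math word problem:\n{seed}\n\n"
    "Write one new problem of similar difficulty about a different topic."
)

def synthesize_prompts(seed_prompts, llm_complete, n=100):
    """Generate n new prompts by transforming randomly chosen seeds."""
    synthetic = []
    for _ in range(n):
        seed = random.choice(seed_prompts)
        new_prompt = llm_complete(REWRITE_TEMPLATE.format(seed=seed))
        synthetic.append(new_prompt.strip())
    return synthetic
```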

Monte Carlo Tree Search (MCTS)

AlphaLLM employs option-level MCTS to address the vast search space of LLMs. Unlike token-level or sentence-level approaches, option-level MCTS expands the tree over options, i.e., sequences of tokens or phrases, which reduces search depth while still exploring a broad set of possibilities. Its components include importance-weighted expansion for dynamic branching, state merging to avoid redundant, near-identical states, and a fast rollout policy using a specialized LLM (Figure 2).

Figure 2: An overview of the four MCTS operations. A node is selected, expanded, and simulated with the fast rollout policy until a terminal node is reached; the signals from the value function, PRM, and ORM are then backpropagated.
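
The sketch below illustrates the selection and backpropagation operations at the option level, where each child edge corresponds to a multi-token continuation rather than a single token. The node layout, the UCT exploration constant, and the reward plumbing are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal option-level MCTS skeleton: UCT selection over option-valued
# edges and backpropagation of critic-provided rewards. Details such as
# the exploration constant are illustrative assumptions.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                       # partial solution text so far
    option: str = ""                 # the multi-token span that led here
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def uct(self, c: float = 1.4) -> float:
        if self.visits == 0:
            return float("inf")      # explore unvisited options first
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def select(root: Node) -> Node:
    """Descend greedily by UCT until reaching a leaf node."""
    node = root
    while node.children:
        node = max(node.children, key=Node.uct)
    return node

def backpropagate(node: Node, reward: float) -> None:
    """Push a reward (e.g., from the value function or ORM) back to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent
```

Expansion would add children by sampling candidate options from the policy, with importance-weighted branching and state merging applied before simulation; those steps are omitted here for brevity.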

Critic Models

Critic models in AlphaLLM include a value function for predicting future reward, a process reward model (PRM) for evaluating intermediate nodes, and an outcome reward model (ORM) for assessing the overall quality of a trajectory. These models are trained on specialized datasets and leverage both intrinsic knowledge and external tools for comprehensive trajectory evaluation.
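
A minimal sketch of how the three critic signals might be exposed behind one interface is shown below; the callable signatures and the equal weighting of the value function and PRM are assumptions made for this illustration, not the paper's exact scoring.

```python
# Illustrative container for the three critics. The interfaces and the
# 50/50 blending of value-function and PRM signals are assumptions.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Critics:
    value_fn: Callable[[str], float]       # predicted future reward of a state
    prm: Callable[[str, str], float]       # process reward for the latest step
    orm: Callable[[Sequence[str]], float]  # outcome reward for a full trajectory

    def node_score(self, state: str, step: str) -> float:
        """Blend value-function and PRM signals for an intermediate node."""
        return 0.5 * self.value_fn(state) + 0.5 * self.prm(state, step)

    def outcome_score(self, trajectory: Sequence[str]) -> float:
        """Score a completed trajectory with the ORM."""
        return self.orm(trajectory)
```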

Policy Self-Improvement

AlphaLLM's self-improvement process iteratively refines the policy through data generation and model fine-tuning. Synthetic prompts and high-quality MCTS-generated trajectories feed the training loop, and results are evaluated against benchmarks to verify continual improvement (Figure 3).

Figure 3: Empirical analysis on GSM8K of different self-improving data collection methods and numbers of iterations. Models are evaluated with greedy decoding and with MCTS using small and large numbers of rollouts. Two iterations of self-improvement are conducted using data from reranking and from MCTS.
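
The loop below sketches how these pieces could fit together over a couple of iterations, keeping only trajectories the ORM rates highly as fine-tuning data. The helper callables, the acceptance threshold, and the (prompt, response) packaging are placeholders for illustration, not the authors' implementation.

```python
# Sketch of the iterative policy self-improvement loop. All helpers
# (synthesize_prompts, mcts_search, fine_tune) and the ORM threshold are
# hypothetical placeholders supplied by the caller.

def self_improve(policy, seed_prompts, critics,
                 synthesize_prompts, mcts_search, fine_tune,
                 num_iterations=2, orm_threshold=0.5):
    """Run a few rounds of search, filter, and fine-tune without extra labels."""
    for _ in range(num_iterations):
        # Imagination: expand the prompt pool with synthetic examples.
        prompts = synthesize_prompts(seed_prompts)

        training_pairs = []
        for prompt in prompts:
            # Searching: MCTS guided by the critics proposes a trajectory.
            trajectory = mcts_search(policy, prompt, critics)
            # Criticizing: keep only trajectories the ORM rates highly.
            if critics.outcome_score(trajectory) >= orm_threshold:
                training_pairs.append((prompt, "".join(trajectory)))

        # Policy improvement: fine-tune on the accepted (prompt, response) pairs.
        policy = fine_tune(policy, training_pairs)
    return policy
```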

Experimental Results

AlphaLLM demonstrates significant performance improvements over base models on the GSM8K and MATH datasets, approaching GPT-4 accuracy. The empirical results highlight the efficiency of MCTS-based decoding and suggest that further self-improvement iterations are possible with few labeled data requirements, pointing toward scalable self-improvement strategies for LLMs.

Conclusion

AlphaLLM represents a significant advancement in self-improvement for LLMs via imagination, searching, and criticizing. By overcoming challenges associated with data scarcity, search efficiency, and subjective feedback, AlphaLLM fosters continual improvement in complex language tasks, drawing parallels with AlphaGo and indicating a promising direction for future reinforcement learning applications in LLMs.
