ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent (2312.10003v1)

Published 15 Dec 2023 in cs.CL

Abstract: Answering complex natural language questions often necessitates multi-step reasoning and integrating external information. Several systems have combined knowledge retrieval with a LLM to answer such questions. These systems, however, suffer from various failure cases, and we cannot directly train them end-to-end to fix such failures, as interaction with external knowledge is non-differentiable. To address these deficiencies, we define a ReAct-style LLM agent with the ability to reason and act upon external knowledge. We further refine the agent through a ReST-like method that iteratively trains on previous trajectories, employing growing-batch reinforcement learning with AI feedback for continuous self-improvement and self-distillation. Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model that achieves comparable performance on challenging compositional question-answering benchmarks with two orders of magnitude fewer parameters.


Summary

  • The paper demonstrates a self-improving multi-step reasoning agent that integrates ReAct and ReST to iteratively enhance LLM performance without human labels.
  • It employs a search loop, AI feedback, and few-shot Python prompts to refine responses, achieving accuracy gains with smaller models.
  • Experimental results show significant improvements after two iterations, with the fine-tuned small model matching the performance of the much larger prompted model.

ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent

Introduction

The paper presents a strategy for improving LLM agents on complex natural language tasks that require multi-step reasoning. It combines the ReAct and ReST approaches to build a self-improving agent that refines its behavior through AI feedback, without human-labeled training data. The agent couples retrieval with generation for question answering, with the goal of improving how it reasons over and acts upon external knowledge.

ReAct and ReST Approaches

The ReAct framework interleaves reasoning with actions and forms the foundation of the proposed agent: ReAct agents cycle through thought, action, and observation steps, retrieving information as needed to construct an answer. ReST, in turn, trains the model iteratively on previously collected trajectories, in the spirit of growing-batch reinforcement learning. In this work, AI feedback is applied to the reasoning process rather than only to final outcomes, which drives the agent's self-improvement.
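As a concrete illustration of the training recipe, the sketch below shows one possible outer loop: grow a batch of trajectories with the current agent, rank the full reasoning traces with AI feedback, and fine-tune on the best ones. The helper callables (run_agent, score_trajectory, fine_tune) and the hyperparameters are assumptions for illustration, not the paper's implementation.

```python
def self_improve(policy, questions, run_agent, score_trajectory, fine_tune,
                 num_iterations=2, keep_fraction=0.5):
    """ReST-style grow/improve loop (illustrative sketch).

    run_agent, score_trajectory, and fine_tune are caller-supplied stand-ins
    for the ReAct agent, the AI-feedback ranker, and the training step.
    """
    for _ in range(num_iterations):
        # Grow step: sample ReAct-style trajectories with the current agent.
        trajectories = [run_agent(policy, q) for q in questions]

        # AI feedback: rank complete reasoning trajectories (the process),
        # not just the final answers (the outcome).
        ranked = sorted(trajectories, key=score_trajectory, reverse=True)

        # Improve step: fine-tune on the highest-ranked trajectories; the same
        # data can also be used to distill the behavior into a smaller model.
        keep = ranked[: max(1, int(len(ranked) * keep_fraction))]
        policy = fine_tune(policy, keep)
    return policy
```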

System Architecture

The proposed system consists of a search loop followed by self-revision steps. Given a question, the agent first decides whether additional information is needed and issues search queries as necessary. Each query retrieves relevant snippets, which are summarized and folded into a preliminary response. The draft is then checked for relevance and grounding and revised before the final answer is produced.
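The decision flow can be pictured roughly as follows; this is a minimal sketch that assumes generic llm and search callables supplied by the caller, not the paper's actual interfaces.

```python
def answer_question(question, llm, search, max_search_steps=5):
    """Illustrative ReAct-style search loop followed by a self-revision pass."""
    snippets = []
    for _ in range(max_search_steps):
        # Decide whether more external information is needed for this question.
        decision = llm(
            f"Question: {question}\nEvidence so far: {snippets}\n"
            "If more information is needed, reply with a search query; "
            "otherwise reply DONE."
        )
        if decision.strip().upper() == "DONE":
            break
        # Retrieve results for the proposed query and summarize them.
        results = search(decision)
        snippets.append(llm(f"Summarize for '{question}': {results}"))

    # Draft an answer from the collected evidence, then revise it, checking
    # relevance and grounding before emitting the final answer.
    draft = llm(f"Question: {question}\nEvidence: {snippets}\nDraft an answer.")
    return llm(
        "Revise the draft so it answers the question and every claim is "
        f"grounded in the evidence.\nQuestion: {question}\n"
        f"Evidence: {snippets}\nDraft: {draft}"
    )
```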

The agent is driven by pre-designed few-shot prompts formatted as Python code, which keeps its outputs structured and easy to parse. Because this format is both structured and descriptive, it also appears to help the model follow and continue the intended reasoning pattern.
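For illustration only, a single thought/action/observation step might be rendered in the prompt roughly as in the snippet below; the exact schema and field names are assumptions rather than the paper's prompt format.

```python
# Hypothetical Python-style rendering of one reasoning step in a few-shot prompt.
step = {
    "thought": "I need the film's release year before comparing the two directors.",
    "action": "search",
    "action_input": "release year of the film Gattaca",
    "observation": "Gattaca is a 1997 American science fiction film.",
}
```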

Experimental Setup

The experiments use PaLM 2 models of several sizes (XS, S, and L), and the agent is evaluated on the Bamboogle and BamTwoogle benchmarks, which contain compositional, multi-hop questions that cannot be answered by a single direct search query. Evaluation relies on an auto-evaluation mechanism aligned with human judgments, and a separate dataset is used for evaluation so that it stays distinct from the training data.
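The auto-evaluation can be thought of as an LLM judge comparing the agent's answer against a reference. The sketch below is an assumed minimal version; the judge model, prompt wording, and output protocol are all illustrative.

```python
def auto_eval(judge_llm, question, reference_answer, candidate_answer):
    """Illustrative LLM-as-judge correctness check (not the paper's exact setup)."""
    verdict = judge_llm(
        "Decide whether the candidate answer is correct.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with exactly one word: correct or incorrect."
    )
    return verdict.strip().lower().startswith("correct")
```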

Results

The experiments show that the proposed self-improvement strategy substantially enhances model performance. After two iterations, a much smaller fine-tuned model reaches performance comparable to the large prompted model, illustrating the efficacy of process-driven, feedback-based training in resource-constrained setups. Both auto-evaluation and human evaluation confirm these gains, with notable accuracy improvements over successive iterations.

Applying the self-critique (self-revision) step yields a modest additional gain, underscoring the value of iterative validation in multi-step reasoning. Fine-tuning on human-filtered trajectories, by contrast, offers only marginal benefit over unfiltered data, suggesting the algorithm is robust to noise in its self-generated training data.

Implications and Future Work

The methodology has practical implications for scalable AI systems in settings where human labeling is infeasible or costly. By leveraging synthetic, self-generated data for iterative refinement, LLM agents can improve progressively without extensive labeled datasets. Future work could extend the framework to a more diverse set of tools and examine how well the iterative improvements scale over additional generations of models.

Conclusion

This work presents a feasible and effective approach to enhancing the reasoning capabilities of LLM agents via self-improvement. By integrating the ReAct and ReST methodologies in a structured, feedback-driven training regime, it charts a path toward more autonomous, efficient, and scalable agents capable of complex multi-step reasoning, models that need less human oversight while improving their ability to interpret questions and interact with external knowledge.
