ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agents

Abstract

Answering complex natural language questions often necessitates multi-step reasoning and integrating external information. Several systems have combined knowledge retrieval with an LLM to answer such questions. These systems, however, suffer from various failure cases, and we cannot directly train them end-to-end to fix such failures, as interaction with external knowledge is non-differentiable. To address these deficiencies, we define a ReAct-style LLM agent with the ability to reason and act upon external knowledge. We further refine the agent through a ReST-like method that iteratively trains on previous trajectories, employing growing-batch reinforcement learning with AI feedback for continuous self-improvement and self-distillation. Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model that achieves comparable performance on challenging compositional question-answering benchmarks with two orders of magnitude fewer parameters.

Overview

  • The paper introduces a method that enhances LLMs with the ability to execute multi-step reasoning and interact with external knowledge sources.

  • The system is based on a Search Agent that adopts ReAct-style reasoning combined with a self-improvement loop through reinforcement learning and AI feedback.

  • A ReST-like protocol is used for self-training: the model is refined by sampling trajectories from recent policies and using an LLM to rank the outcomes directly.

  • Performance is tested on two datasets, Bamboogle and BamTwoogle, which consist of compositional questions that a single query to a standard search engine cannot answer.

  • The research shows that the agent can self-improve without human-labeled training data, suggesting potential for more autonomous and efficient LLM training processes.

Introduction to the Concept and Approach

The paper presents an approach to answering complex natural language questions that require multi-step reasoning and external information. Prior work has integrated knowledge retrieval with LLMs to handle such questions, but these systems exhibit various failure modes and cannot be trained end-to-end to fix them, because interaction with external knowledge is non-differentiable. Consequently, the authors define an agent that enriches an LLM with the capacity to reason over and act upon external knowledge sources. The agent is then refined with a ReST-like training protocol that iteratively self-trains on past trajectories, combining growing-batch reinforcement learning with AI feedback for ongoing self-improvement and self-distillation.

Underlying Agent Architecture

The work is rooted in the ReAct method, which interleaves chain-of-thought reasoning with actions and observations over multiple rounds. Here, the Search Agent is prompted to produce long-form, traceable answers grounded in its search results. Improving such an agent's robustness and efficacy typically requires extensive human-labeled data, which is costly and slow to collect. Instead, the paper leverages a self-critique step together with AI feedback and synthetic data to enhance the agent's capabilities, diverging from the traditional reliance on human-labeled training data.
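
The reasoning-acting loop can be pictured with a short sketch. Below is a minimal, illustrative version of such a loop, assuming hypothetical `llm` (prompt-to-completion) and `search` (query-to-snippet) callables; it condenses the separate Thought/Action steps into one model call and is not the authors' implementation.

```python
# Minimal sketch of a ReAct-style search loop. `llm` and `search` are
# hypothetical callables, not the paper's API; Thought and Action are
# condensed into a single model call for brevity.

def react_search_agent(question: str, llm, search, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")  # model reasons and picks an action
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Search[" in step:
            query = step.split("Search[", 1)[1].split("]", 1)[0]
            observation = search(query)  # non-differentiable external call
            transcript += f"Observation: {observation}\n"
    # Step budget exhausted: force the model to commit to an answer.
    return llm(transcript + "Final Answer:").strip()
```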

Improved Training via Self-Improvement Loop

An essential aspect is applying the ReST algorithm to an agent setting: the dataset is grown by sampling trajectories from the most recent policy, and the policy is then improved by fine-tuning on that fixed dataset, with an LLM serving as the ranking model. Rewards are assigned to complete multi-step trajectories via direct AI-powered ranking rather than per-step human labels. The agent's capability is gauged by its ability to tackle compositional questions that evade simple search engines. Through this iterative process, the large model improves, and much smaller models fine-tuned on its trajectories achieve comparable performance, furnishing evidence of both self-improvement and self-distillation.
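
As a rough illustration of that loop, the sketch below alternates a grow step (sampling trajectories from the current policy) with an improve step (fine-tuning on the trajectories an AI ranker scores highest). Here `policy.sample`, `reward_model.score`, and `finetune` are assumed interfaces standing in for unpublished training code, and the hyperparameter values are placeholders.

```python
from collections import defaultdict

def rest_self_improvement(policy, reward_model, questions, finetune,
                          iterations=2, samples_per_question=32, keep_top_k=4):
    """Iterate ReST-style grow and improve steps over the agent policy."""
    for _ in range(iterations):
        # Grow step: expand the dataset by sampling multi-step trajectories
        # from the most recent policy (growing-batch reinforcement learning).
        samples = defaultdict(list)
        for question in questions:
            for _ in range(samples_per_question):
                samples[question].append(policy.sample(question))
        # Improve step: score complete trajectories with AI feedback and
        # keep only the top-ranked ones per question as fine-tuning data.
        dataset = []
        for question, trajectories in samples.items():
            trajectories.sort(key=reward_model.score, reverse=True)
            dataset.extend((question, t) for t in trajectories[:keep_top_k])
        policy = finetune(policy, dataset)  # train on the fixed dataset
    return policy
```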

Evaluating Agent Performance

The paper adopts two primary datasets, Bamboogle and BamTwoogle, to evaluate the Search Agent. Both consist of compositional questions intentionally crafted so that a single search cannot answer them; each question requires multiple searches to answer accurately. This task serves as a testbed for the agent's effectiveness under both human and automated evaluation. The combination of iterative training, AI feedback, and careful pacing of the training iterations yields models that improve without any human-labeled data, a significant step toward the autonomous enhancement of LLMs.
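
To make the automated side of the evaluation concrete, here is an illustrative LLM-as-judge accuracy loop; the `agent` and `judge_llm` callables and the judging prompt are hypothetical, not the paper's actual evaluation harness.

```python
def auto_eval(agent, judge_llm, dataset):
    """Score agent answers against reference answers with an LLM judge."""
    correct = 0
    for question, gold_answer in dataset:
        answer = agent(question)
        verdict = judge_llm(
            f"Question: {question}\n"
            f"Reference answer: {gold_answer}\n"
            f"Model answer: {answer}\n"
            "Does the model answer agree with the reference? Reply yes or no."
        )
        correct += verdict.strip().lower().startswith("yes")
    return correct / len(dataset)  # accuracy over the benchmark
```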
