Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

Abstract

Recent months have seen the emergence of a powerful new trend in which LLMs are augmented to become autonomous language agents capable of performing objective-oriented, multi-step tasks on their own, rather than merely responding to queries from human users. Most existing language agents, however, are not optimized with environment-specific rewards. Although some agents enable iterative refinement through verbal feedback, they do not reason and plan in ways that are compatible with gradient-based learning from rewards. This paper introduces a principled framework for reinforcing large language agents by learning a retrospective model, which automatically tunes the language agent's prompts from environment feedback through policy gradient. Specifically, our proposed agent architecture learns from rewards across multiple environments and tasks in order to fine-tune a pre-trained language model that refines the agent's prompt by summarizing the root causes of prior failed attempts and proposing action plans. Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment. This suggests that policy gradient optimization of language agents, of which we believe our work is one of the first examples, is promising and can be extended to optimize other models in the agent architecture to enhance performance over time.

Overview

  • Retroformer introduces a framework that uses policy gradient optimization to improve LLM-based agents at task completion.

  • It stands out from other language agents by using environment-specific feedback to improve action plans and address past mistakes.

  • Unlike most language agents, Retroformer draws on reinforcement learning techniques to learn from environment rewards, enhancing its decision-making.

  • The architecture includes a fixed actor LLM and a retrospective LM, which is fine-tuned through reinforcement learning to provide actionable feedback.

  • Experiments show that Retroformer outperforms existing baselines on tasks requiring complex reasoning and planning.

Introduction

LLMs have evolved into autonomous language agents capable of undertaking objective-driven tasks independently, rather than merely answering queries. Recent advances such as ReAct, Toolformer, and LangChain demonstrate how LLMs can drive autonomous decision-making through text outputs that trigger API calls and operations within specific environments. Although their large parameter counts let LLMs generate plausible text and actions, most are not optimized in concert with environment-specific reward functions. The few agents that do use such feedback rely on verbal self-reflection for iterative refinement and remain incompatible with gradient-based learning that harnesses reinforcement learning techniques. Retroformer addresses this gap by reinforcing language agents to refine their prompts through policy gradient optimization, leveraging environment feedback to shape action plans and reflecting on prior failures to improve performance.

Related Work

Retroformer situates itself within a growing body of research on autonomous language agents for multi-step task completion. Earlier work such as Chain-of-Thought pioneered decomposing complex reasoning tasks, while approaches like ReAct harnessed these faculties of LLMs for interaction with digital environments. Most of these models, however, do not learn from environment rewards, which limits their performance. Some, like Reflexion, improve an agent's skills through self-reflection but still do not exploit gradient signals explicitly. In contrast, the policy gradient optimization at the core of Retroformer enables effective planning and decision-making by learning from environment feedback.

Challenges & Intuition

Applying LLM-based agents to problems involving tool use and action presents several challenges: spurious actions, limited prompt lengths, heuristic prompt engineering, and the difficulty of optimizing the LLM directly. Classical reinforcement learning (RL) agents, though less adept in zero-shot settings for text-rich environments, exemplify continual improvement from environment feedback. Retroformer harnesses classical RL optimization, such as policy gradient algorithms, to iteratively enhance performance while steering clear of direct fine-tuning of the actor LLM, making it a robust yet straightforward way to equip agents with state and memory.
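
To make the policy-gradient idea concrete, below is a minimal REINFORCE-style sketch in which environment reward feedback updates a small trainable model (such as the retrospective LM described in the next section) while the actor LLM stays frozen. The function name, the Hugging Face-style outputs.loss interface, and the use of a reward improvement as the advantage are illustrative assumptions, not the paper's actual training code.

    # Illustrative REINFORCE-style update for a small trainable language model.
    # Assumption: `retro_lm` is a Hugging Face-style causal LM whose forward pass
    # returns the mean negative log-likelihood in `outputs.loss` when `labels`
    # are supplied; `advantage` is the reward improvement attributed to the
    # sampled reflection text (hypothetical names, not the paper's API).
    def policy_gradient_step(retro_lm, optimizer, reflection_ids, advantage):
        """One REINFORCE update: reinforce reflections that raised the reward."""
        outputs = retro_lm(input_ids=reflection_ids, labels=reflection_ids)
        # outputs.loss is -log pi(reflection), so advantage * loss = -advantage * log pi;
        # minimizing it increases the likelihood of reflections with positive advantage.
        loss = advantage * outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()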

Reinforcing Retrospective Language Agent

Retroformer pairs two components: an actor LLM and a retrospective LM. The actor is a frozen LLM, while the retrospective model is a smaller LM refined through RL techniques. The retrospective LM is fine-tuned to generate verbal feedback that effectively serves as a refined prompt for the actor. By integrating policy gradient optimization, Retroformer learns from arbitrary reward information across multiple environments and tasks, allowing iterative refinements that improve the agent's learning speed and task success rates. Experiments demonstrate Retroformer's capacity to outperform baselines on tasks such as HotPotQA, showcasing the utility of gradient-based reasoning and planning.
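
As a rough illustration of this actor/retrospective loop, the sketch below uses hypothetical actor_llm, retro_lm, and env interfaces (generate, reset, and step are placeholder method names, not the paper's API). The actor is treated as a frozen black box; the retrospective LM turns each failed trajectory into a reflection appended to the next prompt, and the change in reward between trials is stored as the rating later used for the policy-gradient fine-tuning sketched above.

    # Hypothetical interfaces: actor_llm.generate(text) -> str, retro_lm.generate(text) -> str,
    # env.reset() -> obs, env.step(action) -> (obs, reward, done).
    def run_episode(actor_llm, env, prompt):
        """Roll out one attempt; return the trajectory text and the final reward."""
        obs, done, reward, trajectory = env.reset(), False, 0.0, []
        while not done:
            context = prompt + "\n" + "\n".join(trajectory) + f"\nObservation: {obs}"
            action = actor_llm.generate(context)
            obs, reward, done = env.step(action)
            trajectory.append(f"Action: {action}\nObservation: {obs}")
        return "\n".join(trajectory), reward

    def retrospective_loop(actor_llm, retro_lm, env, base_prompt, num_trials=3):
        """Refine the actor's prompt over trials using reflections from the retrospective LM."""
        prompt, prev_reward, experience = base_prompt, 0.0, []
        for _ in range(num_trials):
            trajectory, reward = run_episode(actor_llm, env, prompt)
            # The smaller retrospective LM summarizes the root cause of failure
            # and proposes a plan; its output becomes part of the next prompt.
            reflection = retro_lm.generate(f"Failed attempt:\n{trajectory}\nReflect and plan:")
            prompt = base_prompt + "\nReflection: " + reflection
            # The reward difference rates the reflection; these tuples are the
            # training data for policy-gradient fine-tuning of retro_lm.
            experience.append((trajectory, reflection, reward - prev_reward))
            prev_reward = reward
        return experience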
