Large Language Models can Implement Policy Iteration

(arXiv:2210.03821)
Published Oct 7, 2022 in cs.LG

Abstract

This work presents In-Context Policy Iteration, an algorithm for performing Reinforcement Learning (RL), in-context, using foundation models. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (either through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (either fine-tuning or training of adapter layers). Both of these techniques have drawbacks. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient techniques are inherently slow, sacrificing the "few-shot" quality that made in-context learning attractive to begin with. In this work, we present an algorithm, ICPI, that learns to perform RL tasks without expert demonstrations or gradients. Instead we present a policy-iteration method in which the prompt content is the entire locus of learning. ICPI iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. In order to eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our algorithm using Codex, a language model with no prior knowledge of the domains on which we evaluate it.

Overview

  • The paper explores an alternative RL approach that uses the in-context learning capabilities of LMs for policy iteration, avoiding expert demonstrations and gradient-based optimization.

  • In-context learning is used to iteratively update the prompt, which acts as both a world-model and a rollout-policy, so the policy improves through trial and error without gradients or demonstrations (a rough code sketch of this prompt-as-model idea follows this list).

  • A new method called In-Context Policy Iteration (ICPI) is introduced, which relies on prompt updates and Q-value estimates for policy improvement in RL tasks.

  • Empirical validation was performed on six RL tasks using pre-trained LLMs such as GPT-J, OPT-30B, and Codex variants, with larger models learning more consistently.

  • The study demonstrates the use of LLMs' in-context learning to iterate policies in RL, suggesting a move towards expert-agnostic approaches and the potential for broader and more complex applications.
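
To make the second and third bullets concrete, here is a minimal, hypothetical sketch of how a prompt might be assembled from buffered transitions so that a language model's completion serves either as a world-model (predicting reward, termination, and next state) or as a rollout-policy (proposing an action). The `Transition` record, the text format, and both helper functions are illustrative assumptions, not the paper's actual prompt templates.

```python
# Hypothetical prompt construction for ICPI-style in-context learning.
# The transition text format and helper names are assumptions for illustration,
# not the paper's exact templates.

from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    state: str       # text rendering of the observation
    action: str      # text rendering of the action taken
    reward: float
    done: bool
    next_state: str


def world_model_prompt(buffer: List[Transition], state: str, action: str) -> str:
    """List past transitions, then leave the newest one incomplete so the LM's
    completion predicts reward, termination, and next state."""
    lines = [
        f"{t.state} | {t.action} -> reward={t.reward}, done={t.done}, next={t.next_state}"
        for t in buffer
    ]
    lines.append(f"{state} | {action} ->")
    return "\n".join(lines)


def rollout_policy_prompt(buffer: List[Transition], state: str) -> str:
    """List past (state, action) pairs, then the current state, so the LM's
    completion proposes the rollout policy's next action."""
    lines = [f"{t.state} | {t.action}" for t in buffer]
    lines.append(f"{state} |")
    return "\n".join(lines)
```

Because the buffer is regenerated into the prompt at every step, updating the buffer is the only form of learning: no model weights change anywhere.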

Introduction

The integration of Reinforcement Learning (RL) with foundation models such as Language Models (LMs) has led to fascinating developments. Most existing strategies for applying foundation models to RL rely on either curated expert demonstrations or gradient-based adaptation (fine-tuning or adapter layers), both of which have inherent drawbacks. This paper presents an alternative approach that uses the in-context learning capabilities of LMs to perform policy iteration on RL tasks, eliminating the need for expert demonstrations or gradient-based optimization.

Related Work

Existing applications of foundation models to RL largely fall into two categories: those that rely on curated expert demonstrations and those that adapt the model with gradient-based methods such as fine-tuning or adapter layers. Demonstration-based methods are labor-intensive and rarely outperform the experts from whom the demonstrations were derived. Gradient-based methods, while powerful, sacrifice the few-shot quality that makes in-context learning attractive in the first place. The proposed method navigates these constraints using in-context learning alone, and its effectiveness is demonstrated across several LLMs on a variety of simple RL tasks.

Methodology

The presented In-Context Policy Iteration (ICPI) method iteratively updates the prompt content through trial-and-error interaction with the RL environment, inducing both a world-model and a rollout-policy solely through in-context learning. The policy is improved by acting greedily with respect to Q-value estimates obtained from LM-generated rollouts. This self-improvement loop allows ICPI to refine its policy iteratively, dispensing with the need for gradients and expert demonstrations. The paper also details how prompts are constructed and how Q-values are computed from the experience accumulated within the model's context window.
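
As a rough illustration of the loop just described, the sketch below estimates Q-values from LM-generated rollouts and acts greedily on them. The callables `world_model` and `rollout_policy` stand in for prompted LM completions (such as the prompt sketch shown after the Overview list); the environment interface, discount factor, and rollout horizon are simplified assumptions rather than the paper's exact configuration.

```python
# Minimal ICPI-style loop: estimate Q-values with LM-generated rollouts,
# act greedily, and fold the real transition back into the prompt buffer.
# The LM callables, environment interface, gamma, and horizon are assumptions.

from typing import Callable, List, Tuple

# (buffer, state, action) -> (reward, done, next_state), via a world-model prompt
WorldModel = Callable[[List, str, str], Tuple[float, bool, str]]
# (buffer, state) -> action, via a rollout-policy prompt
RolloutPolicy = Callable[[List, str], str]


def estimate_q(state: str, action: str, world_model: WorldModel,
               rollout_policy: RolloutPolicy, buffer: List,
               gamma: float = 0.9, horizon: int = 8) -> float:
    """Monte-Carlo Q estimate from a single rollout simulated entirely by the LM."""
    q, discount, s, a = 0.0, 1.0, state, action
    for _ in range(horizon):
        reward, done, s = world_model(buffer, s, a)   # world-model step
        q += discount * reward
        discount *= gamma
        if done:
            break
        a = rollout_policy(buffer, s)                 # rollout-policy step
    return q


def icpi_episode(env, actions: List[str], world_model: WorldModel,
                 rollout_policy: RolloutPolicy, buffer: List) -> List:
    """One episode: act greedily with respect to LM-estimated Q-values, then
    append the real transition to the buffer that populates future prompts."""
    state, done = env.reset(), False
    while not done:
        q_values = [estimate_q(state, a, world_model, rollout_policy, buffer)
                    for a in actions]
        action = actions[q_values.index(max(q_values))]   # policy improvement
        next_state, reward, done = env.step(action)       # real environment step
        buffer.append((state, action, reward, done, next_state))
        state = next_state
    return buffer
```

Acting greedily with respect to Q-values that are themselves computed under the current, prompt-defined policy is what makes this a policy-iteration scheme: each round of interaction both evaluates the current policy and improves it.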

Experiments and Results

The approach was empirically validated on six illustrative RL tasks, demonstrating that ICPI can learn policies rapidly. Several pre-trained LLMs, including GPT-J, OPT-30B, and variants of Codex, were tested to investigate the impact of model size and domain knowledge. The experiments revealed that only the larger models, in particular the code-davinci-002 variant of Codex, demonstrated learning consistently. The models' ability to generate rollouts that respected each task's dynamics was crucial to learning success.

Conclusion

The paper marks a notable step for RL by leveraging the in-context learning capabilities of LLMs to iterate on policies without expert demonstrations or updates to model parameters. It offers an architecture- and expert-agnostic approach to RL, highlighting the potential of large models to generalize and adapt to diverse RL tasks. The empirical results are preliminary, but the concept points to a promising avenue that leverages the ever-increasing capabilities of foundation models, opening the door to more complex and varied applications as LLMs evolve.
