Sequential Planning in Large Partially Observable Environments guided by LLMs

(2312.07368)
Published Dec 12, 2023 in cs.AI and cs.RO

Abstract

Sequential planning in large state and action spaces quickly becomes intractable due to combinatorial explosion of the search space. Heuristic methods such as Monte Carlo tree search, though effective for large state spaces, struggle when the action space is large. Pure reinforcement learning methods, relying only on reward signals, need prohibitively many interactions with the environment to devise a viable plan. If the state space, observations, and actions can be represented in natural language, then Large Language Models (LLMs) can be used to generate action plans. Recently, several such goal-directed agents, like Reflexion, CLIN, and SayCan, were able to surpass the performance of other state-of-the-art methods with minimal or no task-specific training. However, they still struggle with exploration and get stuck in local optima. Their planning capabilities are constrained by the limited reasoning ability of the foundational LLMs over text data. We propose a hybrid agent, "neoplanner", that synergizes state space search with queries to a foundational LLM to obtain the best action plan. Reward signals are used quantitatively to drive the search. A balance of exploration and exploitation is maintained by maximizing upper confidence bounds on the values of states. Where random exploration is needed, the LLM is queried to generate an action plan. Learnings from each trial are stored as entity relationships in text format and used in future queries to the LLM for continual improvement. Experiments in the ScienceWorld environment reveal a 124% improvement over the current best method in terms of average reward gained across multiple tasks.

Overview

  • The paper introduces 'neoplanner', a hybrid AI agent that combines traditional search with LLMs for planning in partially observable environments.

  • 'Neoplanner' uses POMDP models, RL principles, and LLM suggestions to iteratively build and refine an environmental model and action plans.

  • The agent maintains a graph-based state space, balancing exploration/exploitation and incorporating natural language processing.

  • Experimental results show that 'neoplanner' outperforms state-of-the-art methods in the ScienceWorld environment.

  • 'Neoplanner' demonstrates effective navigation in POMDPs by integrating LLMs' nuanced language abilities and state space exploration.

Introduction

Sequential planning in high-dimensional, partially observable environments poses significant challenges for artificial intelligence. Traditional search algorithms quickly become intractable when faced with large state and action spaces. Reinforcement learning (RL) approaches, while useful to some extent, are limited by their reliance on reward signals to develop viable plans. Generative AI approaches, and LLMs in particular, open new possibilities in this landscape by generating action plans through natural language, drawing on real-world knowledge acquired through extensive pre-training. However, their planning capabilities are constrained by limited reasoning about actions over text, and they struggle with exploration. This paper introduces "neoplanner," a hybrid agent that synergizes state space search with queries to foundational LLMs to formulate better action plans.

Approach

Neoplanner assesses the environment through the lens of a deterministic Partially Observable Markov Decision Process (POMDP) and leverages both RL and generative AI strategies. The agent iteratively builds a model of the environment and refines its understanding of state transitions and rewards. During this process, it uses LLMs to suggest action plans where random exploration is needed, progressively improving the quality of suggestions with each iteration through its growing text-encoded environmental knowledge.
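
As a concrete illustration of the text-encoded environmental knowledge described above, the sketch below shows one way per-trial learnings could be stored as entity-relationship triples and folded into subsequent LLM prompts. This is a minimal, hypothetical example: the helper names and the facts shown are illustrative and do not come from the paper's implementation.

```python
# Illustrative sketch (not the authors' code): storing per-trial learnings as
# entity-relationship triples in plain text and folding them into the next
# LLM prompt, mirroring the agent's continual-improvement loop.

def format_learnings(triples):
    """Render (entity, relation, entity) triples as text for the LLM prompt."""
    return "\n".join(f"{a} --{rel}--> {b}" for a, rel, b in triples)

def build_prompt(task, observation, learnings):
    """Assemble a query that combines the task, the current observation,
    and accumulated knowledge from earlier trials."""
    return (
        f"Task: {task}\n"
        f"Current observation: {observation}\n"
        f"Known facts from previous trials:\n{format_learnings(learnings)}\n"
        "Suggest the next sequence of actions."
    )

# Hypothetical example facts gathered in earlier trials
learnings = [
    ("thermometer", "is in", "kitchen"),
    ("kitchen door", "leads to", "hallway"),
]
print(build_prompt("measure the temperature of water",
                   "You are in the hallway.", learnings))
```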

State Space Graph-Based Planning

The developed agent maintains a balance between exploration and exploitation by constructing and updating a state space graph using observed interactions with the environment. It assigns values to states based on the learned policy and augments them with an exploration term, following the principle of Upper Confidence Bounds (UCB1). To address large unexplored state spaces, the agent relies on LLMs to fill in gaps in the action plan, thereby damping the search space explosion typically experienced with pure RL approaches.
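
For reference, the sketch below shows a minimal UCB1-style scoring function such a selection step could use, assuming each node in the state space graph tracks a value estimate and a visit count. The function and field names are illustrative assumptions, not taken from the paper.

```python
import math

def ucb1_score(node_value: float, node_visits: int,
               parent_visits: int, c: float = math.sqrt(2)) -> float:
    """State value augmented with a UCB1 exploration bonus.

    node_value    : current value estimate of the state (exploitation term)
    node_visits   : number of times this state has been visited
    parent_visits : visits to the parent state in the state space graph
    c             : exploration constant (sqrt(2) in the classic UCB1 formula)
    """
    if node_visits == 0:
        return float("inf")  # always try unvisited states first
    return node_value + c * math.sqrt(math.log(parent_visits) / node_visits)

def select_successor(successors, parent_visits):
    """Pick the successor state that maximizes the UCB1 score."""
    return max(
        successors,
        key=lambda s: ucb1_score(s["value"], s["visits"], parent_visits),
    )

# Hypothetical usage with two candidate successor states
candidates = [{"value": 0.4, "visits": 3}, {"value": 0.2, "visits": 1}]
print(select_successor(candidates, parent_visits=4))
```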

Experimental Results

Experiments conducted within the ScienceWorld environment show that neoplanner achieves a substantial increase in performance over existing state-of-the-art methods. Using natural language to express the state space and the available actions allows the LLM within the agent to craft relatively accurate shallow plans, which are then honed through actual interaction with the environment.

Conclusion

Neoplanner's hybrid design affords it advantages not present in pure RL or LLM-based methods alone. By effectively combining the global optimization capabilities inherent in state space exploration with the nuanced natural language processing abilities of LLMs, neoplanner succeeds in navigating vast POMDPs efficiently. The continuous integration of experiential knowledge further sharpens the agent's capacity to formulate and execute sophisticated plans, demonstrating a remarkable improvement over current techniques and solidifying its potential for complex problem-solving in dynamic environments.
