Emergent Mind

Abstract

Planning has been part of the core pursuit of artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by LLMs have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that current language agents are not yet capable of handling such complex planning tasks: even GPT-4 only achieves a success rate of 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.

TravelPlanner tasks language agents with collecting information via tools and creating plans that meet user needs and commonsense constraints.

Overview

  • TravelPlanner is introduced as a benchmark for evaluating language agents' planning abilities, especially in travel planning contexts.

  • The benchmark makes use of almost 4 million data records and evaluates agents against planning intents with both explicitly and implicitly stated constraints.

  • Large language models such as GPT-3.5, GPT-4, and Gemini have shown progress but still struggle with TravelPlanner's complex tasks.

  • Agents often fail to create a plan that meets all constraints; even GPT-4 achieves a success rate of only 0.6% on the TravelPlanner benchmark.

  • The paper calls for future research to improve language agents’ planning competencies to match the capabilities of human planners.

Introduction

The development of AI agents capable of human-like planning has been a longstanding goal in the field of AI. TravelPlanner is introduced as a benchmark to assess the capabilities of language agents in complex real-world planning scenarios, particularly focusing on travel planning. This benchmark evaluates agents' performances against meticulously curated planning intents and their ability to utilize nearly four million data records across various tools.

Related Work

Recent breakthroughs have seen LLMs playing a pivotal role in improving language agents. The emergence of models such as GPT-3.5, GPT-4, and Gemini has endowed these agents with capabilities such as advanced memory, tool use, and strategic planning, leading to significant enhancements in their general problem-solving abilities. Studies demonstrate these agents' ability to combine long-term parametric memory with short-term working memory and to interact with external environments through API calls.

TravelPlanner: A Novel Planning Benchmark

TravelPlanner centers around the theme of travel planning – a multidimensional task involving long-horizon predictions and numerous constraints like budget and accommodation preferences. The benchmark challenges agents to construct multi-day itineraries that adhere to a combination of explicit and implicit constraints. The complexity of TravelPlanner is further increased by incorporating environmental dynamics that require the agent to adjust plans according to real-time feedback.
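To make the constraint types concrete, here is a minimal sketch of how a multi-day itinerary might be validated against one hard constraint (budget) and one commonsense-style constraint (no repeated restaurants). This is illustrative only, not the benchmark's actual evaluator; the plan schema and field names are hypothetical.

```python
# Hypothetical plan schema: one dict per day, with a cost and a list of meals.

def check_budget(plan, budget):
    """Hard constraint: total trip cost must not exceed the user's budget."""
    total = sum(day["cost"] for day in plan)
    return total <= budget

def check_no_repeat_restaurants(plan):
    """Commonsense-style constraint: avoid visiting the same restaurant twice."""
    seen = set()
    for day in plan:
        for meal in day["meals"]:
            if meal in seen:
                return False
            seen.add(meal)
    return True

plan = [
    {"cost": 420, "meals": ["Cafe A", "Bistro B"]},
    {"cost": 380, "meals": ["Diner C", "Cafe A"]},  # repeats Cafe A
]

print(check_budget(plan, budget=1000))    # True: 800 <= 1000
print(check_no_repeat_restaurants(plan))  # False: Cafe A appears twice
```

A plan "succeeds" only when every such check passes at once, which is what makes long-horizon planning with many interacting constraints hard for current agents.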

Evaluation Findings

Empirical evaluations using different models like GPT-4 and various planning strategies demonstrate a significant gap between current LLMs' capacities and the requirements of TravelPlanner. Even sophisticated agents struggle, with a success rate of merely 0.6%, indicating that while the agents can handle some constraints, they often fail to synthesize a coherent plan that adheres to all at once. Common failure modes include errors in utilizing tools effectively, getting stuck in loops of repeated invalid actions, and producing hallucinated responses when information is missing or confusing.
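The gap between partial constraint satisfaction and full success can be made concrete by how pass rates aggregate. The sketch below contrasts a per-constraint (micro) rate with a strict per-plan (macro) rate in the spirit of the benchmark's evaluation; the exact metric definitions in TravelPlanner may differ, and the sample data is invented.

```python
# Each plan maps to a list of booleans, one per evaluated constraint.
results = [
    [True, True, False],   # plan 1: fails one constraint
    [True, True, True],    # plan 2: passes all
    [False, False, True],  # plan 3: fails two
]

# Micro rate: fraction of individual constraints satisfied across all plans.
micro = sum(sum(r) for r in results) / sum(len(r) for r in results)

# Macro rate: fraction of plans satisfying *all* constraints at once --
# the strict all-or-nothing criterion behind figures like the 0.6% success rate.
macro = sum(all(r) for r in results) / len(results)

print(f"micro: {micro:.2f}")  # 0.67 -- agents get many constraints right...
print(f"macro: {macro:.2f}")  # 0.33 -- ...but far fewer plans pass entirely
```

This is why an agent can look competent constraint-by-constraint yet almost never produce a fully valid plan.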

Conclusion

TravelPlanner exemplifies a rigorous test for examining language agents' capacity for context-aware planning under realistic and complex constraints. The benchmark signifies the nascent stage of AI's planning abilities and underscores the considerable work necessary to reach human-level adeptness in planning tasks. Future research inspired by TravelPlanner will be instrumental in exploring more sophisticated strategies, enabling agents to handle multiple constraints and long-horizon tasks with the finesse of human planners.
