
Tree Search for Language Model Agents

arXiv:2407.01476 · Published Jul 1, 2024 in cs.AI, cs.CL, and cs.LG

Abstract

Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at https://jykoh.com/search-agents.

Figure: VWA shopping task #96, where search enables the agent to prune trajectories and succeed where the baseline fails.

Overview

  • The paper introduces a novel best-first tree search algorithm tailored for language model (LM) agents, enabling enhanced multi-step planning and exploration in realistic interactive web environments.

  • The proposed method integrates a model-based value function to guide the search process, utilizing multimodal LMs for finer-grained decision-making based on an agent’s observations.

  • Empirical evaluations demonstrate significant performance improvements on benchmarks like VisualWebArena (VWA) and WebArena (WA), with relative success rate increases of 39.7% and 28.0%, respectively, when applied to a GPT-4o agent.

Tree Search for Language Model Agents

The publication "Tree Search for Language Model Agents" by Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov explores a novel inference-time search algorithm for enhancing the decision-making capabilities of language model (LM) agents, particularly in interactive web environments. This approach addresses key limitations of LMs in handling multi-step reasoning, planning, and effectively using environmental feedback.

Key Contributions

  1. Best-First Tree Search: The authors propose the first tree search algorithm tailored for LM agents that operates within the actual environment space. This search algorithm enhances exploration and multi-step planning by constructing, exploring, and pruning intermediate states and possible solutions dynamically during inference.

  2. Value Function Integration: The proposed method incorporates a model-based value function to guide best-first search. The value function marginalizes over reasoning chains of a multimodal LM conditioned on the agent's observations, providing finer-grained scores that effectively guide the search.

  3. Demonstrated Effectiveness: Evaluated on the challenging VisualWebArena (VWA) and WebArena (WA) benchmarks, the tree search algorithm significantly boosts the performance of language model agents. Specifically, the application of the search algorithm on top of a GPT-4o agent results in a 39.7% relative increase in success rate on VWA and a 28.0% relative improvement on WA.

Numerical Results

  • VisualWebArena (VWA):
      • Baseline GPT-4o + SoM agent success rate: 18.9%
      • With search algorithm applied: 26.4%
      • Relative improvement: 39.7%
  • WebArena (WA):
      • Baseline GPT-4o agent success rate: 15.0%
      • With search algorithm applied: 19.2%
      • Relative improvement: 28.0%
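The relative improvements above follow directly from the reported success rates. A quick sanity check of the arithmetic (the function name is ours, not from the paper):

```python
def relative_improvement(baseline: float, with_search: float) -> float:
    """Return the relative success-rate increase as a percentage."""
    return (with_search - baseline) / baseline * 100

vwa = relative_improvement(18.9, 26.4)  # VisualWebArena
wa = relative_improvement(15.0, 19.2)   # WebArena
print(round(vwa, 1), round(wa, 1))     # 39.7 28.0
```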

Detailed Methodology

Search Algorithm:

  • Initialization: The search starts from a given state and is parameterized by a maximum depth (d), a branching factor (b), and a search budget (c).
  • Iteration and node expansion: At each iteration, the algorithm uses the value function to evaluate the current state, then expands it into child states by executing actions proposed by the LM in the environment and computing each resulting state's value score.
  • Best-first search: The algorithm maintains a priority queue over frontier states and continually updates the best state found according to the value function, stopping when the budget is exhausted or a near-optimal state is identified.
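The loop above can be sketched as a generic best-first search. This is a minimal illustration, not the authors' implementation: `expand` stands in for the LM proposing and executing actions, `value` for the learned value function, and the early-stop threshold of 1.0 is an assumption.

```python
import heapq
import itertools

def best_first_search(root, expand, value, d=5, b=5, c=20):
    """Best-first tree search over environment states.

    expand(state) -> candidate child states (truncated to b per node)
    value(state)  -> estimated task reward in [0, 1]
    d: max depth, b: branching factor, c: node-expansion budget
    """
    counter = itertools.count()  # tie-breaker so the heap never compares states
    v0 = value(root)
    frontier = [(-v0, next(counter), root, 0)]  # max-heap via negated scores
    best_state, best_score = root, v0
    expansions = 0
    while frontier and expansions < c:
        neg_score, _, state, depth = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_score, best_state = -neg_score, state
        if best_score >= 1.0:  # near-optimal state found: stop early
            break
        if depth >= d:         # depth limit reached for this branch
            continue
        for child in expand(state)[:b]:
            heapq.heappush(frontier, (-value(child), next(counter), child, depth + 1))
        expansions += 1
    return best_state, best_score

# Toy usage: states are integers, the "task" is to reach 7.
expand = lambda s: [2 * s, 2 * s + 1]
value = lambda s: 1.0 if s == 7 else 1.0 / (1 + abs(s - 7))
state, score = best_first_search(1, expand, value, d=4, b=2, c=50)
print(state, score)  # 7 1.0
```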

Value Function:

  • The value function f_v estimates the expected reward of the current state by considering the current and previous observations alongside the task instructions. It utilizes a multimodal LM to process the visual and textual information present in these observations.
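One way to realize this marginalization, sketched under our own assumptions (the `Score:` output format and the sampling interface are placeholders, not the paper's prompt): sample several chain-of-thought completions from the LM for the same state and average the scores they end with.

```python
import re

def value_function(sample_completion, n=20):
    """Estimate a state's value by averaging over n sampled reasoning chains.

    sample_completion() is assumed to return one chain-of-thought string
    ending with a line like "Score: 0.8" (a hypothetical prompt format).
    """
    scores = []
    for _ in range(n):
        chain = sample_completion()
        match = re.search(r"Score:\s*([01](?:\.\d+)?)", chain)
        if match:
            scores.append(float(match.group(1)))
    # Marginalize over reasoning chains by averaging their final scores.
    return sum(scores) / len(scores) if scores else 0.0

# Usage with a stubbed sampler standing in for a multimodal LM:
fake = iter(["...reasoning...\nScore: 1.0", "...\nScore: 0.5", "...\nScore: 0.0"])
v = value_function(lambda: next(fake), n=3)
print(v)  # 0.5
```

Averaging over sampled chains yields finer-grained scores than a single greedy judgment, which is what makes the value function useful for ranking frontier states.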

Theoretical and Practical Implications

The proposed tree search algorithm enhances the robustness and efficiency of LM agents in performing web-based tasks, making them capable of handling complex, multi-step planning scenarios more effectively. This can lead to practical improvements in applications such as web automation and user-interactive AI systems. Moreover, the compatibility of the method with existing LM agents without requiring retraining facilitates easy integration.

Speculations on Future Developments

The success of this search algorithm opens avenues for further exploration into scalable and efficient search strategies tailored for LMs. Future research could focus on optimizing the computational efficiency of the search process, integrating more sophisticated value functions, and applying the methodology to other domains requiring complex decision-making and planning capabilities.

Potential developments could also include:

  • Enhanced world models that simulate possible outcomes to reduce the need for extensive real-world exploration.
  • Improved search heuristics that dynamically adjust search parameters based on task complexity.
  • Expanding the applicability of tree search to offline settings and tasks with offline evaluators.

In summary, the publication presents a compelling and methodically robust approach that significantly advances the capabilities of language model agents in interactive environments through an innovative tree search algorithm. This work not only showcases substantial empirical improvements but also sets a strong foundation for future research in AI planning and decision-making in intricate settings.
