WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

(2401.13919)
Published Jan 25, 2024 in cs.CL and cs.AI

Abstract

The rapid advancement of LLMs has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents.

WebVoyager automates web browsing tasks, analyzing content and images to return specific requested information.

Overview

  • The paper presents WebVoyager, a web agent leveraging Large Multimodal Models for interactive task completion on the web.

  • WebVoyager combines visual and textual analysis to navigate and interact with websites in a human-like manner.

  • The agent bases each decision on a screenshot plus key text from HTML elements, and keeps only recent interaction history in its context.

  • A new benchmark of real-world tasks drawn from 15 popular websites is used to evaluate WebVoyager on live sites rather than in simulators or static snapshots.

  • The findings suggest that multimodal web agents outperform text-only methods, highlighting the potential for more advanced, autonomous web applications.

Introduction

The paper "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models" addresses web agents that can complete real-world tasks autonomously. Earlier agents typically handled a single input modality and were evaluated only in simplified simulators or on static web snapshots, which limited their usefulness on real sites. WebVoyager instead uses a Large Multimodal Model (LMM) to interpret and interact with live websites much as a human would, relying on both the rendered page and its text.

Methodology

WebVoyager bases its decisions on both the visual and the textual content of a web page. At each step it analyzes a screenshot in which interactive elements are overlaid with numerical labels, and uses that screenshot to choose which element to act on next. Key text from the corresponding HTML elements supplements the visual cues. Context clipping retains only the most recent screenshots, preserving relevant history without cluttering the agent's context. The agent's browsing capabilities are exposed as a fixed set of executable actions, including clicking, typing, scrolling, and waiting. The model must emit each action in a prescribed format so that it can be parsed consistently and unintended interactions are avoided.
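The context clipping and fixed action format described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Step` structure, the default of three retained screenshots, and the action grammar (`Click [label]`, `Type [label]; content`, `Scroll up|down`, `Wait`) are all assumptions made for illustration, since this summary does not specify them.

```python
import re
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One past interaction (hypothetical structure, for illustration only)."""
    text: str                    # the agent's thought and action, always kept
    screenshot: Optional[bytes]  # labeled screenshot, dropped once it is old


def clip_context(history: List[Step], max_screenshots: int = 3) -> List[Step]:
    """Keep screenshots only for the last `max_screenshots` steps.

    The default of 3 is an assumption, not a value taken from this summary.
    """
    if max_screenshots < 0:
        raise ValueError("max_screenshots must be non-negative")
    cutoff = len(history) - max_screenshots
    return [
        replace(step, screenshot=None) if i < cutoff else step
        for i, step in enumerate(history)
    ]


# Assumed action grammar; the real prompt format may differ.
_ACTION_PATTERNS = {
    "click": re.compile(r"Click \[(?P<label>\d+)\]"),
    "type": re.compile(r"Type \[(?P<label>\d+)\]; (?P<content>.+)"),
    "scroll": re.compile(r"Scroll (?P<direction>up|down)"),
    "wait": re.compile(r"Wait"),
}


def parse_action(line: str) -> dict:
    """Parse one action line, rejecting anything outside the grammar."""
    line = line.strip()
    for kind, pattern in _ACTION_PATTERNS.items():
        match = pattern.fullmatch(line)
        if match:
            # Captured groups are returned as strings, e.g. label "12".
            return {"kind": kind, **match.groupdict()}
    raise ValueError(f"unrecognized action: {line!r}")
```

Rejecting any line that does not match the grammar in full is one way to enforce a strict format: a malformed model output raises an error instead of triggering an unintended click or keystroke.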

Evaluation

To assess WebVoyager, the authors construct a benchmark of real-world tasks drawn from 15 popular websites. The benchmark is designed to evaluate end-to-end task completion on live sites. Unlike earlier benchmarks built on simplified simulators or static snapshots, it measures how well the agent navigates autonomously online, capturing the dynamic, open-ended nature of web exploration. Task success is judged by human evaluators. The authors also propose GPT-4V as an automatic evaluator, which reaches 85.3% agreement with human judgment and offers a more scalable alternative for future work.
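The two quantities used in this evaluation, task success rate and evaluator agreement with human judgment, reduce to simple proportions. The sketch below shows one plausible way to compute them from per-task boolean verdicts. The function names and the toy verdict lists are hypothetical and are not data from the paper.

```python
from typing import Sequence


def success_rate(verdicts: Sequence[bool]) -> float:
    """Fraction of tasks judged successful."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    return sum(verdicts) / len(verdicts)


def agreement_rate(human: Sequence[bool], auto: Sequence[bool]) -> float:
    """Fraction of tasks on which the two evaluators give the same verdict."""
    if len(human) != len(auto):
        raise ValueError("both evaluators must judge the same tasks")
    if not human:
        raise ValueError("verdicts must not be empty")
    matches = sum(h == a for h, a in zip(human, auto))
    return matches / len(human)


# Made-up verdicts for four tasks, for illustration only.
human_verdicts = [True, True, False, False]
auto_verdicts = [True, False, False, False]

print(success_rate(human_verdicts))                   # 0.5
print(agreement_rate(human_verdicts, auto_verdicts))  # 0.75
```

Raw agreement does not correct for matches that would occur by chance, so it is best read alongside the underlying success rates.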

Findings and Future Work

WebVoyager achieves a 59.1% task success rate on the benchmark, significantly surpassing both its text-only variant and GPT-4 (All Tools). This suggests that multimodal input gives a substantial advantage in open-ended web navigation. The error breakdown points to several areas for improvement: the agent getting stuck during navigation, visual grounding errors, hallucinated responses, and prompt misalignment. Further research could explore improved fusion techniques for visual and textual data and greater robustness to the complexity of the ever-evolving web.

In conclusion, WebVoyager shows that an LMM-powered agent can complete open-ended tasks on live websites end to end, a step toward more autonomous and versatile web-based applications.
