WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models (2401.13919v4)

Published 25 Jan 2024 in cs.CL and cs.AI

Abstract: The rapid advancement of LLMs has led to a new era marked by the development of autonomous applications in real-world scenarios, which drives innovation in creating advanced web agents. Existing web agents typically only handle one input modality and are evaluated only in simplified web simulators or static web snapshots, greatly limiting their applicability in real-world scenarios. To bridge this gap, we introduce WebVoyager, an innovative Large Multimodal Model (LMM) powered web agent that can complete user instructions end-to-end by interacting with real-world websites. Moreover, we establish a new benchmark by compiling real-world tasks from 15 popular websites and introduce an automatic evaluation protocol leveraging multimodal understanding abilities of GPT-4V to evaluate open-ended web agents. We show that WebVoyager achieves a 59.1% task success rate on our benchmark, significantly surpassing the performance of both GPT-4 (All Tools) and the WebVoyager (text-only) setups, underscoring the exceptional capability of WebVoyager. The proposed automatic evaluation metric achieves 85.3% agreement with human judgment, indicating its effectiveness in providing reliable and accurate assessments of web agents.

Summary

The paper introduces WebVoyager, a novel web agent that integrates visual and textual modalities to perform autonomous web tasks.
The methodology employs context clipping, numerical screenshot labels, and HTML content analysis to enable robust decision-making in dynamic environments.
The evaluation shows that WebVoyager achieves higher task success rates than text-only agents, and that GPT-4V can serve as a scalable automatic evaluator.

Introduction

The paper "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models" addresses the construction of autonomous web agents that complete real-world tasks. Earlier web agents typically handled a single input modality and were evaluated only in simplified simulators or on static web snapshots, which limited their usefulness on live sites. WebVoyager instead uses a Large Multimodal Model (LMM) to interpret and interact with websites much as a human would, relying on both the rendered page and its text.

Methodology

WebVoyager bases its decisions on both the visual and the textual content of a web page. At each step it receives a screenshot in which interactive elements are overlaid with numerical labels, and it uses those labels to choose which element to act on. Key text from the corresponding HTML elements supplements the screenshot, giving the model a fuller picture of the page. Context clipping keeps only the most recent screenshots in the prompt, so the agent retains relevant history without overloading its context. Browsing is reduced to a small set of executable actions, including clicking, typing, scrolling, and waiting. The model must emit these actions in a strict output format so that they can be parsed consistently and so that it does not produce interactions outside the allowed set.
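As a rough illustration of two ideas from this section, the sketch below clips older screenshots from an interaction history and parses a strictly formatted action line. It is not the authors' implementation. The Step type, the keep_last default of 3, the choice to keep every action text while dropping old screenshots, and the action grammar ("Click [12]", "Type [3]; text", and so on) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One interaction round (hypothetical type for this sketch)."""
    screenshot: Optional[bytes]  # labelled screenshot; None once clipped
    action_text: str             # raw action line emitted by the model


def clip_context(history: list[Step], keep_last: int = 3) -> list[Step]:
    """Keep screenshots only for the last `keep_last` steps.

    Action texts are kept for every step so the agent still sees what it
    did earlier. Both choices are assumptions, not taken from the paper.
    """
    cutoff = len(history) - keep_last
    return [
        Step(None if i < cutoff else step.screenshot, step.action_text)
        for i, step in enumerate(history)
    ]


# Hypothetical action grammar: one anchored regex per allowed action.
ACTION_PATTERNS = {
    "click": re.compile(r"^Click \[(\d+)\]$"),
    "type": re.compile(r"^Type \[(\d+)\]; (.+)$"),
    "scroll": re.compile(r"^Scroll \[(\d+|WINDOW)\]; (up|down)$"),
    "wait": re.compile(r"^Wait$"),
}


def parse_action(text: str) -> tuple[str, tuple[str, ...]]:
    """Return (action name, arguments), or raise ValueError if the line
    does not match any allowed format exactly."""
    line = text.strip()
    for name, pattern in ACTION_PATTERNS.items():
        match = pattern.match(line)
        if match:
            return name, match.groups()
    raise ValueError(f"unrecognised action: {text!r}")


if __name__ == "__main__":
    history = [Step(b"img", f"Click [{i}]") for i in range(5)]
    clipped = clip_context(history, keep_last=3)
    print([step.screenshot is not None for step in clipped])
    print(parse_action("Type [3]; wireless headphones"))
```

Rejecting anything that does not match an anchored pattern is one simple way to enforce a strict format: a malformed line is reported as an error instead of being executed as a guessed action.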

Evaluation

To assess WebVoyager, the authors construct a new benchmark of tasks drawn from 15 popular websites. The benchmark evaluates end-to-end task completion on live sites. Unlike earlier benchmarks built on simplified simulators or static snapshots, it measures whether the agent can navigate the open web on its own, which captures the dynamic, open-ended nature of real browsing. Human evaluators judge task success, and GPT-4V is proposed as an automatic evaluator; it reaches 85.3% agreement with human judgment, which makes it a scalable option for future evaluations.
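The two headline numbers in this kind of evaluation, task success rate and evaluator agreement, can be computed from per-task verdicts. The sketch below assumes one boolean verdict per task from each evaluator and uses made-up toy data. The function names are hypothetical, and this summary does not say exactly how the paper computes its agreement figure, so treat this as the simplest plausible reading.

```python
def success_rate(verdicts: list[bool]) -> float:
    """Fraction of tasks judged successful."""
    if not verdicts:
        raise ValueError("no verdicts given")
    return sum(verdicts) / len(verdicts)


def agreement_rate(human: list[bool], auto: list[bool]) -> float:
    """Fraction of tasks on which the two evaluators give the same verdict."""
    if len(human) != len(auto):
        raise ValueError("verdict lists must have the same length")
    if not human:
        raise ValueError("no verdicts given")
    matches = sum(h == a for h, a in zip(human, auto))
    return matches / len(human)


if __name__ == "__main__":
    # Toy verdicts for four tasks; not data from the paper.
    human_verdicts = [True, False, True, True]
    auto_verdicts = [True, True, True, False]
    print(success_rate(human_verdicts))                    # 0.75
    print(agreement_rate(human_verdicts, auto_verdicts))   # 0.5
```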

Findings and Future Work

WebVoyager achieves a 59.1% task success rate on the benchmark, significantly surpassing both its text-only variant and GPT-4 (All Tools). This suggests that multimodal input gives a substantial advantage when navigating live websites. The error breakdown points to areas for improvement: the agent getting stuck during navigation, visual grounding errors, hallucinated answers, and prompt misalignment. Further research could explore better ways to combine visual and textual information and greater robustness to the complexity of an ever-changing web.

In conclusion, WebVoyager shows that an LMM-powered agent can complete tasks end-to-end on real websites, a step toward more autonomous, intuitive, and versatile web-based applications.

Tweets

https://twitter.com/_akhaliq/status/1750707472703045724

https://twitter.com/arankomatsuzaki/status/1750708523455918494

https://twitter.com/LangChainAI/status/1754918056273576340

https://twitter.com/gwolfadam/status/1754938205034709368

https://twitter.com/BrianRoemmele/status/1750895067739677070

https://twitter.com/sebkrier/status/1753937763089465528
