
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

(2401.13649)
Published Jan 24, 2024 in cs.LG, cs.CL, and cs.CV

Abstract

Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic visually grounded tasks. VisualWebArena comprises a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data are publicly available at https://jykoh.com/vwa.

Benchmarking LLM and VLM agents on 910 diverse web navigation and action execution tasks.

Overview

  • The paper introduces VisualWebArena, a new benchmark with 910 tasks to evaluate multimodal agents on visually and textually complex web tasks across environments like classifieds, shopping, and forums.

  • It comprehensively evaluates state-of-the-art LLM and vision-language model (VLM) agents, identifying performance gaps and challenges, and introduces a new VLM agent, inspired by Set-of-Marks prompting, that outperforms text-only LLM agents.

  • Experiments show significant performance gains with multimodal agents, but highlight the need for future advancements in multimodal understanding and reasoning capabilities.

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Overview

"VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks" introduces a new benchmark designed to evaluate the performance of multimodal LLMs and vision-language models (VLMs) on tasks that require both visual and textual comprehension. The majority of existing benchmarks focus on text-based agents, disregarding the importance of visual information for many natural computer tasks. VisualWebArena aims to address this disparity by incorporating tasks necessitating the processing of image-text inputs, interpreting natural language instructions, and executing actions on visually rich web interfaces.

The benchmark comprises 910 tasks across diverse and visually complex environments such as classifieds, shopping, and social forums. The authors conduct an extensive evaluation of state-of-the-art LLM and VLM agents to highlight the performance gaps and challenges in current multimodal models. VisualWebArena promises to be a significant step towards developing stronger and more versatile autonomous agents capable of better mimicking real human-computer interactions.

Key Contributions

Introduction of VisualWebArena Benchmark:

  • Contains 910 tasks across three major web environments: Classifieds, Shopping, and Reddit.
  • Tasks are designed to be visually grounded, requiring thorough visual understanding and image-text input processing for completion.
  • Approximately 25% of tasks include specific input images that the agent must interpret (an illustrative task record is sketched after this list).
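
Concretely, each task pairs a natural-language intent with the site(s) it runs on, optional input images, and a programmatic evaluator that checks the final state. The record below is a minimal sketch of what such a task specification might look like; the field names and values are illustrative assumptions, not the benchmark's actual schema.

```python
# Illustrative sketch of a visually grounded task record.
# Field names and values are assumptions for exposition, not the benchmark's exact schema.
example_task = {
    "task_id": 42,                            # hypothetical identifier
    "sites": ["shopping"],                    # one of: classifieds, shopping, reddit
    "intent": "Add the item shown in the image to my cart.",
    "input_images": ["task_42_input.png"],    # roughly 25% of tasks supply such images
    "eval": {
        "eval_type": "program_html",          # hypothetical: programmatic check of the final page state
        "reference": "the pictured item appears in the cart",
    },
}
```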

Comprehensive Evaluation of State-of-the-Art Models:

  • The authors benchmark several state-of-the-art LLMs and VLMs, demonstrating their performance on visual and text-based tasks.
  • Evaluated models include GPT-4V, Gemini-Pro, and IDEFICS, among others, enabling analysis of their multimodal capabilities.
  • Identification of significant performance gaps between API-based VLMs and open-source VLM agents.

Development of a New VLM Agent:

  • Inspired by Set-of-Marks prompting, a preprocessing step annotates every interactable webpage element with a unique ID, simplifying the action space (a rough sketch of this idea follows this list).
  • Empirical results show that this agent outperforms the other evaluated agents, particularly on visually complex sites.
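
As a rough illustration of the Set-of-Marks idea (not the paper's actual implementation), the sketch below assigns each interactable element a numeric ID, renders a compact textual observation the agent can cite, and parses actions that refer back to those IDs. The `Element` structure and helper names are assumptions for exposition.

```python
# Minimal sketch of Set-of-Marks-style preprocessing over a simplified
# element representation; not the benchmark's actual code.
from dataclasses import dataclass

@dataclass
class Element:
    tag: str      # e.g. "button", "a", "input"
    text: str     # visible text or accessible name
    bbox: tuple   # (x, y, w, h) in screenshot coordinates

def annotate_with_marks(elements: list[Element]) -> tuple[str, dict[int, Element]]:
    """Assign a unique integer ID to each interactable element and build a
    compact textual observation such as '[1] <button> Add to Cart'."""
    id_to_element = {}
    lines = []
    for idx, el in enumerate(elements, start=1):
        id_to_element[idx] = el
        lines.append(f"[{idx}] <{el.tag}> {el.text}")
    return "\n".join(lines), id_to_element

def parse_action(action_str: str, id_to_element: dict[int, Element]) -> tuple[str, Element]:
    """Parse a simple action of the form 'click [3]' into (verb, target element)."""
    verb, bracketed = action_str.split(maxsplit=1)
    element_id = int(bracketed.strip("[]"))
    return verb, id_to_element[element_id]

# Example usage:
obs, lookup = annotate_with_marks([
    Element("button", "Add to Cart", (410, 220, 120, 40)),
    Element("a", "Reviews", (60, 480, 80, 20)),
])
print(obs)                               # two annotated lines, one per element
print(parse_action("click [1]", lookup)) # ('click', Element(tag='button', ...))
```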

Experimental Setup

The experiments were carried out in environments modeled as partially observable Markov decision processes (POMDPs). Agents navigate these environments and perform tasks using a defined set of actions such as clicking, typing, and scrolling. Observations include raw HTML, accessibility trees, webpage screenshots, and Set-of-Marks (SoM) annotated screenshots.
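
The resulting interaction loop can be pictured roughly as follows. The environment interface (`env.reset`, `env.step`) and the agent callable are hypothetical stand-ins for illustration, not the benchmark's actual API.

```python
# Rough sketch of the observe-act loop in a POMDP-style web environment.
# The environment and agent interfaces here are illustrative assumptions.

def run_episode(env, agent, max_steps: int = 30) -> float:
    """Roll out one task: the agent sees a partial observation of the page
    (e.g. a SoM-annotated screenshot plus the accessibility tree) and emits
    actions such as click, type, or scroll until it stops or times out."""
    obs = env.reset()                      # initial observation of the start page
    for _ in range(max_steps):
        action = agent(obs)                # e.g. "click [12]" or a typing action
        obs, reward, done = env.step(action)
        if done:                           # agent issued a stop action or the task ended
            return reward                  # 1.0 if the task's evaluator is satisfied
    return 0.0                             # ran out of steps without finishing
```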

Results

  • Performance of Text-Based LLMs:
      • The best-performing text-only LLM, GPT-4, achieved a success rate of 7.25%.
      • Text-based models see considerable improvement when augmented with image captions, with GPT-4's success rate increasing to 12.75% (a rough sketch of this caption-augmentation approach follows this list).
  • Importance of Multimodality:
      • The use of multimodal agents leads to substantial performance gains. For example, GPT-4V achieved an overall success rate of 15.05%.
      • Gemini-Pro's success rate increased from 3.85% (caption-augmented) to 6.04% (multimodal).
  • Effectiveness of Set-of-Marks Representation:
      • The SoM representation further improved GPT-4V's success rate from 15.05% to 16.37%, highlighting its potential for simplifying action spaces in visually dense environments.
  • Human Performance:
      • In comparison, human performance recorded a success rate of 88.7%, establishing a significant benchmark for autonomous agents.
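
The caption-augmentation baseline referenced above can be approximated as follows: images in the observation are replaced with textual captions so that a text-only LLM can reason about them. The `caption_model` and `llm` callables are hypothetical placeholders, not the specific models or prompts used in the paper.

```python
# Sketch of caption-augmenting a text-only LLM agent: images in the observation
# are converted to captions before prompting. `caption_model` and `llm` are
# hypothetical callables, not the paper's exact models or prompts.

def caption_augmented_step(llm, caption_model, intent, accessibility_tree, images):
    captions = [caption_model(img) for img in images]   # e.g. an off-the-shelf captioner
    prompt = (
        f"Objective: {intent}\n"
        + "".join(f"Image {i + 1} caption: {c}\n" for i, c in enumerate(captions))
        + f"Observation:\n{accessibility_tree}\n"
        + "Next action:"
    )
    return llm(prompt)    # returns an action string such as "click [5]"
```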

Implications and Future Directions

The findings underscore that existing models need considerable enhancement to effectively tackle visually grounded tasks. Low success rates even on relatively simple visual tasks suggest that future work should focus on stronger multimodal understanding and more sophisticated reasoning capabilities.

Theoretical Implications:

  • The results illustrate the necessity of integrating visual modalities with text for comprehensive task automation.
  • The research highlights the limitations of current models, encouraging advancements in multimodal fusion techniques and reasoning frameworks.

Practical Implications:

  • Entrepreneurs and developers looking to deploy AI agents on user interfaces gain insight into the current capabilities and limitations of state-of-the-art models.
  • The benchmark serves as a rigorous testbed for evaluating and developing future LLM and VLM models for real-world applications.

Future Developments:

  • Enhancing OCR capabilities within VLMs.
  • Addressing failure modes such as redundant, repeated actions and premature termination.
  • Fine-tuning existing LLMs on interaction trajectories to improve their abilities as web agents.
  • Developing more sophisticated history-tracking mechanisms to better manage complex, multistep tasks.

Conclusion: VisualWebArena represents a crucial addition to the evaluation of multimodal agents, bridging the gap between visual and textual processing capabilities. The benchmark and corresponding results challenge researchers to innovate and improve autonomous agents, making significant strides towards human-like AI for visually grounded web tasks.
