VideoGUI: A Benchmark for GUI Automation from Instructional Videos

(2406.10227)
Published Jun 14, 2024 in cs.CV and cs.AI

Abstract

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance in clicking, dragging, typing, and scrolling for atomic action execution. Our evaluation on VideoGUI reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks, especially for high-level planning.

Figure: the VideoGUI creation pipeline, covering video selection, skill replication, task-element annotation, and manual data validation.

Overview

  • The paper introduces VideoGUI, a benchmark for evaluating GUI automation systems on tasks sourced from instructional videos, targeting complex professional software that demands advanced visual reasoning.

  • VideoGUI employs a three-tier hierarchical framework to assess different aspects of GUI automation, including high-level planning from visual inputs, detailed action instruction generation, and accurate execution of atomic GUI actions.

  • Evaluations of advanced multimodal models such as GPT-4o reveal significant performance gaps on visually intensive tasks: models plan far better from text-based queries than from visual conditions, pointing to visual-centric planning as a key direction for future multimodal research.

An Overview of "VideoGUI: A Benchmark for GUI Automation from Instructional Videos"

The paper "VideoGUI: A Benchmark for GUI Automation from Instructional Videos" introduces a novel benchmark designed to enhance the evaluation of Graphical User Interface (GUI) automation systems. This benchmark, named VideoGUI, is derived from high-quality instructional videos and focuses on complex tasks that necessitate sophisticated procedural knowledge and advanced visual understanding.

Key Contributions

Benchmark Design

VideoGUI sets itself apart by targeting complex and professional software environments, including but not limited to Adobe Photoshop, Premiere Pro, and Stable Diffusion WebUI. The task formulations extend beyond simple text-based instructions, requiring substantial visual reasoning. The benchmark is constructed using a three-tier hierarchical framework:

  1. High-level Planning: This involves reconstructing the sequence of procedural subtasks from visual contexts alone, without linguistic descriptions. This level reinforces the need for visual understanding in planning GUI tasks.
  2. Middle-level Planning: At this stage, the task is to generate detailed action instructions based on visual states and specific goals.
  3. Atomic Action Execution: The final tier evaluates the accuracy of executing basic GUI actions such as clicking, dragging, typing, and scrolling.

Each level of the hierarchy is meticulously designed to capture distinct aspects of GUI automation, from general planning to fine-grained action execution.
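
To make the hierarchy concrete, the sketch below shows one way a VideoGUI-style task could be represented in code, with high-level subtasks decomposed into mid-level narrations and atomic actions. The class names, fields, and the example task are illustrative assumptions for exposition, not the benchmark's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class AtomicAction:
    """One primitive GUI action (assumed schema): click, drag, type, or scroll."""
    kind: str                                  # "click" | "drag" | "type" | "scroll"
    target: Optional[Tuple[int, int]] = None   # (x, y) for a click or drag start
    end: Optional[Tuple[int, int]] = None      # drag end point, if any
    text: str = ""                             # payload for typing
    amount: int = 0                            # scroll amount, if any

@dataclass
class MidLevelStep:
    """Mid-level planning unit: an action narration plus the atomic actions realizing it."""
    narration: str
    actions: List[AtomicAction] = field(default_factory=list)

@dataclass
class GUITask:
    """High-level task: a visual goal (e.g., a target screenshot) and its ordered subtasks."""
    goal_preview: str
    subtasks: List[MidLevelStep] = field(default_factory=list)

# Hypothetical example, loosely in the spirit of the benchmark's slide-editing tasks.
task = GUITask(
    goal_preview="target_slide.png",
    subtasks=[
        MidLevelStep(
            narration="Insert a text box and type the title",
            actions=[
                AtomicAction(kind="click", target=(640, 52)),
                AtomicAction(kind="type", text="Quarterly Report"),
            ],
        ),
    ],
)
```

A model can then be probed at each level separately: recovering the subtask order from the goal preview, generating the narrations given a screenshot, and producing the atomic actions themselves.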

Evaluation and Metrics

The benchmark introduces comprehensive evaluation metrics tailored for each hierarchical level:

  • High-level Planning: Assessed by the ability to deduce procedural task steps purely from visual input.
  • Middle-level Planning: Evaluates the precision in generating step-by-step action descriptions.
  • Atomic Action Execution: Fine-grained metrics measure clicking accuracy (Dist and Recall@d), dragging, typing precision, and correct scrolling; a minimal sketch of the click metrics follows this list.
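
As a concrete reading of the click metrics, the sketch below computes the Euclidean distance between a predicted click and the annotated target center, and Recall@d as the fraction of predictions landing within d pixels of their targets. The exact normalization and thresholds used in the paper may differ; treat this as a simplified, assumed interpretation of Dist and Recall@d.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def click_distance(pred: Point, gt: Point) -> float:
    """Euclidean distance (Dist) between a predicted click and the ground-truth center."""
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1])

def recall_at_d(preds: List[Point], gts: List[Point], d: float) -> float:
    """Fraction of predicted clicks within d pixels of their targets (assumed Recall@d)."""
    hits = sum(click_distance(p, g) <= d for p, g in zip(preds, gts))
    return hits / len(gts) if gts else 0.0

# Hypothetical usage: three predicted clicks scored against annotated element centers.
preds = [(130.0, 244.0), (612.0, 98.0), (900.0, 540.0)]
gts = [(128.0, 240.0), (600.0, 100.0), (700.0, 520.0)]
print(recall_at_d(preds, gts, d=20))  # two of three clicks fall within 20 px -> ~0.67
```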

Empirical Results

In evaluating state-of-the-art large multimodal models (LMMs) like GPT-4o, the benchmark reveals several key insights:

  • Performance Gaps: Even the most advanced LMMs exhibit notably poor performance on visually intensive tasks, particularly at the high-level planning stage. For instance, GPT-4o struggles to complete a single full task proficiently.
  • Bottlenecks in Planning: The results indicate that planning, rather than action execution, is the primary challenge; models perform markedly better when given textual queries than visual conditions, underscoring the difficulty of visual-centric planning.

Practical and Theoretical Implications

Practically, the findings from VideoGUI suggest a need for more sophisticated models that can handle detailed visual reasoning and task planning. The benchmark's focus on professional software tasks highlights the potential efficiency gains through improved GUI automation in real-world applications like video editing and graphic design.

Theoretically, VideoGUI provides a fertile ground for advancing research in multimodal learning, particularly in integrating visual and textual cues for complex task completion. The rich, multi-level annotations and the diverse task set push the boundaries of current AI capabilities, suggesting directions for future research focusing on enhancing visual understanding and hierarchical planning in AI systems.

Future Directions

VideoGUI sets a new standard for evaluating GUI automation and provides a comprehensive dataset that can spur future research in AI. Possible future developments include:

  • Enhanced Visual Understanding: Research could focus on more robust models capable of understanding and planning from visual previews, potentially integrating techniques from computer vision and symbolic AI.
  • Improved Multimodal Fusion: Developing models that can effectively balance visual and textual information should improve performance on benchmarks like VideoGUI.
  • Application-Specific Optimization: Tailoring models to excel in specific applications such as video editing or digital painting could yield more practical automation tools and enhance user productivity.

In conclusion, the introduction of VideoGUI marks a significant milestone in the evaluation of GUI automation systems. By focusing on instructional videos and professional tasks that require a blend of visual and procedural understanding, the benchmark offers a rigorous platform for testing and advancing the frontiers of multimodal AI capabilities.
