
GUICourse: From General Vision Language Models to Versatile GUI Agents

(arXiv:2406.11317)
Published Jun 17, 2024 in cs.AI, cs.CL, cs.CV, and cs.HC

Abstract

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze how different variations in this agent's training stage affect performance through an ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.

Qwen-VL-Chat struggles with OCR and GUI tasks; Qwen-GUI excels in these areas.

Overview

  • The paper introduces GUICourse, a dataset suite designed to enhance Vision Language Models (VLMs) for GUI navigation, addressing challenges like OCR limitations and insufficient GUI-specific knowledge.

  • It presents three main datasets—GUIEnv, GUIAct, and GUIChat—comprising annotated web pages, action instructions, and QA pairs to bolster text recognition, interaction understanding, and conversational skills.

  • Experimental validation demonstrates significant performance improvements in GUI agents trained on GUICourse, highlighting practical advances in GUI navigation and interaction capabilities.

Overview of "GUICourse: From General Vision Language Model to Versatile GUI Agent"

The paper presents GUICourse, a comprehensive suite of datasets aimed at enhancing the capabilities of Vision Language Models (VLMs) to function as versatile agents for GUI navigation tasks. The work addresses two critical limitations of current VLMs, weak OCR and grounding abilities and insufficient GUI-specific knowledge, both of which stand in the way of practical GUI agents.

Contributions

The key contributions are multifaceted:

GUIEnv Dataset: This extensive dataset is designed to bolster the OCR and grounding abilities of VLMs. It includes 10 million website page-annotation pairs for pre-training and an additional 0.7 million region-text QA pairs for supervised fine-tuning (SFT). GUIEnv is split into two components:

  • GUIEnv-global: Entire page screenshots annotated with detailed textual content and layout information.
  • GUIEnv-local: Focused QA pairs about specific regions, aimed at refining text recognition and element localization (an illustrative sample format is sketched after this list).
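
To make the GUIEnv-local format concrete, here is a minimal sketch of what a region-text QA pair might look like. The field names, the question phrasing, and the normalized-coordinate convention are illustrative assumptions, not the released dataset's actual schema.

```python
# Hypothetical GUIEnv-local samples (field names and the normalized (x1, y1, x2, y2)
# coordinate convention are assumptions for illustration, not the real schema).

# OCR-style sample: given a region of the screenshot, produce the text it contains.
bbox2text_sample = {
    "image": "screenshots/page_00042.png",
    "question": "What text is inside the box (0.12, 0.30, 0.48, 0.34)?",
    "answer": "Sign in to your account",
    "bbox": [0.12, 0.30, 0.48, 0.34],
}

# Grounding-style sample: given a piece of text, locate its bounding box.
text2bbox_sample = {
    "image": "screenshots/page_00042.png",
    "question": "Where is the text 'Sign in to your account'?",
    "answer": "(0.12, 0.30, 0.48, 0.34)",
    "bbox": [0.12, 0.30, 0.48, 0.34],
}
```

Pairing the two directions in this way is what lets a single dataset exercise both OCR (region to text) and grounding (text to region).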

GUIAct Dataset: This dataset enriches VLMs' understanding of GUI components and their interactions, fostering improved knowledge of GUI controls. It spans web and smartphone scenarios and comprises the following (an illustrative action-record sketch appears after the list):

  • Web-Single: 67,000 instructions for single-step actions.
  • Web-Multi: 44,000 instructions for multi-step actions.
  • Smartphone: 67,000 instructions adapted from a subset of the AITW dataset.
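
As a rough illustration of how a GUIAct action record could be represented, the sketch below defines a simple action data structure and a toy episode. The action names, fields, and coordinate convention are assumptions made for illustration and do not reproduce the paper's exact unified action space.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GUIAction:
    """One low-level GUI action (illustrative schema, not the paper's actual format)."""
    name: str                                      # e.g. "click", "input", "scroll"
    target: Optional[Tuple[float, float]] = None   # normalized (x, y) point on the screenshot
    text: Optional[str] = None                     # text to type for "input"-style actions

# A hypothetical multi-step web episode: an instruction plus the actions that fulfill it.
instruction = "Search for 'wireless headphones' on the shopping site."
episode = [
    GUIAction(name="click", target=(0.52, 0.08)),          # focus the search box
    GUIAction(name="input", text="wireless headphones"),   # type the query
    GUIAction(name="click", target=(0.78, 0.08)),          # press the search button
]

for step in episode:
    print(step)
```

A Web-Single sample would pair an instruction with just one such action, while Web-Multi and Smartphone samples chain several of them.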

GUIChat Dataset: This dataset emphasizes interaction capabilities, featuring 44,000 single-turn QA pairs and 6,000 multi-turn dialogues grounded in text-rich imagery from web pages. It is crafted to enhance conversational skills and context understanding in GUI agents.
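
To give a feel for the GUIChat data, here is a hedged sketch of a grounded single-turn QA pair. The structure and the <box> tag convention are illustrative assumptions rather than the dataset's actual markup.

```python
# Illustrative grounded QA pair over a text-rich web screenshot
# (the field names and the <box> tag convention are assumptions).
guichat_sample = {
    "image": "screenshots/news_site.png",
    "question": "What is the headline of the top article, and where is it on the page?",
    "answer": "The headline is 'Global markets rally' "
              "<box>(0.10, 0.15, 0.62, 0.20)</box>.",
}

print(guichat_sample["answer"])
```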

Experimental Validation

The efficacy of these datasets was validated through rigorous experiments. The authors trained several GUI agents—Qwen-GUI, Fuyu-GUI, and MiniCPM-GUI—on top of existing VLMs (Qwen-VL, Fuyu-8B, and MiniCPM-V), and evaluated them on established GUI navigation benchmarks:

Mind2Web: This dataset tests high-level GUI navigation tasks across diverse web environments. The results illustrated significant improvements in step success rates and other metrics upon incorporating the GUICourse datasets.

  • For instance, Qwen-GUI demonstrated notable performance gains over its baseline Qwen-VL, particularly in step success rate (StepSR); a sketch of how such a metric is typically computed follows.
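
For readers unfamiliar with the metric, the sketch below shows one common way a step success rate is computed: a step counts as successful only when both the selected element and the predicted operation (the action type plus any typed value) match the reference. This follows the usual Mind2Web-style definition and is an illustrative sketch, not the paper's actual evaluation script.

```python
def step_success_rate(predictions, references):
    """Fraction of steps where both the chosen element and the operation are correct.

    Each item is a dict like {"element": "input#q", "action": "TYPE", "value": "..."}.
    Illustrative Mind2Web-style StepSR; not the paper's exact evaluation code.
    """
    assert len(predictions) == len(references)
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        element_ok = pred["element"] == ref["element"]
        operation_ok = (pred["action"] == ref["action"]
                        and pred.get("value", "") == ref.get("value", ""))
        correct += int(element_ok and operation_ok)
    return correct / len(references)

# Toy usage:
preds = [{"element": "input#q", "action": "TYPE", "value": "hotels in Paris"}]
golds = [{"element": "input#q", "action": "TYPE", "value": "hotels in Paris"}]
print(step_success_rate(preds, golds))  # 1.0
```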

AITW: This dataset focuses on smartphone navigation tasks, featuring simpler scenarios compared to Mind2Web. Here too, the GUI agents trained on GUICourse outperformed their baseline counterparts.

Ablation Studies

The authors conducted detailed ablation studies on the MiniCPM-GUI agent, analyzing several factors (a schematic view of the two training stages and their data mixtures follows the list):

  • Impact of Different Amounts of GUIEnv Data: The study found a positive correlation between the volume of GUIEnv data and the performance improvements in OCR and grounding tasks.
  • Addition of High-Resolution Data: Improving the image resolution substantially enhanced the performance on GUI navigation tasks.
  • Incorporation of GUIChat Data: Mixing GUIChat data during training further improved the agents' performance, particularly in complex, multi-step web navigation tasks.
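
The training pipeline behind these ablations can be pictured roughly as a two-stage recipe: pre-training on GUIEnv-global to strengthen OCR and grounding, then supervised fine-tuning on GUIEnv-local, GUIAct, and GUIChat. The sketch below is purely schematic; the stage split as shown, the field names, and the knob list are assumptions based on the summary above, not the paper's exact training configuration.

```python
# Schematic two-stage training plan (illustrative; not the paper's actual recipe
# or hyperparameters).
training_plan = {
    "stage_1_pretraining": {
        "data": ["GUIEnv-global"],                      # full-page screenshots with text + layout
        "goal": "strengthen OCR and grounding",
    },
    "stage_2_sft": {
        "data": ["GUIEnv-local", "GUIAct", "GUIChat"],  # region QA, action instructions, dialogues
        "goal": "teach GUI knowledge, action prediction, and conversation",
        "ablation_knobs": [
            "amount of GUIEnv data",        # more data correlated with better OCR/grounding
            "input image resolution",       # higher resolution helped navigation
            "whether GUIChat is mixed in",  # helped multi-step web navigation
        ],
    },
}

for stage, config in training_plan.items():
    print(stage, "->", ", ".join(config["data"]))
```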

Theoretical and Practical Implications

The advancements presented in this paper have profound implications for both practical applications and theoretical research in AI. The practical significance is twofold:

  1. Enhanced GUI Navigation: By addressing fundamental OCR and grounding issues, the datasets enable the development of more robust and reliable GUI agents capable of handling diverse scenarios.
  2. Improved Interaction Capabilities: The inclusion of GUIChat facilitates more natural and effective conversational interactions, broadening the utility of GUI agents.

Theoretically, the work underscores the importance of domain-specific datasets in fine-tuning general models for specialized tasks. The success of GUICourse demonstrates how targeted data can substantially elevate the performance of general VLMs in niche areas.

Future Developments

The paper opens avenues for several future research directions:

  1. Scaling GUI Knowledge: Extending GUIAct to incorporate more diverse and complex GUI systems, including desktop environments and specialized software, could further improve the versatility of GUI agents.
  2. Reinforcement Learning: Integrating reinforcement learning techniques, such as RLHF, could optimize the agents' decision-making processes, enhancing their efficiency and robustness.
  3. Ethical Dataset Development: Ensuring the exclusion of any potentially offensive or inappropriate content during dataset curation remains an ongoing challenge, especially for large-scale web data.

In summary, GUICourse represents a significant advancement in applying VLMs to practical GUI navigation tasks. By enhancing OCR and grounding capabilities, enriching GUI-specific knowledge, and improving interaction capabilities, it lays a robust foundation for the future development of versatile and intelligent GUI agents.
