OmniParser for Pure Vision Based GUI Agent

(2408.00203)
Published Aug 1, 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

The recent success of large vision-language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent on multiple operating systems across different applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages and an icon description dataset. These datasets were used to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.

Figure: Parsed screenshot with bounding boxes and local semantics, including text and icon descriptions.

Overview

  • The paper introduces OmniParser, a vision-based screen parsing technique to enhance interaction accuracy of large vision-language models (e.g., GPT-4V) with user interfaces.

  • OmniParser was validated using a large curated dataset and fine-tuned detection models, achieving significant performance improvements on benchmarks including ScreenSpot, Mind2Web, and AITW.

  • The methodology integrates detection models, OCR, and captioning to provide detailed descriptions of UI elements, boosting the accuracy of action predictions by vision-language models.

OmniParser for Pure Vision Based GUI Agent: An Overview

The paper "OmniParser for Pure Vision Based GUI Agent" presents a method to enhance the capabilities of large vision-language models (VL models), specifically GPT-4V, in tasks that involve interaction with user interfaces (UI) across various platforms. The research addresses two fundamental challenges: the reliable identification of interactable icons within UIs and the accurate association of intended actions with specific regions of the screen. To this end, the authors introduce OmniParser, a vision-based screen parsing technique designed to enhance the robustness and accuracy of action predictions in user interfaces.

Key Contributions

  1. Interactable Region Detection Dataset: The authors curated a dataset from popular webpages, with bounding boxes for interactable regions derived from the Document Object Model (DOM) tree of each page. The dataset comprises 67,000 unique screenshots and was used to fine-tune a detection model specifically for interactable icons (a simplified collection sketch follows this list).
  2. Fine-tuned Detection Models: OmniParser utilizes a fine-tuned detection model for parsing interactable regions on the screen and a caption model for extracting functional semantics of detected elements. This combination enables the VL models to better understand and predict actions grounded in UI screens.
  3. Evaluation on Benchmarks: The effectiveness of OmniParser was demonstrated through evaluations on several benchmarks including ScreenSpot, Mind2Web, and AITW. The results indicated significant improvements over existing methods, particularly in scenarios requiring only visual input from the screenshot.
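
A minimal sketch of the DOM-based collection idea is shown below, assuming Playwright for rendering and a hand-picked selector of clickable elements; the selector, viewport, and output format are our illustrative choices, not the authors' exact pipeline.

```python
# Sketch: harvest interactable-region bounding boxes from a webpage's DOM.
# Playwright, the CSS selector, the viewport, and the JSON output format are
# illustrative assumptions, not the authors' exact collection pipeline.
import json
from playwright.sync_api import sync_playwright

CLICKABLE = "a, button, input, select, textarea, [role='button'], [onclick]"

def collect_boxes(url: str, out_prefix: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=f"{out_prefix}.png")

        boxes = []
        for el in page.query_selector_all(CLICKABLE):
            box = el.bounding_box()  # None if the element is not rendered
            if box and box["width"] > 0 and box["height"] > 0:
                boxes.append([box["x"], box["y"], box["width"], box["height"]])

        with open(f"{out_prefix}.json", "w") as f:
            json.dump({"url": url, "boxes": boxes}, f)
        browser.close()

if __name__ == "__main__":
    collect_boxes("https://example.com", "sample")
```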

Methodology

OmniParser's architecture integrates three main components: an interactable region detection model, an Optical Character Recognition (OCR) module, and a caption model that describes the function of detected icons. The detection model is fine-tuned on the curated dataset, while the OCR and caption modules supply detailed descriptions of the texts and icons within the UI. Together they yield a detailed, structured representation of the screen that helps vision-language models make precise action predictions.
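
A minimal sketch of how the three components could be combined into a single structured element list follows. It assumes a fine-tuned YOLOv8 checkpoint loaded through ultralytics, EasyOCR for text, and a caption_icon() placeholder standing in for the fine-tuned icon-description model; none of these library choices are prescribed by the paper.

```python
# Sketch of the three-component parse: detection + OCR + icon captioning,
# merged into one structured element list. Library choices (ultralytics
# YOLOv8, EasyOCR) and the caption_icon() stub are assumptions standing in
# for the paper's fine-tuned models, not its exact stack.
from dataclasses import dataclass
from typing import List

import easyocr
from PIL import Image
from ultralytics import YOLO

@dataclass
class UIElement:
    elem_id: int
    bbox: List[float]   # [x1, y1, x2, y2] in pixels
    kind: str           # "text" or "icon"
    description: str    # OCR text or generated functional caption

def caption_icon(crop: Image.Image) -> str:
    """Placeholder for the fine-tuned icon-description (caption) model."""
    return "icon"

def parse_screenshot(path: str, detector: YOLO, reader: easyocr.Reader) -> List[UIElement]:
    image = Image.open(path).convert("RGB")
    elements: List[UIElement] = []

    # 1) OCR: text regions plus their bounding boxes.
    for quad, text, _conf in reader.readtext(path):
        xs = [pt[0] for pt in quad]
        ys = [pt[1] for pt in quad]
        elements.append(UIElement(len(elements), [min(xs), min(ys), max(xs), max(ys)], "text", text))

    # 2) Detection: interactable regions from the fine-tuned detector,
    #    each cropped and passed to the caption model for a functional label.
    result = detector(path)[0]
    for xyxy in result.boxes.xyxy.tolist():
        crop = image.crop(tuple(int(v) for v in xyxy))
        elements.append(UIElement(len(elements), xyxy, "icon", caption_icon(crop)))

    return elements
```

Any de-duplication of overlapping OCR and icon boxes is omitted here for brevity.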

Details of the Components

  1. Interactable Region Detection: A YOLOv8 model, fine-tuned on bounding boxes derived from webpage DOM trees, identifies clickable regions directly from the screenshot. Fine-tuning substantially improved detection performance, underscoring the importance of accurately locating interactable elements.
  2. Incorporation of Local Semantics: By integrating local semantics, namely OCR-extracted text and generated icon descriptions, OmniParser reduces the load on GPT-4V and allows it to make more accurate action predictions. The functional description model, fine-tuned on an icon-description dataset, supports better context understanding and semantic reasoning (a prompt-construction sketch follows this list).
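
The parsed output ultimately reaches GPT-4V as an image-plus-text prompt. The sketch below shows one plausible serialization, overlaying element IDs on the screenshot and listing each element's local semantics; the drawing details and prompt wording are our assumptions, not the paper's exact format.

```python
# Sketch: turn the parsed element list into (a) an ID-annotated screenshot and
# (b) a textual list of local semantics for the GPT-4V prompt. Reuses the
# UIElement objects from the pipeline sketch above; drawing style and prompt
# wording are illustrative assumptions.
from PIL import Image, ImageDraw

def annotate(path: str, elements) -> Image.Image:
    image = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for el in elements:
        x1, y1, x2, y2 = el.bbox
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(el.elem_id), fill="red")
    return image

def local_semantics(elements) -> str:
    lines = [f"ID {el.elem_id} ({el.kind}): {el.description}" for el in elements]
    return "Interactable elements on screen:\n" + "\n".join(lines)

# The agent prompt pairs the annotated image with this text and asks GPT-4V to
# answer with an element ID plus an action (e.g. "CLICK ID=12"), so the model
# reasons over discrete IDs instead of raw pixel coordinates.
```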

Experimental Evaluation

SeeAssign Task

The SeeAssign task was designed to test the capability of GPT-4V in accurately identifying bounding box IDs given task descriptions. The performance improved from 0.705 to 0.938 when local semantics were incorporated, showcasing the effectiveness of the enhanced parsing method.

ScreenSpot Benchmark

On the ScreenSpot benchmark, OmniParser achieved significant gains compared to baseline models, including specialized models like SeeClick and CogAgent, even without additional HTML information. With local semantics and the detection model, OmniParser outperformed the baseline models across mobile, desktop, and web platforms.

Mind2Web Benchmark

When evaluated on the Mind2Web benchmark, OmniParser demonstrated superior performance in web navigation tasks. Notably, it surpassed models using HTML for augmenting web agents, highlighting the potential of a vision-based approach without relying on additional data outside the screenshot.

AITW Benchmark

In the AITW benchmark, OmniParser's use of a fine-tuned detection model and local semantics led to a 4.7% overall performance increase compared to GPT-4V augmented with specialized Android icon detection models. This further validates the method's efficacy across different platforms and applications.

Implications and Future Directions

The research presented in this paper points to substantial advances toward general GUI agents capable of operating across diverse platforms and applications. The proposed OmniParser method not only narrows the gap between vision-only agents and methods that rely on extra structured input such as HTML, but also sets a new standard for vision-based interaction models. By moving beyond the limitations of HTML-dependent approaches, OmniParser paves the way for more adaptable and robust AI agents.

Looking forward, further improvements can be envisioned through:

  • Integration of more sophisticated fine-tuning techniques for the detection and caption models.
  • Development of models that can better handle repeated elements and provide finer-grained predictions.
  • Extension of the methodology to encompass dynamic content and real-time interaction scenarios.

Overall, OmniParser significantly advances the field of vision-based GUI agents, setting a benchmark for future research and development in AI-driven user interface interaction.
