GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

(2406.10819)
Published Jun 16, 2024 in cs.CV, cs.AI, and cs.CL

Abstract

Recently, Multimodal LLMs (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content without manually annotated keyframes or operation history. On the other hand, VideoLLMs fall short in all GUI-oriented tasks given the sparse GUI video dataset. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limitations in the performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The code and dataset are publicly available at our project homepage: https://gui-world.github.io/.

GUI-World dataset for GUI understanding, showcasing potential for real-world applications using selected screenshots.

Overview

  • The paper presents GUI-World, a new dataset designed to enhance Multimodal LLMs (MLLMs) by offering over 12,000 annotated GUI interaction videos covering diverse applications and scenarios.

  • Benchmarking shows that existing MLLMs, including models like GPT-4V, excel in static GUI understanding but struggle with dynamic and sequential tasks, emphasizing the need for improved keyframe extraction techniques.

  • A fine-tuned VideoLLM model called GUI-Vid is introduced, demonstrating significant performance improvements in understanding and interacting with dynamic GUIs through a two-phased training approach.

An Analytical Review of "GUI-World: A Dataset for GUI-oriented Multimodal LLM-based Agents"

The paper "GUI-World: A Dataset for GUI-oriented Multimodal LLM-based Agents" introduces an extensive dataset designed to enhance the capabilities of Multimodal LLMs (MLLMs) in understanding and interacting with Graphical User Interfaces (GUIs). The dataset, termed GUI-World, aims to address the primary challenges faced by current MLLMs in processing dynamic GUI content and performing multiple-step tasks across diverse GUI scenarios. The paper also explore benchmarking state-of-the-art MLLMs and fine-tuning VideoLLMs to improve their performance on GUI-oriented tasks.

Dataset Construction and Scope

The GUI-World dataset comprises over 12,000 GUI videos encompassing a variety of scenarios including software applications, websites, mobile applications (both iOS and Android), multi-window interactions, and extended reality (XR) environments. The dataset is meticulously annotated through a Human-MLLM collaborative approach, ensuring a diverse set of queries and instructions. This includes a combination of free-form questions, multiple-choice questions, and conversational queries tailored to evaluate static, dynamic, and sequential GUI content.
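To make this structure concrete, the following is a minimal sketch of how a single GUI-World-style sample might be represented; the field names and types are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GUISample:
    """Hypothetical record for one GUI-World-style example (illustrative only)."""
    video_path: str                 # screen recording of the GUI interaction
    scenario: str                   # e.g. "software", "website", "iOS", "Android", "multi-window", "XR"
    keyframes: List[str] = field(default_factory=list)  # paths to extracted/annotated keyframes
    content_type: str = "dynamic"   # "static", "dynamic", or "sequential"
    question_format: str = "free-form"                  # "free-form", "multiple-choice", or "conversation"
    question: str = ""
    choices: List[str] = field(default_factory=list)    # populated only for multiple-choice items
    answer: str = ""                # human-verified reference answer
```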

The data annotation process involves human annotators recording GUI interactions and keyframe extraction, which are then enhanced by LLM-generated annotations. This collaborative method ensures high-quality, comprehensive annotations that cover various GUI elements like web icons, text via OCR, and page layouts. The dataset is designed to bridge the gap between static GUI understanding and the need to handle dynamic and complex GUI tasks, which typical datasets have not addressed adequately.
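A compressed sketch of such a Human-MLLM collaborative loop is shown below; `mllm_generate` and `human_review` are hypothetical stand-ins for whatever model API and review tooling an annotation team would actually use, and the prompt is illustrative rather than the paper's own.

```python
def annotate_clip(keyframes, ocr_text, mllm_generate, human_review):
    """Draft an annotation with an MLLM, then have a human annotator refine it.

    `mllm_generate(prompt, images)` and `human_review(draft)` are assumed
    callables; the paper's real pipeline uses more elaborate prompts and checks.
    """
    prompt = (
        "Describe the GUI shown in these keyframes, covering icons, layout, "
        f"and visible text (OCR: {ocr_text}). Then draft one question and "
        "answer about what changes across the frames."
    )
    draft = mllm_generate(prompt, images=keyframes)  # MLLM-generated draft annotation
    return human_review(draft)                       # human edits or approves the draft
```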

Benchmarking MLLMs

The paper benchmarks several advanced MLLMs, including commercial models like GPT-4V, Gemini-Pro-1.5, and Qwen-VL-Max, as well as open-source models. Despite the noted proficiency of these models in static GUI comprehension, their performance diminishes when faced with dynamic and sequential tasks. For instance, GPT-4V and GPT-4o exhibit strong performance in static content retrieval but struggle with tasks requiring an understanding of dynamic GUI changes.
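For the multiple-choice portion of the benchmark, the evaluation loop can be pictured roughly as follows; `query_model` is a generic stand-in for whichever MLLM is under test, and the letter-matching scoring is a simplification of the paper's protocol.

```python
def evaluate_mcq(samples, query_model):
    """Rough multiple-choice accuracy loop over GUISample-like records (see sketch above)."""
    correct = 0
    for s in samples:
        options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(s.choices))
        prompt = f"{s.question}\nOptions:\n{options}\nAnswer with a single letter."
        prediction = query_model(images=s.keyframes, prompt=prompt).strip()
        # Assumes the reference answer is stored as an option letter ("A", "B", ...).
        if prediction[:1].upper() == s.answer[:1].upper():
            correct += 1
    return correct / max(len(samples), 1)
```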

Interestingly, the analysis reveals that the selection method for keyframes significantly impacts model performance. Randomly selected and human-annotated keyframes tend to yield better results compared to those extracted programmatically. This suggests that existing technologies for natural video keyframe extraction are inadequate for capturing essential GUI operations, highlighting a crucial area for future improvement.
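The two cheap selection strategies can be sketched as below: random sampling corresponds to the "randomly selected" condition, while evenly spaced sampling stands in for one common programmatic heuristic (real keyframe extractors for natural video typically rely on frame differencing or feature similarity instead), and human-annotated keyframes come from the dataset itself.

```python
import random

def random_keyframes(num_frames, k, seed=0):
    """Randomly chosen frame indices, kept in temporal order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), k=min(k, num_frames)))

def uniform_keyframes(num_frames, k):
    """Evenly spaced frame indices, a simple programmatic baseline."""
    if k <= 0 or num_frames == 0:
        return []
    step = max(num_frames // k, 1)
    return list(range(0, num_frames, step))[:k]
```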

Development of GUI-Vid

The paper introduces GUI-Vid, a fine-tuned VideoLLM trained on the GUI-World dataset. The fine-tuning proceeds in two phases: the first aligns basic GUI understanding through text-image pairs, while the second focuses on more complex tasks such as sequential image reasoning and dynamic content analysis. The resulting model shows superior performance, substantially improving on baseline models and even surpassing some commercial models in specific tasks like captioning and sequential analysis.
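The staged recipe can be summarized schematically as follows; the stage names, trainable modules, and data mixes are assumptions meant to convey the shape of the procedure, not GUI-Vid's exact configuration.

```python
# Illustrative two-phase fine-tuning schedule (all values are placeholders).
FINETUNE_STAGES = [
    {
        "name": "phase1_gui_alignment",
        "data": "GUI text-image pairs (captions, OCR text, element descriptions)",
        "trainable": ["vision_projection"],  # align visual features with the language model
    },
    {
        "name": "phase2_dynamic_reasoning",
        "data": "keyframe sequences with QA, conversation, and captioning targets",
        "trainable": ["vision_projection", "language_model_adapters"],  # assumed adapter-style tuning
    },
]
```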

Experimental Insights

The experiments underscore a significant finding: vision perception remains a critical component for effective sequential GUI task handling. Even though integrating detailed textual information can slightly enhance performance, the inherent ability to process and interpret visual changes within GUIs proves to be indispensable. Additionally, the study illustrates that augmenting the model with a higher number of keyframes and increased resolution enhances overall performance, pointing towards potential pathways for further advancements.
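A toy ablation over those two factors could look like the grid below, reusing the `evaluate_mcq` sketch from earlier; `prepare_samples` is a hypothetical loader that re-extracts keyframes at the requested count and resolution.

```python
def ablate_frames_and_resolution(prepare_samples, query_model):
    """Score the model across a small grid of keyframe counts and input resolutions."""
    results = {}
    for num_frames in (4, 8, 16):
        for resolution in (224, 448):
            samples = prepare_samples(num_frames=num_frames, resolution=resolution)
            results[(num_frames, resolution)] = evaluate_mcq(samples, query_model)
    return results
```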

Implications and Future Prospects

The introduction of GUI-World is poised to have profound implications, both practical and theoretical. Practically, this dataset can serve as a robust benchmark to guide the development of more capable GUI-oriented MLLMs. The data's diversity and annotation quality will likely spur research into more sophisticated methods for GUI content interaction and comprehension, extending the use cases of MLLMs in real-world applications.

Theoretically, GUI-World opens avenues for exploring the integration of dynamic temporal information into existing MLLMs, addressing current limitations in handling sequential and multi-step tasks. Future developments may focus on enhancing keyframe extraction techniques, creating more specialized pretraining for GUI tasks, and improving the underlying architectures of VideoLLMs to better align with the unique demands of GUI environments.

In conclusion, the paper offers significant contributions to the field by providing a comprehensive dataset that captures the intricate and varied nature of GUIs. It highlights the limitations of current models and suggests practical pathways for improvement through rigorous benchmarking and targeted model enhancements. GUI-World stands as a pivotal resource for advancing MLLM capabilities in GUI understanding and interaction.
