GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

(2406.10819)
Published Jun 16, 2024 in cs.CV, cs.AI, and cs.CL

Abstract

Recently, Multimodal LLMs (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content without manually annotated keyframes or operation history. On the other hand, VideoLLMs fall short in all GUI-oriented tasks given the sparse GUI video dataset. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limitations in the performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The code and dataset are publicly available at our project homepage: https://gui-world.github.io/.

GUI-World dataset for GUI understanding, showcasing potential for real-world applications using selected screenshots.

Overview

  • The paper presents GUI-World, a new dataset designed to enhance Multimodal LLMs (MLLMs) by offering over 12,000 annotated GUI interaction videos covering diverse applications and scenarios.

  • Benchmarking shows that existing MLLMs, including models like GPT-4V, excel in static GUI understanding but struggle with dynamic and sequential tasks, emphasizing the need for improved keyframe extraction techniques.

  • A fine-tuned VideoLLM model called GUI-Vid is introduced, demonstrating significant performance improvements in understanding and interacting with dynamic GUIs through a two-phased training approach.

An Analytical Review of "GUI-World: A Dataset for GUI-oriented Multimodal LLM-based Agents"

The paper "GUI-World: A Dataset for GUI-oriented Multimodal LLM-based Agents" introduces an extensive dataset designed to enhance the capabilities of Multimodal LLMs (MLLMs) in understanding and interacting with Graphical User Interfaces (GUIs). The dataset, termed GUI-World, aims to address the primary challenges faced by current MLLMs in processing dynamic GUI content and performing multiple-step tasks across diverse GUI scenarios. The paper also explore benchmarking state-of-the-art MLLMs and fine-tuning VideoLLMs to improve their performance on GUI-oriented tasks.

Dataset Construction and Scope

The GUI-World dataset comprises over 12,000 GUI videos encompassing a variety of scenarios including software applications, websites, mobile applications (both iOS and Android), multi-window interactions, and extended reality (XR) environments. The dataset is meticulously annotated through a Human-MLLM collaborative approach, ensuring a diverse set of queries and instructions. This includes a combination of free-form questions, multiple-choice questions, and conversational queries tailored to evaluate static, dynamic, and sequential GUI content.
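To make this structure concrete, the following is a minimal sketch of how a single GUI-World-style sample might be represented; the field names and types are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GUISample:
    """Hypothetical record for one GUI-World-style example (illustrative only)."""
    video_path: str                 # screen recording of the GUI interaction
    scenario: str                   # e.g. "software", "website", "iOS", "Android", "multi-window", "XR"
    keyframes: List[str] = field(default_factory=list)  # paths to extracted/annotated keyframes
    content_type: str = "dynamic"   # "static", "dynamic", or "sequential"
    question_format: str = "free-form"                  # "free-form", "multiple-choice", or "conversation"
    question: str = ""
    choices: List[str] = field(default_factory=list)    # populated only for multiple-choice items
    answer: str = ""                # human-verified reference answer
```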

The data annotation process involves human annotators recording GUI interactions and keyframe extraction, which are then enhanced by LLM-generated annotations. This collaborative method ensures high-quality, comprehensive annotations that cover various GUI elements like web icons, text via OCR, and page layouts. The dataset is designed to bridge the gap between static GUI understanding and the need to handle dynamic and complex GUI tasks, which typical datasets have not addressed adequately.
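A compressed sketch of such a Human-MLLM collaborative loop is shown below; `mllm_generate` and `human_review` are hypothetical stand-ins for whatever model API and review tooling an annotation team would actually use, and the prompt is illustrative rather than the paper's own.

```python
def annotate_clip(keyframes, ocr_text, mllm_generate, human_review):
    """Draft an annotation with an MLLM, then have a human annotator refine it.

    `mllm_generate(prompt, images)` and `human_review(draft)` are assumed
    callables; the paper's real pipeline uses more elaborate prompts and checks.
    """
    prompt = (
        "Describe the GUI shown in these keyframes, covering icons, layout, "
        f"and visible text (OCR: {ocr_text}). Then draft one question and "
        "answer about what changes across the frames."
    )
    draft = mllm_generate(prompt, images=keyframes)  # MLLM-generated draft annotation
    return human_review(draft)                       # human edits or approves the draft
```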

Benchmarking MLLMs

The paper benchmarks several advanced MLLMs, including commercial models like GPT-4V, Gemini-Pro-1.5, and Qwen-VL-Max, as well as open-source models. Despite the noted proficiency of these models in static GUI comprehension, their performance diminishes when faced with dynamic and sequential tasks. For instance, GPT-4V and GPT-4o exhibit strong performance in static content retrieval but struggle with tasks requiring an understanding of dynamic GUI changes.
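For the multiple-choice portion of the benchmark, the evaluation loop can be pictured roughly as follows; `query_model` is a generic stand-in for whichever MLLM is under test, and the letter-matching scoring is a simplification of the paper's protocol.

```python
def evaluate_mcq(samples, query_model):
    """Rough multiple-choice accuracy loop over GUISample-like records (see sketch above)."""
    correct = 0
    for s in samples:
        options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(s.choices))
        prompt = f"{s.question}\nOptions:\n{options}\nAnswer with a single letter."
        prediction = query_model(images=s.keyframes, prompt=prompt).strip()
        # Assumes the reference answer is stored as an option letter ("A", "B", ...).
        if prediction[:1].upper() == s.answer[:1].upper():
            correct += 1
    return correct / max(len(samples), 1)
```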

Interestingly, the analysis reveals that the selection method for keyframes significantly impacts model performance. Randomly selected and human-annotated keyframes tend to yield better results compared to those extracted programmatically. This suggests that existing technologies for natural video keyframe extraction are inadequate for capturing essential GUI operations, highlighting a crucial area for future improvement.
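The two cheap selection strategies can be sketched as below: random sampling corresponds to the "randomly selected" condition, while evenly spaced sampling stands in for one common programmatic heuristic (real keyframe extractors for natural video typically rely on frame differencing or feature similarity instead), and human-annotated keyframes come from the dataset itself.

```python
import random

def random_keyframes(num_frames, k, seed=0):
    """Randomly chosen frame indices, kept in temporal order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), k=min(k, num_frames)))

def uniform_keyframes(num_frames, k):
    """Evenly spaced frame indices, a simple programmatic baseline."""
    if k <= 0 or num_frames == 0:
        return []
    step = max(num_frames // k, 1)
    return list(range(0, num_frames, step))[:k]
```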

Development of GUI-Vid

The paper introduces GUI-Vid, a fine-tuned VideoLLM trained on the GUI-World dataset. The fine-tuning proceeds in two phases: the first aligns basic GUI understanding through text-image pairs, while the second focuses on more complex tasks such as sequential image reasoning and dynamic content analysis. The resulting model shows superior performance, substantially improving on baseline models and even surpassing some commercial models in specific tasks like captioning and sequential analysis.
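The staged recipe can be summarized schematically as follows; the stage names, trainable modules, and data mixes are assumptions meant to convey the shape of the procedure, not GUI-Vid's exact configuration.

```python
# Illustrative two-phase fine-tuning schedule (all values are placeholders).
FINETUNE_STAGES = [
    {
        "name": "phase1_gui_alignment",
        "data": "GUI text-image pairs (captions, OCR text, element descriptions)",
        "trainable": ["vision_projection"],  # align visual features with the language model
    },
    {
        "name": "phase2_dynamic_reasoning",
        "data": "keyframe sequences with QA, conversation, and captioning targets",
        "trainable": ["vision_projection", "language_model_adapters"],  # assumed adapter-style tuning
    },
]
```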

Experimental Insights

The experiments underscore a significant finding: vision perception remains a critical component for effective sequential GUI task handling. Even though integrating detailed textual information can slightly enhance performance, the inherent ability to process and interpret visual changes within GUIs proves to be indispensable. Additionally, the study illustrates that augmenting the model with a higher number of keyframes and increased resolution enhances overall performance, pointing towards potential pathways for further advancements.
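A toy ablation over those two factors could look like the grid below, reusing the `evaluate_mcq` sketch from earlier; `prepare_samples` is a hypothetical loader that re-extracts keyframes at the requested count and resolution.

```python
def ablate_frames_and_resolution(prepare_samples, query_model):
    """Score the model across a small grid of keyframe counts and input resolutions."""
    results = {}
    for num_frames in (4, 8, 16):
        for resolution in (224, 448):
            samples = prepare_samples(num_frames=num_frames, resolution=resolution)
            results[(num_frames, resolution)] = evaluate_mcq(samples, query_model)
    return results
```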

Implications and Future Prospects

The introduction of GUI-World is poised to have profound implications, both practical and theoretical. Practically, this dataset can serve as a robust benchmark to guide the development of more capable GUI-oriented MLLMs. The data's diversity and annotation quality will likely spur research into more sophisticated methods for GUI content interaction and comprehension, extending the use cases of MLLMs in real-world applications.

Theoretically, GUI-World opens avenues for exploring the integration of dynamic temporal information into existing MLLMs, addressing current limitations in handling sequential and multi-step tasks. Future developments may focus on enhancing keyframe extraction techniques, creating more specialized pretraining for GUI tasks, and improving the underlying architectures of VideoLLMs to better align with the unique demands of GUI environments.

In conclusion, the paper offers significant contributions to the field by providing a comprehensive dataset that captures the intricate and varied nature of GUIs. It highlights the limitations of current models and suggests practical pathways for improvement through rigorous benchmarking and targeted model enhancements. GUI-World stands as a pivotal resource for advancing MLLM capabilities in GUI understanding and interaction.
