OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

(2402.17553)
Published Feb 27, 2024 in cs.AI, cs.CL, cs.CV, and cs.HC

Abstract

For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, a first-of-its-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a pair of a screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark. However, its performance still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge LLMs and the visual grounding of computer screens.

Figure: Baseline model architecture filters UI elements and uses them in a prompt for automation script generation.

Overview

  • OmniACT introduces a novel dataset and benchmark for assessing autonomous agents' ability to perform tasks specified by natural language instructions in desktop and web applications.

  • The dataset includes over 9.8K task pairs across diverse operating systems (macOS, Windows, Linux) and web domains, with a focus on visually-grounded instructions and UI navigation.

  • Even the strongest baseline, GPT-4, achieves only 15% of human proficiency on OmniACT tasks, highlighting their complexity and the need for advances in multimodal AI models.

  • The paper emphasizes the practical implications for making technology accessible and the theoretical importance of developing sophisticated multimodal models that integrate visual cues with natural language processing.

OmniACT: Setting New Benchmarks for Multimodal Autonomous Agents in Desktop and Web Environments

Overview

Recent advancements in AI have aimed to simplify human-computer interactions by developing autonomous virtual agents capable of executing tasks with minimal human input. These tasks, ranging from mundane activities like playing music to more complex sequences such as sending emails, significantly depend on the agent's ability to interpret natural language instructions and transform them into executable actions. Despite the proliferation of such intelligent systems, the gap between human proficiency and autonomous agents remains vast, particularly in multimodal contexts involving both desktop and web applications. To bridge this gap, the paper introduces OmniACT, a novel dataset and benchmark designed to assess the capabilities of autonomous agents in generating executable programs for comprehensive computer tasks based on visually-grounded natural language instructions.

OmniACT Dataset: A New Frontier

The OmniACT dataset is unprecedented in its scope, encompassing a wide array of tasks across various desktop and web applications. With over 9.8K task pairs, each coupling a screenshot of a user interface (UI) with a corresponding natural language instruction, OmniACT extends beyond conventional web automation. The dataset's unique challenge lies in the agent's need to navigate different operating systems (macOS, Windows, Linux) and web domains, making it the first dataset to cover such a diverse range of applications for autonomous agents.
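
For concreteness, a single task pair might be represented roughly as follows. The field names and values here are hypothetical illustrations of the screenshot-plus-instruction-plus-script structure, not the dataset's actual schema.

```python
# Hypothetical representation of one OmniACT-style task pair.
# Field names and values are illustrative assumptions, not the
# dataset's published schema.
sample = {
    "screenshot": "screenshots/spotify_macos_001.png",  # UI screenshot
    "task": "Play the next song",                       # NL instruction
    "gold_script": "pyautogui.click(843, 1054)",        # executable action
    "platform": "macOS",                                # OS or web domain
}
```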

Methodological Insights

The paper lays out an exhaustive methodology for dataset preparation, focusing on the compilation of tasks that span multiple domains on both desktop and web applications. By carefully annotating UI elements and collecting tasks through human annotation, the researchers ensured the dataset's relevance and complexity. Key to this process was the authoring of executable PyAutoGUI scripts for each task, offering a pragmatic approach to automating user interactions across varied applications.
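
As a rough illustration of what such a script might look like, the sketch below uses standard PyAutoGUI calls for the "Send an email to John Doe mentioning the time and place to meet" task from the abstract. All coordinates and the keyboard shortcut are assumptions; in OmniACT-style data they would be grounded in the paired screenshot and target application.

```python
import pyautogui

# Hedged sketch of a gold script; every coordinate below is made up.
pyautogui.click(120, 340)                       # click the "Compose" button
pyautogui.typewrite("john.doe@example.com")     # fill in the recipient
pyautogui.press("tab")                          # move to the subject field
pyautogui.typewrite("Meeting")                  # type the subject
pyautogui.press("tab")                          # move to the message body
pyautogui.typewrite("Let's meet at 3 pm at the cafe.")
pyautogui.hotkey("ctrl", "enter")               # send (a common mail shortcut)
```

The coordinate-based clicks are what make visual grounding essential: the agent can only produce the right arguments if it reads the screenshot to locate elements like the "Compose" button.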

Performance Benchmarking

Evaluating several state-of-the-art language-model-based agents, including GPT-4, the study makes concrete the challenges inherent in the OmniACT benchmark. Despite GPT-4's superior performance relative to other baselines, it achieves only 15% of human proficiency, underscoring the significant challenge the OmniACT tasks present to current AI models. This finding not only reflects the dataset's complexity but also highlights the necessity for advancements in multimodal models that can better understand and interact with both visual and textual information.
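
The 15% figure depends on how generated scripts are scored against gold scripts. The paper's own metrics aside, a distance-aware click score of the following flavor conveys the general idea of grading coordinate predictions gradually rather than all-or-nothing. This is an illustrative sketch under assumed parameters, not the paper's evaluation code; `click_score` and its decay constant are inventions for this example.

```python
import math

def click_score(pred_xy, gold_xy, scale=100.0):
    """Distance-decayed score for a predicted click (illustrative only).

    `scale` is an assumed decay constant in pixels; this sketches the
    idea of penalizing off-target clicks, not the paper's metric.
    """
    dist = math.dist(pred_xy, gold_xy)
    return math.exp(-dist / scale)

print(click_score((170, 340), (120, 340)))  # 50 px off -> ~0.61
print(click_score((120, 340), (120, 340)))  # exact hit -> 1.0
```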

Implications and Future Directions

The implications of this research are twofold. Practically, improving autonomous agents' performance on OmniACT tasks could revolutionize how we interact with computers, making technology more accessible to users with limited technical skills and streamlining routine tasks. Theoretically, the research underscores the importance of developing more sophisticated multimodal models that integrate visual cues with natural language processing. As such models evolve, we can anticipate significant breakthroughs in AI's ability to understand and navigate complex, multimodal environments.

Concluding Thoughts

In conclusion, OmniACT represents a substantial step forward in the quest to develop generalist autonomous agents capable of executing a broad spectrum of computer tasks. By providing a challenging benchmark, the dataset not only facilitates the evaluation of current AI models but also sets a clear direction for future research. Enhancing the capabilities of autonomous agents in this domain will undoubtedly have far-reaching implications, from the democratization of technology to the automation of laborious tasks, heralding a new era in human-computer interaction.
