Emergent Mind

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human intervention have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.

OSWorld: A scalable environment for evaluating agents on open-ended computer tasks across operating systems.

Overview

  • OSWorld introduces a scalable real computer environment for training and evaluating autonomous digital agents capable of handling diverse tasks across different operating systems (Ubuntu, Windows, macOS).

  • It leverages virtual machine technology to support task scalability and provides a benchmark comprising 369 real-world tasks to assess agent performance comprehensively.

  • Extensive evaluation of large language model (LLM) and vision-language model (VLM) agents reveals a significant performance gap between these models and human users, highlighting challenges in GUI grounding and operational knowledge.

  • The research underscores the need to enhance agents' understanding of GUI elements and their decision-making capabilities, and suggests refining VLMs for more reliable GUI interactions.

Introducing OSWorld: A Real Computer Environment for Multimodal Agent Training and Evaluation

Overview of OSWorld

OSWorld is introduced as the first environment of its kind, offering a fully interactive and scalable real computer environment for the development and assessment of autonomous digital agents capable of handling diverse computer tasks. Unlike prior benchmarks focusing on specific applications or lacking interactive capability, OSWorld supports a broad range of open-ended tasks across different operating systems (OS), including Ubuntu, Windows, and macOS. This environment is designed to evaluate multimodal agents' ability to execute real-world computer tasks involving web and desktop applications, file operations, and workflows bridging multiple applications, thus overcoming the limitations of existing benchmarks.
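The interactive evaluation described above follows a familiar agent-environment loop: the agent receives an observation of the desktop (such as a screenshot and an accessibility tree) and emits low-level actions until the episode ends, after which the final machine state is checked. A minimal sketch of this pattern is shown below; all class, method, and field names here are hypothetical stand-ins for illustration, not OSWorld's actual API.

```python
# Sketch of the agent-environment interaction loop used for interactive
# evaluation. MockDesktopEnv stands in for a real VM-backed desktop;
# the names and action format are illustrative, not OSWorld's API.

class MockDesktopEnv:
    """Stand-in for a virtual-machine-backed desktop environment."""

    def __init__(self, task_config):
        self.task_config = task_config
        self.steps = 0

    def reset(self):
        """Restore the task's initial state and return the first observation."""
        self.steps = 0
        return {"screenshot": b"", "a11y_tree": "<root/>"}

    def step(self, action):
        """Apply one raw mouse/keyboard action; return (observation, done)."""
        self.steps += 1
        obs = {"screenshot": b"", "a11y_tree": "<root/>"}
        done = action["type"] == "DONE" or self.steps >= 15
        return obs, done

    def evaluate(self):
        """Execution-based check of the final machine state (stubbed here)."""
        return 1.0 if self.steps > 0 else 0.0


def scripted_agent():
    """Trivial agent emitting a fixed action sequence, then finishing."""
    yield {"type": "CLICK", "x": 100, "y": 200}
    yield {"type": "TYPE", "text": "hello"}
    yield {"type": "DONE"}


def run_episode(env):
    """Run one episode and return the execution-based score."""
    env.reset()
    for action in scripted_agent():
        _obs, done = env.step(action)
        if done:
            break
    return env.evaluate()
```

In the real environment the observation would be produced by the VM and the score by a task-specific evaluation script; the control flow, however, matches the loop sketched here.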

Technical Contributions and Environment Capabilities

OSWorld's architecture facilitates task setup, execution-based evaluation, and interactive learning in a realistic computer interaction context. The environment leverages virtual machine technology to ensure task scalability across OS and applications, supporting raw keyboard and mouse control actions. A notable contribution is the creation of a benchmark comprising 369 tasks reflecting real-world computer usage, encompassing a wide range of applications and domains. These tasks are meticulously designed to include starting state configurations and execution-based evaluation scripts, providing a robust framework for reproducible and reliable assessment of agent performance.

Evaluation of Modern LLM and VLM Agents

The paper reports an extensive evaluation of contemporary large language model (LLM) and vision-language model (VLM) agents within the OSWorld environment. Despite significant advancements in multimodal agent development, the evaluation reveals a considerable performance gap between human users and the best-performing models: while humans successfully complete over 72% of the tasks, the highest-performing model achieves only a 12.24% success rate. This discrepancy underscores current models' difficulties with GUI grounding, operational knowledge, and long-horizon tasks involving complex workflows and multiple applications.
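For perspective, a back-of-the-envelope calculation (assuming the reported rates apply uniformly across all 369 benchmark tasks) translates these percentages into task counts:

```python
# Rough translation of reported success rates into task counts,
# assuming the rates apply uniformly over the 369-task benchmark.
TOTAL_TASKS = 369
human_rate = 0.7236        # reported human success rate
best_model_rate = 0.1224   # reported best-model success rate

human_tasks = round(TOTAL_TASKS * human_rate)       # about 267 tasks
model_tasks = round(TOTAL_TASKS * best_model_rate)  # about 45 tasks
gap = human_rate - best_model_rate                  # about 60 percentage points
```

In other words, the best model solves roughly 45 of the tasks that humans can solve around 267 of, leaving a gap of about 60 percentage points.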

Insights and Implications for Future Research

The comprehensive analysis conducted using OSWorld reveals several insights into the development of multimodal generalist agents. There is a pronounced need for enhancing models' understanding of GUI elements and their operational functionalities across diverse software. The findings point towards the potential of refining agent architectures to improve their exploration and decision-making capabilities, emphasizing the importance of interactive learning and adaptation in real-world environments. Furthermore, the study speculates on advancing VLMs' capability for high-resolution image processing and precise action prediction, which are crucial for more robust and effective GUI interactions.

Conclusions and Future Directions

The introduction of OSWorld marks a significant step towards developing more capable and generalizable digital agents for autonomous computer interaction. By providing a rich, variable, and realistic benchmarking environment, OSWorld sets the stage for future breakthroughs in agent capabilities. The research underscores the necessity for continued advancements in multimodal agent technology, highlighting areas such as enhanced GUI understanding, long-horizon planning, and operational knowledge across various applications and domains. As the field progresses, OSWorld will serve as a valuable resource for assessing and guiding the development of the next generation of digital agents.
