Emergent Mind

Abstract

Autonomous agents that accomplish complex computer tasks with minimal human intervention have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.

OSWorld: A scalable environment for evaluating agents on open-ended computer tasks across operating systems.

Overview

  • OSWorld introduces a scalable real computer environment for training and evaluating autonomous digital agents capable of handling diverse tasks across different operating systems (Ubuntu, Windows, macOS).

  • It leverages virtual machine technology to support task scalability and provides a benchmark comprising 369 real-world tasks to assess agent performance comprehensively.

  • Extensive evaluation of large language model (LLM) and vision-language model (VLM) agents reveals a significant performance gap between these models and human users, highlighting challenges in GUI grounding and operational knowledge.

  • The research underscores the need to enhance agents' understanding of GUI elements and their decision-making capabilities, and suggests refining VLMs for more reliable GUI interactions.

Introducing OSWorld: A Real Computer Environment for Multimodal Agent Training and Evaluation

Overview of OSWorld

OSWorld is introduced as the first environment of its kind, offering a fully interactive and scalable real computer environment for the development and assessment of autonomous digital agents capable of handling diverse computer tasks. Unlike prior benchmarks focusing on specific applications or lacking interactive capability, OSWorld supports a broad range of open-ended tasks across different operating systems (OS), including Ubuntu, Windows, and macOS. This environment is designed to evaluate multimodal agents' ability to execute real-world computer tasks involving web and desktop applications, file operations, and workflows bridging multiple applications, thus overcoming the limitations of existing benchmarks.
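The interactive evaluation described above follows a familiar agent-environment loop: the agent receives an observation of the desktop (such as a screenshot and an accessibility tree) and emits low-level actions until the episode ends, after which the final machine state is checked. A minimal sketch of this pattern is shown below; all class, method, and field names here are hypothetical stand-ins for illustration, not OSWorld's actual API.

```python
# Sketch of the agent-environment interaction loop used for interactive
# evaluation. MockDesktopEnv stands in for a real VM-backed desktop;
# the names and action format are illustrative, not OSWorld's API.

class MockDesktopEnv:
    """Stand-in for a virtual-machine-backed desktop environment."""

    def __init__(self, task_config):
        self.task_config = task_config
        self.steps = 0

    def reset(self):
        """Restore the task's initial state and return the first observation."""
        self.steps = 0
        return {"screenshot": b"", "a11y_tree": "<root/>"}

    def step(self, action):
        """Apply one raw mouse/keyboard action; return (observation, done)."""
        self.steps += 1
        obs = {"screenshot": b"", "a11y_tree": "<root/>"}
        done = action["type"] == "DONE" or self.steps >= 15
        return obs, done

    def evaluate(self):
        """Execution-based check of the final machine state (stubbed here)."""
        return 1.0 if self.steps > 0 else 0.0


def scripted_agent():
    """Trivial agent emitting a fixed action sequence, then finishing."""
    yield {"type": "CLICK", "x": 100, "y": 200}
    yield {"type": "TYPE", "text": "hello"}
    yield {"type": "DONE"}


def run_episode(env):
    """Run one episode and return the execution-based score."""
    env.reset()
    for action in scripted_agent():
        _obs, done = env.step(action)
        if done:
            break
    return env.evaluate()
```

In the real environment the observation would be produced by the VM and the score by a task-specific evaluation script; the control flow, however, matches the loop sketched here.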

Technical Contributions and Environment Capabilities

OSWorld's architecture facilitates task setup, execution-based evaluation, and interactive learning in a realistic computer interaction context. The environment leverages virtual machine technology to ensure task scalability across OS and applications, supporting raw keyboard and mouse control actions. A notable contribution is the creation of a benchmark comprising 369 tasks reflecting real-world computer usage, encompassing a wide range of applications and domains. These tasks are meticulously designed to include starting state configurations and execution-based evaluation scripts, providing a robust framework for reproducible and reliable assessment of agent performance.

Evaluation of Modern LLM and VLM Agents

The paper reports an extensive evaluation of contemporary large language model (LLM) and vision-language model (VLM) agents within the OSWorld environment. Despite significant advancements in multimodal agent development, the evaluation reveals a considerable performance gap between human users and the best-performing models: while humans successfully complete over 72% of the tasks, the highest-performing model achieves only a 12.24% success rate. This discrepancy underscores current models' difficulties with GUI grounding, operational knowledge, and long-horizon tasks involving complex workflows and multiple applications.
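For perspective, a back-of-the-envelope calculation (assuming the reported rates apply uniformly across all 369 benchmark tasks) translates these percentages into task counts:

```python
# Rough translation of reported success rates into task counts,
# assuming the rates apply uniformly over the 369-task benchmark.
TOTAL_TASKS = 369
human_rate = 0.7236        # reported human success rate
best_model_rate = 0.1224   # reported best-model success rate

human_tasks = round(TOTAL_TASKS * human_rate)       # about 267 tasks
model_tasks = round(TOTAL_TASKS * best_model_rate)  # about 45 tasks
gap = human_rate - best_model_rate                  # about 60 percentage points
```

In other words, the best model solves roughly 45 of the tasks that humans can solve around 267 of, leaving a gap of about 60 percentage points.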

Insights and Implications for Future Research

The comprehensive analysis conducted using OSWorld reveals several insights into the development of multimodal generalist agents. There is a pronounced need for enhancing models' understanding of GUI elements and their operational functionalities across diverse software. The findings point towards the potential of refining agent architectures to improve their exploration and decision-making capabilities, emphasizing the importance of interactive learning and adaptation in real-world environments. Furthermore, the study speculates on advancing VLMs' capability for high-resolution image processing and precise action prediction, which are crucial for more robust and effective GUI interactions.

Conclusions and Future Directions

The introduction of OSWorld marks a significant step towards developing more capable and generalizable digital agents for autonomous computer interaction. By providing a rich, variable, and realistic benchmarking environment, OSWorld sets the stage for future breakthroughs in agent capabilities. The research underscores the necessity for continued advancements in multimodal agent technology, highlighting areas such as enhanced GUI understanding, long-horizon planning, and operational knowledge across various applications and domains. As the field progresses, OSWorld will serve as a valuable resource for assessing and guiding the development of the next generation of digital agents.
