OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web (2402.17553v3)
Abstract: For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, a first-of-its-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer-horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a screen image paired with a visually grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline LLM agents on our benchmark. The strongest baseline, GPT-4, performs best on our benchmark; however, it still reaches only 15% of human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of LLM agents in automating computer tasks and motivates future work towards building multimodal models that bridge LLMs and the visual grounding of computer screens.
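To make the task format concrete, the sketch below illustrates what an executable script for tasks like the ones quoted above might look like, assuming a PyAutoGUI-style action space (PyAutoGUI is the GUI-automation library the paper references). The screen coordinates and UI element positions are hypothetical placeholders; in OmniACT they would be grounded in the accompanying screenshot.

```python
# Hypothetical sketch of executable scripts, assuming a PyAutoGUI-style
# action space. Coordinates are illustrative only; in OmniACT they are
# grounded in the provided screen image.
import pyautogui

# Fundamental task: "Play the next song"
pyautogui.click(x=1250, y=780)  # click the "Next track" button in the media controls

# Longer-horizon task: "Send an email to John Doe mentioning the time and place to meet"
pyautogui.click(x=120, y=640)                            # click the "Compose" button
pyautogui.write("john.doe@example.com", interval=0.05)   # fill the "To" field
pyautogui.press("tab")                                   # move to the subject field
pyautogui.write("Meeting details", interval=0.05)
pyautogui.press("tab")                                   # move to the message body
pyautogui.write("Let's meet at 3 pm at the main office.", interval=0.05)
pyautogui.hotkey("ctrl", "enter")                        # send the email
```

Scoring such scripts requires checking both that the action sequence is valid and that each action lands on the correct, visually grounded screen location, which is why purely text-based agents struggle on this benchmark.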