WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? (2403.07718v5)
Abstract: We study the use of LLM-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.