
Automating the Enterprise with Foundation Models (2405.03710v1)

Published 3 May 2024 in cs.SE, cs.AI, and cs.LG

Abstract: Automating enterprise workflows could unlock $4 trillion/year in productivity gains. Despite being of interest to the data management community for decades, the ultimate vision of end-to-end workflow automation has remained elusive. Current solutions rely on process mining and robotic process automation (RPA), in which a bot is hard-coded to follow a set of predefined rules for completing a workflow. Through case studies of a hospital and large B2B enterprise, we find that the adoption of RPA has been inhibited by high set-up costs (12-18 months), unreliable execution (60% initial accuracy), and burdensome maintenance (requiring multiple FTEs). Multimodal foundation models (FMs) such as GPT-4 offer a promising new approach for end-to-end workflow automation given their generalized reasoning and planning abilities. To study these capabilities we propose ECLAIR, a system to automate enterprise workflows with minimal human supervision. We conduct initial experiments showing that multimodal FMs can address the limitations of traditional RPA with (1) near-human-level understanding of workflows (93% accuracy on a workflow understanding task) and (2) instant set-up with minimal technical barrier (based solely on a natural language description of a workflow, ECLAIR achieves end-to-end completion rates of 40%). We identify human-AI collaboration, validation, and self-improvement as open challenges, and suggest ways they can be solved with data management techniques. Code is available at: https://github.com/HazyResearch/eclair-agents


Summary

  • The paper presents a novel system, ECLAIR, that harnesses multimodal foundation models to automate enterprise workflows with minimal human oversight.
  • It shows that prompting GPT-4 with visual demonstrations (key frames) together with action logs yields accurate workflow documentation (precision of about 0.94 and recall of about 0.95 for the generated SOPs).
  • It reports that SOP-based guidance doubles execution rates on workflows sampled from WebArena, reaching a 40% end-to-end completion rate, as a step toward addressing RPA's brittleness and high maintenance costs.

Automating the Enterprise with Foundation Models

The paper "Automating the Enterprise with Foundation Models" (2405.03710) explores how multimodal foundation models (FMs) can automate enterprise workflows. The authors argue that the potential productivity gains are large, examine the limitations of traditional Robotic Process Automation (RPA), and propose ECLAIR (Enterprise Scale AI for Workflows), a system that uses multimodal FMs to automate enterprise workflows with minimal human supervision.

Introduction

Digital workflows are pervasive in the modern economy, and a significant proportion of jobs involve repetitive, manual digital tasks. These tasks, often peripheral to core job functions, represent a substantial automation opportunity, estimated at roughly $4 trillion in annual productivity gains. Current solutions rely mainly on RPA and have struggled with high set-up costs, brittle execution, and expensive maintenance that requires multiple full-time equivalents (FTEs). In many sectors, RPA's inflexible, rule-based nature makes it insufficient for end-to-end workflow automation.

Figure 1: Differences between ECLAIR and traditional RPA. ECLAIR uses FMs to learn expertise via video demonstrations (left), navigate GUIs given written documentation (center), and audit completed workflows (right).

Traditional RPA relies on process mining to encode human expertise into strict rules for software bots to follow. High set-up costs, poor adaptability, and the need for continuous maintenance have restricted its adoption and limited how far it scales across an enterprise.

Introduction to ECLAIR

The paper introduces ECLAIR, a system built on multimodal FMs such as GPT-4. Their high-level reasoning, planning, and visual understanding offer potential solutions to the limitations of conventional RPA systems.


Multimodal Foundation Models Overcoming RPA Limitations

The paper identifies several critical shortcomings in traditional RPA deployments which FMs are uniquely positioned to overcome:

  1. High Set-Up Costs: RPA requires manual, rule-based workflow scripting and significant set-up time, often over a year (12-18 months in the paper's case studies) and at substantial cost, as evidenced by a case study of a hospital deployment.
  2. Brittle Execution: RPA's reliance on static rules leads to frequent failures whenever input conditions vary, resulting in initial accuracy as low as 60% in case studies.
  3. Maintenance Demand: Keeping RPA bots working often requires continuous monitoring by multiple full-time equivalents (FTEs), which is burdensome for enterprises.
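
To make the brittleness point concrete, here is a minimal sketch of a hard-coded rule that breaks after a cosmetic UI change. It is illustrative only and is not taken from the paper: the `Element` type, the `hard_coded_click` helper, and the element ids are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical stand-in for a GUI element on a page."""
    element_id: str
    label: str


def hard_coded_click(page: list[Element], element_id: str) -> Element:
    """Rule-based step: succeeds only if the exact element id is present."""
    for element in page:
        if element.element_id == element_id:
            return element
    raise LookupError(f"no element with id {element_id!r}")


def rule_succeeds(page: list[Element]) -> bool:
    """Run the static rule and report whether it found its target."""
    try:
        hard_coded_click(page, "btn-submit")
        return True
    except LookupError:
        return False


# Same button and label, but the id was renamed in a UI update.
page_v1 = [Element("btn-submit", "Submit claim")]
page_v2 = [Element("btn-submit-v2", "Submit claim")]

print(rule_succeeds(page_v1))  # True
print(rule_succeeds(page_v2))  # False
```

A human would still recognize the "Submit claim" button on the second page. The static rule cannot, which is the kind of failure that forces ongoing maintenance.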

Demonstrating and Learning from Workflows

ECLAIR aims to approximate human-level workflow understanding with multimodal FMs, reducing the set-up costs associated with traditional RPA.

Experimentation in Workflow Understanding

The paper evaluates GPT-4's ability to infer and document workflow steps as a standard operating procedure (SOP) from multimodal inputs, including workflow descriptions, visual demonstrations, and annotations. As the results in Table 1 show, prompting GPT-4 with both key frames and action logs yields accurate and complete SOPs (precision of 0.94, recall of 0.95, and SOP correctness of 0.93). Remaining errors include hallucinated steps and action descriptions that are not grounded to the correct GUI element. These early results are promising, and models that can process video inputs directly may provide further improvements.
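
As a rough illustration of how step-level precision and recall can be computed for a generated SOP, the sketch below compares generated steps against reference steps. Matching steps by normalized exact string is a simplifying assumption made here, not the paper's evaluation protocol; the function name and the example steps are hypothetical.

```python
def step_precision_recall(generated: list[str], reference: list[str]) -> tuple[float, float]:
    """Compare generated SOP steps with reference steps.

    Assumption: two steps match when their normalized text is identical.
    Precision penalizes hallucinated steps; recall penalizes missing ones.
    """
    gen = {step.strip().lower() for step in generated}
    ref = {step.strip().lower() for step in reference}
    matched = gen & ref
    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall


reference_sop = [
    "Open the orders page",
    "Click 'Create new order'",
    "Enter the customer name",
    "Click 'Save'",
]
generated_sop = [
    "Open the orders page",
    "Click 'Create new order'",
    "Enter the customer name",
    "Click 'Export'",  # hallucinated step; 'Click Save' is missing
]

precision, recall = step_precision_recall(generated_sop, reference_sop)
print(precision, recall)  # 0.75 0.75
```

In practice, step matching would need to tolerate paraphrase, so a human or model judge would replace the exact-match rule used here.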

Execution and Workflow Automation

In the execution phase, multimodal FMs showed mixed performance at predicting the next action and grounding it in a concrete GUI command.

Action Suggestion and Grounding

Providing SOP-based guidance doubled execution rates for workflows sampled from WebArena, with GPT-4 achieving 0.92 next-action suggestion accuracy and a 40% end-to-end completion rate. Translating high-level action suggestions into precise GUI interactions remains difficult (Table 2). Grounding can be improved by integrating bounding-box detection with models such as YOLO, which have been shown to improve grounding accuracy substantially.

Figure 2: ECLAIR can automate entirely new categories of workflows, including those that require complex decision-making or are knowledge-intensive. Examples show real-world hospital workflows.
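
The grounding step discussed in this section, mapping a suggested action onto a detected GUI element, can be sketched as follows. This is a minimal illustration under stated assumptions, not ECLAIR's implementation: the `Box` type, the token-overlap scoring, and the example labels are hypothetical, and the boxes are assumed to come from some upstream detector (for example, a YOLO-style model) that is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical detected GUI element: a text label plus a pixel bounding box."""
    label: str
    x: int
    y: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def ground_action(target: str, boxes: list[Box]) -> Optional[tuple[int, int]]:
    """Return click coordinates for the box whose label best matches the target."""
    if not boxes:
        return None
    best = max(boxes, key=lambda box: token_overlap(target, box.label))
    if token_overlap(target, best.label) == 0.0:
        return None  # nothing on screen resembles the suggested target
    return best.center()


detected = [
    Box("Search orders", 10, 10, 200, 40),
    Box("Create new order", 10, 60, 200, 40),
]

print(ground_action("new order button", detected))  # (110, 80)
print(ground_action("logout", detected))            # None
```

Restricting the choice to detected boxes means the model selects among real on-screen elements instead of predicting raw coordinates, which is one plausible reason bounding-box detection helps grounding.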

Validation and Self-Improvement

Reliable deployment requires validation at both the individual-step and whole-workflow levels. Step-level validation remains difficult because discrete interface interactions are hard to detect, but FMs have been effective at judging step-by-step correctness and overall workflow completion (Table 3). A database of integrity constraints, together with techniques from self-driving databases, could further improve ECLAIR's capacity for error correction and self-monitoring.
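
One way to picture workflow-level validation with integrity constraints is a set of named predicates checked against the final state of a completed workflow. The sketch below is an assumption-laden illustration, not the paper's design: the constraint names, the state fields, and the `validate_workflow` helper are all hypothetical.

```python
from typing import Callable

# A constraint is a predicate over the workflow's final state.
Constraint = Callable[[dict], bool]


def validate_workflow(final_state: dict, constraints: dict[str, Constraint]) -> list[str]:
    """Return the names of all constraints that the final state violates."""
    return [name for name, check in constraints.items() if not check(final_state)]


# Hypothetical constraints for an invoice-submission workflow.
invoice_constraints: dict[str, Constraint] = {
    "invoice_has_id": lambda state: bool(state.get("invoice_id")),
    "amount_is_positive": lambda state: state.get("amount", 0) > 0,
    "status_is_submitted": lambda state: state.get("status") == "submitted",
}

completed = {"invoice_id": "INV-1", "amount": 120.0, "status": "submitted"}
incomplete = {"invoice_id": "", "amount": 120.0, "status": "draft"}

print(validate_workflow(completed, invoice_constraints))   # []
print(validate_workflow(incomplete, invoice_constraints))  # ['invoice_has_id', 'status_is_submitted']
```

A non-empty result could be used to trigger a retry or a hand-off to a human reviewer, keeping the validation logic separate from the agent that executed the workflow.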

Conclusions

With further development, ECLAIR points toward end-to-end workflow automation that avoids the set-up costs, brittle execution, and maintenance burden of RPA. Future research should investigate combining multiple specialized agents to handle complex, nuanced workflows, and accumulating knowledge over time so that automation works at enterprise scale. Improving GUI element grounding and the self-monitoring of multimodal FMs are pivotal areas for further development.
