Abstract

Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating models on business process management (BPM) tasks. BPM is the practice of documenting, measuring, improving, and automating enterprise workflows. However, research has focused almost exclusively on one task - full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4. This focus on automation ignores the reality of how most BPM tools are applied today - simply documenting the relevant workflow takes 60% of the time of the typical process optimization project. To address this gap we present WONDERBREAD, the first benchmark for evaluating multimodal FMs on BPM tasks beyond automation. Our contributions are: (1) a dataset containing 2928 documented workflow demonstrations; (2) 6 novel BPM tasks sourced from real-world applications ranging from workflow documentation to knowledge transfer to process improvement; and (3) an automated evaluation harness. Our benchmark shows that while state-of-the-art FMs can automatically generate documentation (e.g. recalling 88% of the steps taken in a video demonstration of a workflow), they struggle to re-apply that knowledge towards finer-grained validation of workflow completion (F1 < 0.3). We hope WONDERBREAD encourages the development of more "human-centered" AI tooling for enterprise applications and furthers the exploration of multimodal FMs for the broader universe of BPM tasks. We publish our dataset and experiments here: https://github.com/HazyResearch/wonderbread

Figure: three-part overview of the study, pairing human demonstrations of web workflows with BPM tasks and automated evaluation pipelines.

Overview

  • The paper introduces a benchmark and dataset named Wonderbread, designed to evaluate multimodal foundation models (FMs) on various business process management (BPM) tasks, emphasizing documentation, knowledge transfer, and process improvement.

  • Wonderbread comprises 2,928 human demonstrations across 598 workflows, enriched with detailed annotations such as screen recordings, action traces, and standard operating procedures (SOPs).

  • Key findings show that state-of-the-art FMs like GPT-4 handle documentation tasks such as SOP generation and demo segmentation well, but struggle with finer-grained tasks like step-level demonstration validation and SOP ranking, motivating future research directions.

Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management Tasks

The paper "Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management Tasks" introduces a benchmark and dataset called Wonderbread designed to evaluate the performance of multimodal foundation models (FMs) on various business process management (BPM) tasks. Emphasizing tasks beyond mere automation, the paper highlights the significance of documentation, knowledge transfer, and process improvement in enterprise workflows. The dataset comprises 2,928 human demonstrations across 598 workflows, enriched with detailed annotations including screen recordings, action traces, standard operating procedures (SOPs), and ranked demonstration quality.

Contributions

  1. Dataset: The authors curate a comprehensive dataset featuring 2,928 demonstrations across 598 workflows. Each workflow includes a detailed SOP, video recording, action trace, and steps annotated by human demonstrators. This dataset addresses the gap in existing ML benchmarks by providing extensive annotations necessary for BPM tasks.
  2. Tasks: Six distinct tasks, grouped into three categories, evaluate the ability of multimodal FMs to handle BPM work:
  • Documentation: SOP generation and demonstration segmentation.
  • Knowledge Transfer: Question answering and demonstration validation.
  • Improvement: SOP ranking and SOP improvement.
  3. Evaluation: An automated evaluation pipeline utilizing both programmatic metrics (e.g., F1, accuracy) and LLM-based evaluations that correlate with human raters ($\rho>0.8$); a minimal sketch of such a correlation check follows this list. This dual approach ensures a robust and transparent assessment.
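The summary does not specify how the reported $\rho$ is computed, so the snippet below is only a hypothetical illustration of how agreement between an LLM judge and human raters can be checked; Spearman's rank correlation is one common choice.

```python
from scipy.stats import spearmanr

def judge_agreement(llm_scores: list[float], human_scores: list[float]) -> float:
    """Rank correlation between an LLM judge and human raters on the same outputs."""
    rho, _pvalue = spearmanr(llm_scores, human_scores)
    return rho

# Hypothetical 1-5 quality ratings for five model outputs:
print(judge_agreement([4, 2, 5, 3, 1], [5, 2, 4, 3, 1]))  # -> 0.9 (strong agreement)
```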

Key Findings

Through extensive experimentation with state-of-the-art FMs (GPT-4, Claude 3, and Gemini Pro), the study characterizes their performance across BPM tasks; a toy sketch of the metrics used appears after this list. The key findings:

  • SOP Generation: GPT-4 achieves the best performance with an F1 score of 0.82, showcasing its capability to generate accurate documentation. However, there is a tendency to hallucinate steps, indicating room for improvement.
  • Demo Segmentation: While segmenting concatenated demonstration recordings remains challenging, GPT-4 achieves a promising adjusted Rand index of 0.88. The primary errors occur around transitions between workflows.
  • Question Answering: Multimodal FMs perform well on compactness and clarity but score lower on completeness; GPT-4 attains the highest average score across the four evaluation axes.
  • Demo Validation: FMs can accurately determine workflow completion (F1 of 0.90) but struggle with step-level validation against SOPs (F1 of 0.27).
  • SOP Improvement and Ranking: Current models show minimal ability to align SOP rankings with human judgment, as evidenced by a mean Kendall's τ of just 0.05.
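These headline numbers correspond to standard, reproducible metrics. The toy example below shows how each could be computed with scikit-learn and SciPy; the labels are invented, and the benchmark's real harness first matches model outputs to gold annotations before scoring.

```python
from scipy.stats import kendalltau
from sklearn.metrics import adjusted_rand_score, f1_score

# Step-level demo validation: per-SOP-step completed/not-completed labels.
gold_step_ok = [1, 1, 0, 1, 0, 1]   # human judgment for each step
pred_step_ok = [1, 0, 0, 1, 1, 1]   # model judgment for each step
print(f1_score(gold_step_ok, pred_step_ok))

# Demo segmentation: assign each frame/action to a workflow, then compare
# the model's grouping to the true grouping with the adjusted Rand index.
gold_segments = [0, 0, 0, 1, 1, 2, 2]
pred_segments = [0, 0, 1, 1, 1, 2, 2]
print(adjusted_rand_score(gold_segments, pred_segments))

# SOP ranking: rank agreement between model and human quality orderings.
human_rank = [1, 2, 3, 4, 5]
model_rank = [2, 1, 3, 5, 4]
tau, _pvalue = kendalltau(human_rank, model_rank)
print(tau)  # the paper reports a mean Kendall correlation of only 0.05
```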

Implications

Practical Implications: The dataset and benchmark provide a valuable resource for developing more effective human-centered AI tools for BPM. The strong performance on documentation tasks suggests that FMs can meaningfully reduce the manual effort of creating workflow documentation, which, per the paper, consumes some 60% of a typical process-optimization project.

Theoretical Implications: The challenges identified in workflow segmentation and step-level validation highlight areas needing further research, specifically in improving the fine-grained understanding of workflows. These findings motivate future explorations into expanding context windows in multimodal models and refining lower-level process understanding through supervised fine-tuning.

Future Developments: Future research directions include enhancing human-model alignment through reinforcement learning from human feedback and supervised fine-tuning on larger, more diverse datasets. Evaluating open-source multimodal models on BPM tasks also merits exploration. Finally, the societal implications of such technologies, particularly their impact on labor markets and the importance of augmenting rather than replacing human roles, remain crucial to examine.

Conclusion

The Wonderbread benchmark comprehensively addresses the limitations of existing BPM evaluations, offering a nuanced and detailed approach to assessing multimodal FMs on critical enterprise tasks. The baseline results underscore the current strengths and weaknesses of these models, providing a foundation for future advancements in the field. The paper's contributions are poised to drive significant progress in the development of AI tools that support and enhance human capabilities within enterprise workflows, underscoring the need for a balanced focus on documentation, knowledge transfer, and process improvement alongside automation.
