Abstract

Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating models on business process management (BPM) tasks. BPM is the practice of documenting, measuring, improving, and automating enterprise workflows. However, research has focused almost exclusively on one task - full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4. This focus on automation ignores the reality of how most BPM tools are applied today - simply documenting the relevant workflow takes 60% of the time of the typical process optimization project. To address this gap we present WONDERBREAD, the first benchmark for evaluating multimodal FMs on BPM tasks beyond automation. Our contributions are: (1) a dataset containing 2928 documented workflow demonstrations; (2) 6 novel BPM tasks sourced from real-world applications ranging from workflow documentation to knowledge transfer to process improvement; and (3) an automated evaluation harness. Our benchmark shows that while state-of-the-art FMs can automatically generate documentation (e.g. recalling 88% of the steps taken in a video demonstration of a workflow), they struggle to re-apply that knowledge towards finer-grained validation of workflow completion (F1 < 0.3). We hope WONDERBREAD encourages the development of more "human-centered" AI tooling for enterprise applications and furthers the exploration of multimodal FMs for the broader universe of BPM tasks. We publish our dataset and experiments here: https://github.com/HazyResearch/wonderbread

Figure: three-part overview of the study, pairing human demonstrations of web workflows with BPM tasks and automated evaluation pipelines.

Overview

  • The paper introduces a benchmark and dataset named Wonderbread, designed to evaluate multimodal foundation models (FMs) on various business process management (BPM) tasks, emphasizing documentation, knowledge transfer, and process improvement.

  • Wonderbread comprises 2,928 human demonstrations across 598 workflows, enriched with detailed annotations such as screen recordings, action traces, and standard operating procedures (SOPs).

  • Key findings show that state-of-the-art FMs like GPT-4 handle documentation tasks such as SOP generation and demo segmentation well, but struggle with finer-grained tasks like step-level demonstration validation and SOP ranking, motivating future research directions.

Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management Tasks

The paper "Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management Tasks" introduces a benchmark and dataset called Wonderbread designed to evaluate the performance of multimodal foundation models (FMs) on various business process management (BPM) tasks. Emphasizing tasks beyond mere automation, the paper highlights the significance of documentation, knowledge transfer, and process improvement in enterprise workflows. The dataset comprises 2,928 human demonstrations across 598 workflows, enriched with detailed annotations including screen recordings, action traces, standard operating procedures (SOPs), and ranked demonstration quality.

Contributions

  1. Dataset: The authors curate a comprehensive dataset featuring 2,928 demonstrations across 598 workflows. Each workflow includes a detailed SOP, video recording, action trace, and steps annotated by human demonstrators. This dataset addresses the gap in existing ML benchmarks by providing extensive annotations necessary for BPM tasks.
  2. Tasks: Six distinct tasks, grouped into three categories, evaluate the ability of multimodal FMs to handle BPM work:
  • Documentation: SOP generation and demonstration segmentation.
  • Knowledge Transfer: Question answering and demonstration validation.
  • Improvement: SOP ranking and SOP improvement.
  3. Evaluation: An automated evaluation pipeline utilizing both programmatic metrics (e.g., F1, accuracy) and LLM-based evaluations that correlate with human raters ($\rho>0.8$); a minimal sketch of such a correlation check follows this list. This dual approach ensures a robust and transparent assessment.
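The summary does not specify how the reported $\rho$ is computed, so the snippet below is only a hypothetical illustration of how agreement between an LLM judge and human raters can be checked; Spearman's rank correlation is one common choice.

```python
from scipy.stats import spearmanr

def judge_agreement(llm_scores: list[float], human_scores: list[float]) -> float:
    """Rank correlation between an LLM judge and human raters on the same outputs."""
    rho, _pvalue = spearmanr(llm_scores, human_scores)
    return rho

# Hypothetical 1-5 quality ratings for five model outputs:
print(judge_agreement([4, 2, 5, 3, 1], [5, 2, 4, 3, 1]))  # -> 0.9 (strong agreement)
```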

Key Findings

Through extensive experimentation with state-of-the-art FMs (GPT-4, Claude 3, and Gemini Pro), the study characterizes their performance across BPM tasks; a toy sketch of the metrics used appears after this list. The key findings:

  • SOP Generation: GPT-4 achieves the best performance with an F1 score of 0.82, showcasing its capability to generate accurate documentation. However, there is a tendency to hallucinate steps, indicating room for improvement.
  • Demo Segmentation: While segmenting concatenated demonstration recordings remains challenging, GPT-4 achieves a promising adjusted Rand index of 0.88. The primary errors occur around transitions between workflows.
  • Question Answering: Multimodal FMs perform well on compactness and clarity but score lower on completeness; GPT-4 attains the highest average score across the four evaluation axes.
  • Demo Validation: FMs can accurately determine workflow completion (F1 of 0.90) but struggle with step-level validation against SOPs (F1 of 0.27).
  • SOP Improvement and Ranking: Current models show minimal ability to align SOP rankings with human judgment, as evidenced by a mean Kendall's τ of just 0.05.
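These headline numbers correspond to standard, reproducible metrics. The toy example below shows how each could be computed with scikit-learn and SciPy; the labels are invented, and the benchmark's real harness first matches model outputs to gold annotations before scoring.

```python
from scipy.stats import kendalltau
from sklearn.metrics import adjusted_rand_score, f1_score

# Step-level demo validation: per-SOP-step completed/not-completed labels.
gold_step_ok = [1, 1, 0, 1, 0, 1]   # human judgment for each step
pred_step_ok = [1, 0, 0, 1, 1, 1]   # model judgment for each step
print(f1_score(gold_step_ok, pred_step_ok))

# Demo segmentation: assign each frame/action to a workflow, then compare
# the model's grouping to the true grouping with the adjusted Rand index.
gold_segments = [0, 0, 0, 1, 1, 2, 2]
pred_segments = [0, 0, 1, 1, 1, 2, 2]
print(adjusted_rand_score(gold_segments, pred_segments))

# SOP ranking: rank agreement between model and human quality orderings.
human_rank = [1, 2, 3, 4, 5]
model_rank = [2, 1, 3, 5, 4]
tau, _pvalue = kendalltau(human_rank, model_rank)
print(tau)  # the paper reports a mean Kendall correlation of only 0.05
```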

Implications

Practical Implications: The dataset and benchmark provide a valuable resource for developing more effective human-centered AI tools for BPM. The strong performance on documentation tasks suggests that FMs can meaningfully reduce the manual effort of creating workflow documentation, which, per the paper, consumes some 60% of a typical process-optimization project.

Theoretical Implications: The challenges identified in workflow segmentation and step-level validation highlight areas needing further research, specifically in improving the fine-grained understanding of workflows. These findings motivate future explorations into expanding context windows in multimodal models and refining lower-level process understanding through supervised fine-tuning.

Future Developments: Future research directions include enhancing human-model alignment through reinforcement learning from human feedback and supervised fine-tuning on larger, more diverse datasets. Evaluating open-source multimodal models on BPM tasks also merits exploration. Finally, the societal implications of such technologies, particularly their impact on labor markets and the importance of augmenting rather than replacing human roles, remain crucial to examine.

Conclusion

The Wonderbread benchmark comprehensively addresses the limitations of existing BPM evaluations, offering a nuanced and detailed approach to assessing multimodal FMs on critical enterprise tasks. The baseline results underscore the current strengths and weaknesses of these models, providing a foundation for future advancements in the field. The paper's contributions are poised to drive significant progress in the development of AI tools that support and enhance human capabilities within enterprise workflows, underscoring the need for a balanced focus on documentation, knowledge transfer, and process improvement alongside automation.
