Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (2407.13032v1)

Published 17 Jul 2024 in cs.AI

Abstract: AI Agents are changing the way work gets done, both in consumer and enterprise domains. However, the design patterns and architectures to build highly capable agents or multi-agent systems are still developing, and the understanding of the implication of various design choices and algorithms is still evolving. In this paper, we present our work on building a novel web agent, Agent-E (our code is available at https://github.com/EmergenceAI/Agent-E). Agent-E introduces numerous architectural improvements over prior state-of-the-art web agents such as hierarchical architecture, flexible DOM distillation and denoising method, and the concept of change observation to guide the agent towards more accurate performance. We first present the results of an evaluation of Agent-E on WebVoyager benchmark dataset and show that Agent-E beats other SOTA text and multi-modal web agents on this benchmark in most categories by 10-30%. We then synthesize our learnings from the development of Agent-E into general design principles for developing agentic systems. These include the use of domain-specific primitive skills, the importance of distillation and de-noising of environmental observations, the advantages of a hierarchical architecture, and the role of agentic self-improvement to enhance agent efficiency and efficacy as the agent gathers experience.

Summary

The paper introduces a novel hierarchical framework that integrates planning and browser navigation to enhance autonomous web agent performance.
The paper demonstrates significant performance improvements, achieving a 73.2% success rate on the WebVoyager benchmark with robust error detection and recovery.
The paper distills eight foundational design principles offering actionable guidance for building reliable, agentic systems across diverse applications.

Agent-E: Hierarchical Architectures and Foundational Design Principles for Autonomous Web Agents

Introduction

Agent-E is an autonomous web agent that introduces a hierarchical architecture, flexible DOM distillation, and the concept of change observation. It is evaluated on the WebVoyager benchmark, where it improves on prior state-of-the-art agents in both text-only and multi-modal settings. Beyond the architectural and algorithmic details of Agent-E, the paper distills a set of generalizable design principles for agentic systems, with implications extending beyond web automation.

Figure 1: Simplified anatomy of web agents, highlighting the separation between sensing and acting components.

System Architecture

Agent-E is structured around two primary LLM-powered agents: the Planner Agent and the Browser Navigation Agent. This hierarchical decomposition enables clear separation of concerns: the Planner Agent is responsible for decomposing user tasks into sub-tasks and orchestrating their execution, while the Browser Navigation Agent executes these sub-tasks by interacting with the web environment through a set of primitive skills.
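The division of labor described above can be sketched in a few lines of Python. This is a minimal illustration, not Agent-E's implementation: the `Planner` class, `toy_plan`, and `toy_navigator` are hypothetical names, and both agents are replaced by plain functions where the real system would make LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the two LLM-powered agents.
PlanFn = Callable[[str], List[str]]   # user task -> ordered sub-tasks
ExecuteFn = Callable[[str], str]      # sub-task -> textual result


@dataclass
class Planner:
    """Decomposes a task and delegates each sub-task to an executor."""
    plan: PlanFn
    execute: ExecuteFn
    transcript: List[Tuple[str, str]] = field(default_factory=list)

    def run(self, task: str) -> List[Tuple[str, str]]:
        for sub_task in self.plan(task):
            result = self.execute(sub_task)
            self.transcript.append((sub_task, result))
        return self.transcript


def toy_plan(task: str) -> List[str]:
    # A real planner would ask an LLM; splitting on " then " is illustrative only.
    return [step.strip() for step in task.split(" then ")]


def toy_navigator(sub_task: str) -> str:
    # A real navigation agent would drive a browser through primitive skills.
    return f"done: {sub_task}"


planner = Planner(plan=toy_plan, execute=toy_navigator)
transcript = planner.run("open example.com then search for laptops")
```

The planner never touches the browser in this sketch. It only sees the textual result of each sub-task, which is what makes the two layers independently replaceable.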

Figure 2: A high-level architecture of Agent-E, showing the interaction between planner, browser navigation agent, and their respective skill executors.

The system uses the AutoGen framework for multi-agent orchestration and Playwright for browser automation. Each agent is equipped with a set of Python-based skills, which are exposed to the LLM via function calling. Notably, Agent-E does not distinguish between sensing and acting skills at the interface level, allowing for flexible composition.
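To make "skills exposed via function calling" concrete, the sketch below registers plain Python functions and derives a JSON-schema-style description for each one. The registry, the `skill` decorator, `describe`, and `dispatch` are hypothetical and do not reproduce the AutoGen or Agent-E APIs; the two toy skills return canned strings instead of driving a browser.

```python
import inspect
from typing import Callable, Dict, get_type_hints

# Hypothetical skill registry (an assumption for illustration).
SKILLS: Dict[str, Callable] = {}
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def skill(fn: Callable) -> Callable:
    """Register a function so it can be offered to the LLM as a tool."""
    SKILLS[fn.__name__] = fn
    return fn


def describe(fn: Callable) -> dict:
    """Build a function-calling style description from the signature."""
    hints = get_type_hints(fn)
    hints.pop("return", None)
    params = inspect.signature(fn).parameters
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {
            "type": "object",
            "properties": {
                name: {"type": _JSON_TYPES.get(hints.get(name), "string")}
                for name in params
            },
            "required": [
                name for name, p in params.items()
                if p.default is inspect.Parameter.empty
            ],
        },
    }


@skill
def click(mmid: int) -> str:
    """Click the element with the given mmid (acting skill)."""
    return f"clicked element {mmid}"


@skill
def get_url() -> str:
    """Return the URL of the current page (sensing skill)."""
    return "https://example.com"


def dispatch(call: dict) -> str:
    """Execute a tool call of the form {"name": ..., "arguments": {...}}."""
    return SKILLS[call["name"]](**call.get("arguments", {}))
```

Sensing and acting skills share the same registry and the same calling convention here, mirroring the point that the interface does not distinguish between them.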

Figure 3: Conceptual flow diagram of Agent-E, illustrating the delegation of sub-tasks and the use of a single LLM call to perform multiple functions.

Skills and DOM Distillation

A key innovation in Agent-E is the flexible approach to DOM distillation. The Browser Navigation Agent can select among multiple DOM representations—text_only, input_fields, and all_fields—depending on the requirements of the current sub-task. This adaptability is critical for managing the large and noisy DOMs typical of modern web pages, optimizing both the relevance and the size of the context provided to the LLM.
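The idea of selecting a DOM representation per sub-task can be illustrated with a toy tree. The three mode names come from the text above; what each mode returns, the node shape (`tag`, `text`, `children`), and the tag sets are assumptions made for this sketch rather than Agent-E's actual behavior.

```python
from typing import Any, Dict, Iterator, Optional

Node = Dict[str, Any]  # assumed shape: {"tag": str, "text": str, "children": [Node]}
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}  # assumption
NOISE_TAGS = {"script", "style", "noscript"}                       # assumption


def _walk(node: Node) -> Iterator[Node]:
    """Yield nodes in document order."""
    yield node
    for child in node.get("children", []):
        yield from _walk(child)


def _prune(node: Node) -> Optional[Node]:
    """Drop noise subtrees and empty fields, keeping the nesting."""
    if node["tag"] in NOISE_TAGS:
        return None
    kept = [c for c in map(_prune, node.get("children", [])) if c is not None]
    out: Node = {"tag": node["tag"]}
    if node.get("text"):
        out["text"] = node["text"]
    if kept:
        out["children"] = kept
    return out


def distill(root: Node, content_type: str):
    if content_type == "text_only":
        # Visible text only: cheapest payload, enough for reading tasks.
        return " ".join(
            n["text"] for n in _walk(root)
            if n.get("text") and n["tag"] not in NOISE_TAGS
        )
    if content_type == "input_fields":
        # Flat list of elements the agent could interact with.
        return [
            {"tag": n["tag"], "text": n.get("text", "")}
            for n in _walk(root) if n["tag"] in INTERACTIVE_TAGS
        ]
    if content_type == "all_fields":
        # Nearly everything, minus noise, with structure preserved.
        return _prune(root)
    raise ValueError(f"unknown content_type: {content_type}")


page = {"tag": "body", "children": [
    {"tag": "h1", "text": "Shop"},
    {"tag": "script", "text": "track()"},
    {"tag": "form", "children": [
        {"tag": "input", "text": ""},
        {"tag": "button", "text": "Search"},
    ]},
]}
```

On this toy page, `text_only` yields a short string, `input_fields` a two-item list, and `all_fields` the full pruned tree, so the payload size grows with the amount of structure the sub-task needs.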

Figure 4: Skills registered to the Browser Navigation Agent, enabling both sensing and acting on the web page.

Agent-E also introduces a custom identifier (mmid) for DOM elements, facilitating robust element selection and interaction. The DOM de-noising process preserves parent-child relationships where relevant, in contrast to the flat encodings used in prior work.
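A rough sketch of the two ideas in this paragraph follows: tagging every element with an identifier, then de-noising without flattening the tree. The sequential integer numbering, the `assign_mmids` and `collapse_wrappers` helpers, and the rule for which wrappers to collapse are all assumptions for illustration; the source only states that an `mmid` identifier exists and that parent-child relationships are preserved where relevant.

```python
from itertools import count
from typing import Any, Dict

Node = Dict[str, Any]  # assumed shape: {"tag": str, "text": str, "children": [Node]}
WRAPPER_TAGS = {"div", "span"}  # assumption: tags treated as layout-only wrappers


def assign_mmids(root: Node) -> Dict[int, Node]:
    """Tag every node in place with a sequential mmid; return an mmid index."""
    counter = count(1)
    index: Dict[int, Node] = {}

    def visit(node: Node) -> None:
        mmid = next(counter)
        node["mmid"] = mmid
        index[mmid] = node
        for child in node.get("children", []):
            visit(child)

    visit(root)
    return index


def collapse_wrappers(node: Node) -> Node:
    """Remove text-less single-child wrappers while keeping real nesting."""
    children = [collapse_wrappers(c) for c in node.get("children", [])]
    if node["tag"] in WRAPPER_TAGS and not node.get("text") and len(children) == 1:
        return children[0]
    out = {key: value for key, value in node.items() if key != "children"}
    if children:
        out["children"] = children
    return out


tree = {"tag": "form", "children": [
    {"tag": "div", "children": [{"tag": "input", "text": ""}]},
    {"tag": "button", "text": "Go"},
]}
index = assign_mmids(tree)
clean = collapse_wrappers(tree)
```

Because identifiers are assigned before de-noising, an `mmid` seen by the LLM in the cleaned tree still resolves to the original element through `index`, even though the wrapper around it has been dropped.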

Change Observation and Error Awareness

Agent-E implements "change observation," wherein each action skill not only executes the intended operation but also observes and reports on the resulting state change. This is achieved via the MutationObserver Web API and attribute monitoring, providing immediate, structured feedback to the LLM. This mechanism is conceptually related to the Reflexion paradigm but is not limited to post-failure analysis; it provides continuous, action-level feedback, improving grounding and reducing error propagation.
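The pattern can be approximated in pure Python by snapshotting state before and after an action and describing the difference in words. This is only a stand-in: the real mechanism observes live DOM mutations in the browser, whereas the sketch below assumes a hypothetical page state modeled as a flat dictionary, and the decorator and message wording are invented for illustration.

```python
from typing import Callable, Dict, List

PageState = Dict[str, str]  # assumption: element name -> visible value


def diff_states(before: PageState, after: PageState) -> List[str]:
    """Describe additions, removals, and modifications in plain language."""
    changes = []
    for key in sorted(after.keys() - before.keys()):
        changes.append(f"'{key}' appeared with value '{after[key]}'")
    for key in sorted(before.keys() - after.keys()):
        changes.append(f"'{key}' disappeared")
    for key in sorted(before.keys() & after.keys()):
        if before[key] != after[key]:
            changes.append(
                f"'{key}' changed from '{before[key]}' to '{after[key]}'"
            )
    return changes


def with_change_observation(action: Callable[..., str]) -> Callable[..., str]:
    """Wrap an action skill so its return value also reports what changed."""
    def wrapped(state: PageState, *args) -> str:
        before = dict(state)              # snapshot before acting
        outcome = action(state, *args)    # the action mutates state in place
        changes = diff_states(before, state)
        if changes:
            return f"{outcome} As a result: " + "; ".join(changes) + "."
        return f"{outcome} No change was observed on the page."
    return wrapped


@with_change_observation
def click_menu(state: PageState) -> str:
    state["menu"] = "expanded"
    state["submenu"] = "Laptops, Phones"
    return "Clicked the menu."


@with_change_observation
def click_disabled_button(state: PageState) -> str:
    return "Clicked the button."


feedback = click_menu({"menu": "collapsed"})
no_op_feedback = click_disabled_button({"menu": "collapsed"})
```

The second call shows why this matters: a click that silently does nothing is reported as such immediately, rather than being discovered several steps later.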

Figure 5: Example of Agent-E execution, showing planner-browser navigation agent communication for a complex user task.

Figure 6: Nested chat execution loop for a sub-task, demonstrating the use of primitive skills and change observation.

Evaluation on WebVoyager

Agent-E is evaluated on the WebVoyager benchmark, which comprises 643 tasks across 15 real-world websites. The evaluation protocol includes not only task success rates but also error awareness (self-aware vs. oblivious failures), task completion times, and the number of LLM calls per task.

Figure 7: The set of websites included in the WebVoyager benchmark.

Agent-E achieves a 73.2% overall task success rate, outperforming the previous best text-only agent (Wilbur) by 21% and the best multi-modal agent by 16%. The system is self-aware in over 52% of its failures, a critical property for safe deployment and human-in-the-loop workflows. Task completion times average 150 seconds for successful tasks and 220 seconds for failures, with an average of 25 LLM calls per task.
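As an illustration of how such metrics could be aggregated from per-task logs, the sketch below computes the same five quantities. The `TaskRecord` fields and the `summarize` function are hypothetical, and the four records are made-up toy values, not data from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskRecord:
    success: bool
    self_aware: bool   # for failures: did the agent report that it failed?
    seconds: float
    llm_calls: int


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def summarize(records: List[TaskRecord]) -> Dict[str, float]:
    successes = [r for r in records if r.success]
    failures = [r for r in records if not r.success]
    return {
        "success_rate": len(successes) / len(records),
        # Share of failures the agent itself recognized (vs. oblivious ones).
        "self_aware_failure_rate": _mean(
            [1.0 if r.self_aware else 0.0 for r in failures]
        ),
        "avg_seconds_success": _mean([r.seconds for r in successes]),
        "avg_seconds_failure": _mean([r.seconds for r in failures]),
        "avg_llm_calls": _mean([float(r.llm_calls) for r in records]),
    }


# Toy records for illustration only.
records = [
    TaskRecord(success=True, self_aware=False, seconds=100.0, llm_calls=20),
    TaskRecord(success=True, self_aware=False, seconds=200.0, llm_calls=30),
    TaskRecord(success=False, self_aware=True, seconds=300.0, llm_calls=30),
    TaskRecord(success=False, self_aware=False, seconds=200.0, llm_calls=20),
]
summary = summarize(records)
```

Note that the self-aware rate is computed over failures only, so it is independent of the success rate.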

Qualitative Analysis and Design Principles

The hierarchical architecture enables robust error detection, recovery, and backtracking. The planner can verify sub-task outcomes and re-plan as needed, leveraging the separation of planning and execution. The flexible DOM observation methods allow the agent to adapt to task-specific requirements, and change observation provides continuous feedback, improving both grounding and efficiency.
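A verify-and-re-plan loop of the kind described here might look like the following sketch. The function name, the retry limit, and the toy `execute`, `verify`, and `replan` callables are assumptions; in the real system each of these decisions would be made by an LLM-powered agent.

```python
from typing import Callable, List, Tuple

LogEntry = Tuple[str, str, bool]  # (sub-task, result, verified)


def run_with_replanning(
    sub_tasks: List[str],
    execute: Callable[[str], str],
    verify: Callable[[str, str], bool],
    replan: Callable[[str, str], str],
    max_attempts: int = 2,
) -> List[LogEntry]:
    """Execute sub-tasks in order, re-planning any that fail verification."""
    log: List[LogEntry] = []
    for sub_task in sub_tasks:
        attempts = 0
        while True:
            attempts += 1
            result = execute(sub_task)
            ok = verify(sub_task, result)
            log.append((sub_task, result, ok))
            if ok:
                break
            if attempts >= max_attempts:
                # Giving up explicitly is a self-aware failure, not a silent one.
                raise RuntimeError(f"giving up on: {sub_task}")
            sub_task = replan(sub_task, result)
    return log


def toy_execute(sub_task: str) -> str:
    return "error: element not found" if "via menu" in sub_task else "ok"


def toy_verify(sub_task: str, result: str) -> bool:
    return result == "ok"


def toy_replan(sub_task: str, result: str) -> str:
    return sub_task.replace("via menu", "via search box")


log = run_with_replanning(
    ["open site", "find laptops via menu"], toy_execute, toy_verify, toy_replan
)
```

Keeping verification in the outer loop means the executor can stay simple: it reports what happened, and the planner decides whether that counts as progress.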

Figure 8: Example of Agent-E detecting and recovering from errors, illustrating planner-driven re-planning.

Figure 9: Example of flexible DOM observation, with the agent selecting the most appropriate representation for the task.

From these empirical results and system design choices, the paper synthesizes eight foundational design principles for agentic systems:

Well-crafted primitive skills are essential for compositional generalization and robust performance.
Hierarchical architectures facilitate complex task decomposition, verification, and modular development.
Payload denoising (e.g., DOM distillation) is critical for efficiency and accuracy.
Linguistic feedback of actions (change observation) improves agent grounding and error recovery.
Human-in-the-loop support is necessary for trust, safety, and continuous improvement.
Routine analysis and aggregation of past experiences enable self-improvement and hybridization with classical automation.
Internal and external guardrails are required for safe and effective operation.
The choice between task-specific and generic agent design should be based on deployment requirements.

Implications and Future Directions

Agent-E demonstrates that hierarchical, skill-based architectures with adaptive observation and feedback mechanisms can substantially improve the reliability and efficiency of autonomous web agents. By explicitly reporting error awareness and resource usage, the paper moves agent evaluation beyond simple success rates.

The design principles articulated in the paper are broadly applicable to agentic systems in other domains, including device automation, robotic control, and enterprise workflows. Future work may focus on integrating vision-based capabilities, optimizing for specific domains, and developing more sophisticated self-improvement and caching strategies. The modularity of Agent-E's architecture also facilitates experimentation with alternative planning and execution paradigms, including reinforcement learning and hybrid symbolic-neural approaches.

Conclusion

Agent-E advances the state of the art in autonomous web agents through a combination of hierarchical planning, flexible DOM distillation, and continuous change observation. Its strong empirical performance on the WebVoyager benchmark, coupled with comprehensive error analysis and resource reporting, provides a robust foundation for future research and deployment. The design principles distilled from Agent-E's development offer actionable guidance for practitioners building agentic systems across a range of application domains.