
Abstract

Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal LLMs (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are significantly complicated under the single-agent architecture of existing work: the overly long token sequences and the interleaved text-image data format limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: a planning agent, a decision agent, and a reflection agent. The planning agent generates the task progress, making navigation of historical operations more efficient. To retain focus content, we design a memory unit that updates with the task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcome of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.

Figure: Operation process and interaction of agent roles in Mobile-Agent-v2.

Overview

  • Mobile-Agent-v2 addresses the limitations of single-agent systems in handling complex mobile device operations by introducing a multi-agent architecture with specialized roles for planning, decision-making, and reflection.

  • The system comprises three agents: Planning Agent for summarizing task histories, Decision Agent for processing task progress and making decisions, and Reflection Agent for identifying and correcting errors in operations.

  • Experiments show that Mobile-Agent-v2 improves task completion rates by over 30%, offering a robust solution for navigating and managing complex, interleaved task sequences on mobile devices.

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

The paper "Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration" presents a multi-agent architecture designed to address the limitations of single-agent systems in managing mobile device operation tasks, which involve extensive and complex sequences of interleaved text and image data.

Introduction and Background

Current Multimodal LLMs (MLLMs) have traditionally fallen short on mobile device operation tasks, primarily due to constraints in navigation and focus management. These limitations manifest as reduced performance on long token sequences and interleaved text-image data, making effective navigation through task progress and focus content particularly challenging. MLLM-based agents, which augment MLLMs with tool invocation for extended capabilities, have led to novel solutions, but existing single-agent designs still fail to address the navigation issues inherent in mobile device operation.

Architecture and Methodology

Mobile-Agent-v2 introduces a multi-agent system composed of three specialized agents: a planning agent, a decision agent, and a reflection agent. Each plays a distinct role, and together they enhance the navigation and decision-making processes (a minimal code sketch of how the three collaborate follows the list):

  1. Planning Agent: This agent addresses the complexity of lengthy operational histories by summarizing and condensing these histories into manageable pure-text task progress. This task progress, handed over to the decision agent, facilitates easier navigation and decision-making by reducing the context length.
  2. Decision Agent: Equipped with a visual perception module, the decision agent processes the condensed task progress and makes informed operation decisions. It is also responsible for updating the memory unit with focus content that may be referenced in future steps, ensuring the agent maintains an accurate focal context from past screens.
  3. Reflection Agent: To manage and correct potential erroneous operations, the reflection agent assesses the outcomes of each operation relative to the expected results. By analyzing screen changes before and after operations, it identifies, categorizes, and responds to erroneous and ineffective operations, thereby enhancing the reliability of task execution.
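
To make the division of labor concrete, the following is a minimal, illustrative sketch of this loop in Python. Every name in it (Memory, decide, reflect, plan, take_screenshot, execute, run_task) is a hypothetical stand-in for the paper's components rather than the released MobileAgent API; a real implementation would back the three agent functions with MLLM calls and the device I/O with a tool such as ADB.

```python
# Hypothetical sketch of the Mobile-Agent-v2 control loop. The stubs below
# stand in for MLLM calls and device I/O; they are not the released API.

from dataclasses import dataclass, field

@dataclass
class Memory:
    """Memory unit holding focus content that the decision agent updates."""
    focus_content: list[str] = field(default_factory=list)

    def update(self, note: str | None) -> None:
        if note:
            self.focus_content.append(note)

def take_screenshot() -> str:
    return "<screenshot>"  # placeholder for a real screen capture

def execute(action: str) -> None:
    pass  # placeholder for dispatching a tap/swipe/type to the device

def decide(instruction, progress, focus, screen):
    # Decision agent: pick the next operation from the instruction, the
    # pure-text progress, the focus content, and the current screen; also
    # return any focus note worth remembering from this screen.
    return "STOP", None

def reflect(action, before, after) -> str:
    # Reflection agent: compare the screens before and after an operation
    # and classify it as "ok", "ineffective", or "error".
    return "ok"

def plan(progress: str, action: str) -> str:
    # Planning agent: fold the executed operation into pure-text progress,
    # so history never re-enters the context as images.
    return (progress + "; " if progress else "") + action

def run_task(instruction: str, max_steps: int = 20) -> str:
    progress = ""      # pure-text task progress kept by the planning agent
    memory = Memory()  # focus content carried across screens

    for _ in range(max_steps):
        before = take_screenshot()
        action, focus_note = decide(instruction, progress,
                                    memory.focus_content, before)
        if action == "STOP":
            break
        memory.update(focus_note)  # retain focus content for later steps
        execute(action)

        after = take_screenshot()
        if reflect(action, before, after) == "error":
            continue  # faulty step: skip the progress update and retry

        progress = plan(progress, action)
    return progress
```

The design point the sketch mirrors is that only pure-text progress and short focus notes persist across steps, so the decision agent never has to re-consume a growing history of screenshots.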

Numerical Results and Performance

The experimental results show substantial improvements over the preceding Mobile-Agent architecture: Mobile-Agent-v2 achieves over a 30% increase in task completion rate, underscoring the efficacy of multi-agent collaboration. Tasks involving multi-step operations and interleaved modalities benefit the most, as the new architecture navigates and manages the complexities of mobile device operation more effectively.

Evaluation Metrics

To quantify performance, four metrics were employed: Success Rate (SR), Completion Rate (CR), Decision Accuracy (DA), and Reflection Accuracy (RA). The results show that Mobile-Agent-v2 improves not only task success but also the accuracy of decisions and reflections, indicating an overall gain in the robustness and precision of mobile device operation.
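
As a rough illustration of how these metrics might be tallied, the sketch below assumes per-episode logs judged against human-annotated ground truth. The field names and formulas are a plausible reading of the four metrics, not the paper's evaluation code.

```python
# Hypothetical metric tally for SR, CR, DA, and RA over logged episodes.

from dataclasses import dataclass

@dataclass
class Episode:
    completed: bool           # did the whole task succeed end to end?
    correct_steps: int        # operations matching the annotated path
    required_steps: int       # length of the annotated ground-truth path
    correct_decisions: int    # decision-agent outputs judged correct
    total_decisions: int
    correct_reflections: int  # reflection-agent verdicts judged correct
    total_reflections: int

def evaluate(episodes: list[Episode]) -> dict[str, float]:
    n = len(episodes)
    return {
        # SR: fraction of tasks completed end to end
        "SR": sum(e.completed for e in episodes) / n,
        # CR: average fraction of the annotated steps each run got through
        "CR": sum(e.correct_steps / e.required_steps for e in episodes) / n,
        # DA: correct decisions over all decisions made
        "DA": sum(e.correct_decisions for e in episodes)
              / sum(e.total_decisions for e in episodes),
        # RA: correct reflection verdicts over all reflections made
        "RA": sum(e.correct_reflections for e in episodes)
              / max(1, sum(e.total_reflections for e in episodes)),
    }

episodes = [Episode(True, 5, 5, 5, 5, 1, 1),
            Episode(False, 2, 6, 3, 4, 0, 1)]
print(evaluate(episodes))  # SR 0.5, CR ~0.667, DA ~0.889, RA 0.5
```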

Implications and Future Directions

The theoretical implications of this research extend to restructuring navigation and task management within multi-modal applications, potentially enhancing the adaptability of MLLMs across various domains. Practically, Mobile-Agent-v2 sets a new standard for mobile device assistants, providing a scalable solution that can handle complex and extensive operational sequences.

Future work might further optimize each agent's capability, possibly by integrating more sophisticated memory units or by exploring automated ways to inject operation knowledge. Moreover, the adaptability of this multi-agent framework across diverse devices and operating environments is a promising direction for continued research and development.

Conclusion

Mobile-Agent-v2 demonstrates a notable shift from single-agent to multi-agent architectures in mobile device operation assistance. By segregating the roles of planning, decision-making, and reflection into specialized agents, it provides a robust solution to the challenges posed by lengthy, interleaved task sequences on mobile devices. This advancement not only signifies a methodological improvement but also pushes the boundaries of effective mobile device operation within AI systems, setting a strong precedent for future explorations in this domain.
