Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (2401.16158v2)

Published 29 Jan 2024 in cs.CL and cs.CV

Abstract: Mobile device agents based on Multimodal Large Language Models (MLLMs) are becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within an app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task and navigates the mobile apps step by step. Different from previous solutions that rely on XML files of apps or mobile system metadata, Mobile-Agent allows for greater adaptability across diverse mobile operating environments in a vision-centric way, thereby eliminating the necessity for system-specific customizations. To assess the performance of Mobile-Agent, we introduced Mobile-Eval, a benchmark for evaluating mobile device operations. Based on Mobile-Eval, we conducted a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieved remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.

References (24)
  1. Modelscope-agent: Building your customizable agent system with open-source large language models. arXiv preprint arXiv:2309.00986, 2023.
  2. Controlllm: Augment language models with tools by searching on graphs. arXiv preprint arXiv:2310.17796, 2023.
  3. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 2023.
  4. Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437, 2023.
  5. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023.
  6. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
  7. Gpt4tools: Teaching large language model to use tools via self-instruction. arXiv preprint arXiv:2305.18752, 2023.
  8. Small llms are weak tool learners: A multi-llm agent. arXiv preprint arXiv:2401.07324, 2024.
  9. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
  10. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
  11. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.
  12. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  13. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  14. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  15. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  16. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  17. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
  18. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257, 2023.
  19. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  20. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023.
  21. Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.
  22. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.
  23. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  24. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.

Summary

  • The paper introduces a visual perception module and self-reflection mechanism to autonomously execute and correct mobile app operations.
  • It leverages advanced tools like GPT-4V, OCR, Grounding DINO, and CLIP to interpret device screenshots without relying on system metadata.
  • Experiments using Mobile-Eval show over 90% instruction completion, demonstrating robust multi-app task execution and error correction.

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Introduction

The advent of Multimodal LLMs (MLLMs) has unveiled new horizons for autonomous agents operating on mobile devices. This paper introduces Mobile-Agent, a novel vision-centric autonomous multi-modal mobile device agent that leverages MLLMs for enhanced adaptability across diverse mobile environments. Unlike traditional approaches reliant on XML files or system metadata, Mobile-Agent is designed to operate using visual perception, thus bypassing the need for system-specific customizations. The core advancements presented in this paper include the visual perception module for operation localization and a self-reflection mechanism to enhance task completion rates.

Mobile-Agent Framework

The Mobile-Agent framework integrates advanced MLLMs with visual perception for precise localization and execution of operations based solely on device screenshots. Utilizing state-of-the-art tools like GPT-4V, the agent orchestrates a workflow that includes:

  1. Visual Perception Tools: These tools employ OCR for text localization and icon-detection modules powered by Grounding DINO and CLIP for recognizing and interacting with icons on the device screen (Figure 1); a minimal perception sketch appears after this list. This approach ensures the agent can function without accessing underlying app code or metadata.

Figure 1: The framework of Mobile-Agent.

  2. Instruction Execution: Mobile-Agent defines a set of operations, such as opening apps, typing text, and navigating interfaces, to execute instructions efficiently. Self-planning and self-reflection are critical for correcting operational errors, boosting the agent's robustness in dynamic environments.
  3. Self-Reflection Mechanism: This feature allows Mobile-Agent to identify and amend errors autonomously. Upon detecting an invalid operation (e.g., an unchanged screenshot), the agent re-evaluates the task sequence and proceeds with alternative actions until the task completes (Figure 2); a sketch of this execute-and-reflect loop follows the list.

Figure 2: Case of instruction comprehension and execution planning.
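
To make the perception step more concrete, the following is a minimal sketch of how screenshot parsing could be assembled from off-the-shelf components. It assumes EasyOCR for text localization and a Hugging Face CLIP checkpoint for scoring candidate icon crops against a textual description; the Grounding DINO detection stage is left out, and helper names such as `locate_text` and `rank_icon_crops` are illustrative rather than part of the released Mobile-Agent code.

```python
# Illustrative sketch only: approximates the perception step with EasyOCR
# (text localization) and CLIP (icon matching). The Grounding DINO detector
# that proposes icon regions is omitted; see the official repo for the
# actual pipeline.
import easyocr
import torch
from transformers import CLIPModel, CLIPProcessor

reader = easyocr.Reader(["en"])  # OCR for on-screen text
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def locate_text(screenshot_path, target):
    """Return the center of the OCR box whose text contains `target`, or None."""
    for box, text, conf in reader.readtext(screenshot_path):
        if target.lower() in text.lower():
            xs = [p[0] for p in box]
            ys = [p[1] for p in box]
            return (sum(xs) / 4, sum(ys) / 4)
    return None

def rank_icon_crops(crops, description):
    """Score candidate icon crops (PIL images) against a text description with CLIP."""
    inputs = processor(text=[description], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # logits_per_image: similarity of each crop to the description
    return out.logits_per_image.squeeze(-1).tolist()
```

In the pipeline the paper describes, candidate icon regions come from a detector (Grounding DINO) run on the screenshot, and the CLIP scores are then used to pick the crop that best matches the icon the agent intends to tap.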
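
The operation space and the self-reflection check can likewise be illustrated with a small, hypothetical driver loop. The sketch below assumes the device is controlled through standard ADB shell commands and that invalid operations are detected by comparing consecutive screenshots; the `OPERATIONS` table, the `ask_mllm` callback, and the hashing heuristic are simplified stand-ins, not the paper's actual interface.

```python
# Hedged sketch of an execute-and-reflect loop. The ADB commands are standard,
# but the agent interface (ask_mllm) and the screenshot-diff heuristic are
# simplified stand-ins for what the paper describes.
import hashlib
import subprocess

def adb(*args):
    return subprocess.run(["adb", *args], capture_output=True, check=True).stdout

def screenshot(path="screen.png"):
    with open(path, "wb") as f:
        f.write(adb("exec-out", "screencap", "-p"))  # raw PNG from the device
    return path

OPERATIONS = {
    "tap":  lambda x, y: adb("shell", "input", "tap", str(x), str(y)),
    "type": lambda text: adb("shell", "input", "text", text.replace(" ", "%s")),
    "back": lambda: adb("shell", "input", "keyevent", "KEYCODE_BACK"),
}

def screen_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def run(instruction, ask_mllm, max_steps=20):
    """ask_mllm(instruction, screenshot_path, history) -> (op_name, args) or ("done", ())."""
    history = []
    for _ in range(max_steps):
        shot = screenshot()
        before = screen_hash(shot)
        op, args = ask_mllm(instruction, shot, history)
        if op == "done":
            return True
        OPERATIONS[op](*args)
        after = screen_hash(screenshot("after.png"))
        if after == before:
            # Self-reflection: the screen did not change, so mark the step
            # invalid and let the model plan an alternative on the next turn.
            history.append((op, args, "invalid"))
        else:
            history.append((op, args, "ok"))
    return False
```

Hashing full screenshots is a deliberately crude proxy for the "unchanged screenshot" signal; a real implementation would likely tolerate status-bar changes and animation noise.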

Experiments and Evaluation

The efficacy of Mobile-Agent was rigorously tested using Mobile-Eval, a benchmark encompassing diverse mobile applications. Results demonstrated high instruction completion rates, averaging above 90% across different complexity levels. Quantitative metrics, namely Success (Su), Process Score (PS), Relative Efficiency (RE), and Completion Rate (CR), were employed to evaluate performance. Notably, Mobile-Agent showcased an ability to execute complex multi-app instructions seamlessly (Figure 3), indicating its potential for broader applications.

Figure 3: Case of operating multiple apps to search for a game result.
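
For readers who want to track these metrics themselves, a small helper along the following lines can do the per-task bookkeeping. The formulas below are assumptions inferred from the metric names, namely PS as the fraction of correct steps among the agent's steps, RE as the ratio of human-required steps to agent steps, and CR as the fraction of human-operated steps the agent reproduces; consult the paper for the exact definitions before comparing against its reported numbers.

```python
# Hedged helper for Mobile-Eval-style bookkeeping. The metric definitions
# below are assumptions inferred from the metric names, not the paper's
# exact formulas.
from dataclasses import dataclass

@dataclass
class TaskLog:
    agent_steps: int            # total operations the agent executed
    correct_steps: int          # operations judged correct
    human_steps: int            # steps a human needs for the same instruction
    completed_human_steps: int  # human steps the agent managed to reproduce
    success: bool               # did the agent finish the instruction?

def metrics(log: TaskLog) -> dict:
    return {
        "Su": float(log.success),
        "PS": log.correct_steps / max(log.agent_steps, 1),
        "RE": log.human_steps / max(log.agent_steps, 1),
        "CR": log.completed_human_steps / max(log.human_steps, 1),
    }

# Example: a 5-step task the agent finishes in 6 steps, 5 of them correct.
print(metrics(TaskLog(agent_steps=6, correct_steps=5,
                      human_steps=5, completed_human_steps=5, success=True)))
```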

One illustrative case involves Mobile-Agent's capability to correct invalid operations through reflection, leading to successful task completion despite initial errors (Figure 4).

Figure 4: Case of self-reflection and error correction after using invalid operations.

The paper contextualizes Mobile-Agent in relation to previous work on LLM-based agents and mobile device operation systems. Whereas existing MLLMs such as GPT-4V struggle with precise on-screen localization, Mobile-Agent's dedicated visual perception tools provide the grounding needed to act on the interface. Furthermore, while solutions like AppAgent depend on extracting actionable information from app metadata such as XML files, Mobile-Agent's design circumvents these requirements, substantially enhancing adaptability and ease of deployment.

Conclusion

This work presents Mobile-Agent as a promising advancement in autonomous multi-modal agents for mobile devices, employing a purely vision-centric approach that enhances operational flexibility and efficiency. Its success in handling multi-app operations, self-reflective error correction, and language-agnostic operation underpins its potential as a versatile tool in mobile environments. Future research could explore integration with other operating systems and further refinement of its visual perception and execution planning capabilities.

Figure 5: Case of using Amazon Music to search for and play music with specific content.
