Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (2401.16158v2)

Published 29 Jan 2024 in cs.CL and cs.CV

Abstract: Mobile device agents based on Multimodal Large Language Models (MLLMs) are becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within an app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task and navigates the mobile apps step by step. Different from previous solutions that rely on XML files of apps or mobile system metadata, Mobile-Agent allows for greater adaptability across diverse mobile operating environments in a vision-centric way, thereby eliminating the necessity for system-specific customizations. To assess the performance of Mobile-Agent, we introduced Mobile-Eval, a benchmark for evaluating mobile device operations. Based on Mobile-Eval, we conducted a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieved remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.

References (24)
  1. Modelscope-agent: Building your customizable agent system with open-source large language models. arXiv preprint arXiv:2309.00986, 2023.
  2. Controlllm: Augment language models with tools by searching on graphs. arXiv preprint arXiv:2310.17796, 2023.
  3. Interngpt: Solving vision-centric tasks by interacting with chatgpt beyond language. arXiv preprint arXiv:2305.05662, 2023.
  4. Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437, 2023.
  5. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023.
  6. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
  7. Gpt4tools: Teaching large language model to use tools via self-instruction. arXiv preprint arXiv:2305.18752, 2023.
  8. Small llms are weak tool learners: A multi-llm agent. arXiv preprint arXiv:2401.07324, 2024.
  9. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
  10. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
  11. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023.
  12. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  13. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  14. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  15. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  16. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  17. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
  18. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257, 2023.
  19. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  20. Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023.
  21. Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.
  22. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023.
  23. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  24. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.

Summary

  • The paper introduces a visual perception module and self-reflection mechanism to autonomously execute and correct mobile app operations.
  • It leverages advanced tools like GPT-4V, OCR, Grounding DINO, and CLIP to interpret device screenshots without relying on system metadata.
  • Experiments using Mobile-Eval show over 90% instruction completion, demonstrating robust multi-app task execution and error correction.

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Introduction

The advent of Multimodal LLMs (MLLMs) has unveiled new horizons for autonomous agents operating on mobile devices. This paper introduces Mobile-Agent, a novel vision-centric autonomous multi-modal mobile device agent that leverages MLLMs for enhanced adaptability across diverse mobile environments. Unlike traditional approaches reliant on XML files or system metadata, Mobile-Agent is designed to operate using visual perception, thus bypassing the need for system-specific customizations. The core advancements presented in this paper include the visual perception module for operation localization and a self-reflection mechanism to enhance task completion rates.

Mobile-Agent Framework

The Mobile-Agent framework integrates advanced MLLMs with visual perception for precise localization and execution of operations based solely on device screenshots. Utilizing state-of-the-art tools like GPT-4V, the agent orchestrates a workflow that includes:

  1. Visual Perception Tools: These tools employ OCR for text localization and icon-detection modules powered by Grounding DINO and CLIP for recognizing and interacting with icons on the device screen (Figure 1); a minimal perception sketch appears after this list. This approach ensures the agent can function without accessing underlying app code or metadata.

Figure 1: The framework of Mobile-Agent.

  2. Instruction Execution: Mobile-Agent defines a set of operations, such as opening apps, typing text, and navigating interfaces, to execute instructions efficiently. Self-planning and self-reflection are critical for correcting operational errors, boosting the agent's robustness in dynamic environments.
  3. Self-Reflection Mechanism: This feature allows Mobile-Agent to identify and amend errors autonomously. Upon detecting an invalid operation (e.g., an unchanged screenshot), the agent re-evaluates the task sequence and proceeds with alternative actions until the task completes (Figure 2); a sketch of this execute-and-reflect loop follows the list.

Figure 2: Case of instruction comprehension and execution planning.
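
To make the perception step more concrete, the following is a minimal sketch of how screenshot parsing could be assembled from off-the-shelf components. It assumes EasyOCR for text localization and a Hugging Face CLIP checkpoint for scoring candidate icon crops against a textual description; the Grounding DINO detection stage is left out, and helper names such as `locate_text` and `rank_icon_crops` are illustrative rather than part of the released Mobile-Agent code.

```python
# Illustrative sketch only: approximates the perception step with EasyOCR
# (text localization) and CLIP (icon matching). The Grounding DINO detector
# that proposes icon regions is omitted; see the official repo for the
# actual pipeline.
import easyocr
import torch
from transformers import CLIPModel, CLIPProcessor

reader = easyocr.Reader(["en"])  # OCR for on-screen text
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def locate_text(screenshot_path, target):
    """Return the center of the OCR box whose text contains `target`, or None."""
    for box, text, conf in reader.readtext(screenshot_path):
        if target.lower() in text.lower():
            xs = [p[0] for p in box]
            ys = [p[1] for p in box]
            return (sum(xs) / 4, sum(ys) / 4)
    return None

def rank_icon_crops(crops, description):
    """Score candidate icon crops (PIL images) against a text description with CLIP."""
    inputs = processor(text=[description], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # logits_per_image: similarity of each crop to the description
    return out.logits_per_image.squeeze(-1).tolist()
```

In the pipeline the paper describes, candidate icon regions come from a detector (Grounding DINO) run on the screenshot, and the CLIP scores are then used to pick the crop that best matches the icon the agent intends to tap.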
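
The operation space and the self-reflection check can likewise be illustrated with a small, hypothetical driver loop. The sketch below assumes the device is controlled through standard ADB shell commands and that invalid operations are detected by comparing consecutive screenshots; the `OPERATIONS` table, the `ask_mllm` callback, and the hashing heuristic are simplified stand-ins, not the paper's actual interface.

```python
# Hedged sketch of an execute-and-reflect loop. The ADB commands are standard,
# but the agent interface (ask_mllm) and the screenshot-diff heuristic are
# simplified stand-ins for what the paper describes.
import hashlib
import subprocess

def adb(*args):
    return subprocess.run(["adb", *args], capture_output=True, check=True).stdout

def screenshot(path="screen.png"):
    with open(path, "wb") as f:
        f.write(adb("exec-out", "screencap", "-p"))  # raw PNG from the device
    return path

OPERATIONS = {
    "tap":  lambda x, y: adb("shell", "input", "tap", str(x), str(y)),
    "type": lambda text: adb("shell", "input", "text", text.replace(" ", "%s")),
    "back": lambda: adb("shell", "input", "keyevent", "KEYCODE_BACK"),
}

def screen_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def run(instruction, ask_mllm, max_steps=20):
    """ask_mllm(instruction, screenshot_path, history) -> (op_name, args) or ("done", ())."""
    history = []
    for _ in range(max_steps):
        shot = screenshot()
        before = screen_hash(shot)
        op, args = ask_mllm(instruction, shot, history)
        if op == "done":
            return True
        OPERATIONS[op](*args)
        after = screen_hash(screenshot("after.png"))
        if after == before:
            # Self-reflection: the screen did not change, so mark the step
            # invalid and let the model plan an alternative on the next turn.
            history.append((op, args, "invalid"))
        else:
            history.append((op, args, "ok"))
    return False
```

Hashing full screenshots is a deliberately crude proxy for the "unchanged screenshot" signal; a real implementation would likely tolerate status-bar changes and animation noise.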

Experiments and Evaluation

The efficacy of Mobile-Agent was rigorously tested using Mobile-Eval, a benchmark encompassing diverse mobile applications. Results demonstrated high instruction completion rates, averaging above 90% across different complexity levels. Quantitative metrics, namely Success (Su), Process Score (PS), Relative Efficiency (RE), and Completion Rate (CR), were employed to evaluate performance. Notably, Mobile-Agent showcased an ability to execute complex multi-app instructions seamlessly (Figure 3), indicating its potential for broader applications.

Figure 3: Case of operating multiple apps to search for a game result.
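
For readers who want to track these metrics themselves, a small helper along the following lines can do the per-task bookkeeping. The formulas below are assumptions inferred from the metric names, namely PS as the fraction of correct steps among the agent's steps, RE as the ratio of human-required steps to agent steps, and CR as the fraction of human-operated steps the agent reproduces; consult the paper for the exact definitions before comparing against its reported numbers.

```python
# Hedged helper for Mobile-Eval-style bookkeeping. The metric definitions
# below are assumptions inferred from the metric names, not the paper's
# exact formulas.
from dataclasses import dataclass

@dataclass
class TaskLog:
    agent_steps: int            # total operations the agent executed
    correct_steps: int          # operations judged correct
    human_steps: int            # steps a human needs for the same instruction
    completed_human_steps: int  # human steps the agent managed to reproduce
    success: bool               # did the agent finish the instruction?

def metrics(log: TaskLog) -> dict:
    return {
        "Su": float(log.success),
        "PS": log.correct_steps / max(log.agent_steps, 1),
        "RE": log.human_steps / max(log.agent_steps, 1),
        "CR": log.completed_human_steps / max(log.human_steps, 1),
    }

# Example: a 5-step task the agent finishes in 6 steps, 5 of them correct.
print(metrics(TaskLog(agent_steps=6, correct_steps=5,
                      human_steps=5, completed_human_steps=5, success=True)))
```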

One illustrative case involves Mobile-Agent's capability to correct invalid operations through reflection, leading to successful task completion despite initial errors (Figure 4).

Figure 4: Case of self-reflection and error correction after using invalid operations.

The paper contextualizes Mobile-Agent in relation to previous work on LLM-based agents and mobile device operation systems. Whereas existing MLLMs such as GPT-4V struggle with precise on-screen localization, Mobile-Agent's dedicated visual perception tools provide the grounding needed to act on the interface. Furthermore, while solutions like AppAgent depend on extracting actionable information from app metadata such as XML files, Mobile-Agent's design circumvents these requirements, substantially enhancing adaptability and ease of deployment.

Conclusion

This work presents Mobile-Agent as a promising advancement in autonomous multi-modal agents for mobile devices, employing a purely vision-centric approach that enhances operational flexibility and efficiency. Its success in handling multi-app operations, self-reflective error correction, and language-agnostic operation underpins its potential as a versatile tool in mobile environments. Future research could explore integration with other operating systems and further refinement of its visual perception and execution planning capabilities.

Figure 5: Case of using Amazon Music to search for and play music with specific content.
