AppAgent: Multimodal Agents as Smartphone Users (2312.13771v2)
Abstract: Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method: the agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent consults when executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.
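To make the "simplified action space" concrete, the sketch below models touch-style actions (tap, long-press, swipe, text entry) as plain data objects that could be rendered into device input commands. This is a hypothetical illustration, not the paper's implementation: the class names, fields, and the `to_device_command` helper are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical sketch of a simplified, human-like action space: each
# action is a screen-level touch interaction, so the agent needs no
# back-end or app-internal API access.

@dataclass
class Tap:
    element_id: int  # numeric label of a UI element on the screenshot

@dataclass
class LongPress:
    element_id: int

@dataclass
class Swipe:
    element_id: int
    direction: str   # e.g. "up", "down", "left", "right"
    distance: str    # e.g. "short", "medium", "long"

@dataclass
class Text:
    content: str     # text typed into the currently focused field

def to_device_command(action) -> str:
    """Render an abstract action as a placeholder device command string.

    A real driver would look up the element's screen coordinates from the
    labeled screenshot and issue actual input events; here we only show
    the mapping from abstract actions to commands."""
    if isinstance(action, Tap):
        return f"tap element {action.element_id}"
    if isinstance(action, LongPress):
        return f"long_press element {action.element_id}"
    if isinstance(action, Swipe):
        return f"swipe element {action.element_id} {action.direction} ({action.distance})"
    if isinstance(action, Text):
        return f"input text {action.content!r}"
    raise ValueError(f"unknown action: {action!r}")
```

Keeping the action space this small is what lets a single agent generalize across apps: every app exposes the same four primitives, so only the knowledge base (which elements to act on, in what order) is app-specific.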