Large Multimodal Agents: A Survey (2402.15116v1)

Published 23 Feb 2024 in cs.CV, cs.AI, and cs.CL

Abstract: LLMs have achieved superior performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities akin to humans. Concurrently, there is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse multimodal user queries, thereby handling more intricate and nuanced tasks. In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as large multimodal agents (LMAs for short). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review the collaborative frameworks integrating multiple LMAs, enhancing collective efficacy. One of the critical challenges in this field is the diverse evaluation methods used across existing studies, hindering effective comparison among different LMAs. Therefore, we compile these evaluation methodologies and establish a comprehensive framework to bridge the gaps. This framework aims to standardize evaluations, facilitating more meaningful comparisons. Concluding our review, we highlight the extensive applications of LMAs and propose possible future research directions. Our discussion aims to provide valuable insights and guidelines for future research in this rapidly evolving field. An up-to-date resource list is available at https://github.com/jun0wanan/awesome-large-multimodal-agents.


Summary

  • The paper introduces a taxonomy for large multimodal agents, categorizing current research into four types, and identifies their key components: perception, planning, action, and memory.
  • It details methodologies including prompt techniques, fine-tuning, and multi-agent collaboration while showcasing applications in robotics, autonomous driving, and GUI automation.
  • The survey emphasizes the need for standardized evaluation methods to benchmark performance and drive future innovations in multimodal LLM systems.

Large Multimodal Agents: A Survey

Introduction

"Large Multimodal Agents: A Survey" focuses on the evolution of LLMs into the multimodal domain, resulting in large multimodal agents (LMAs). These agents are distinguished by their ability to handle tasks involving diverse modalities, such as text, images, and videos. The paper systematically reviews the components, categorizes current research into four types, and highlights the collaborative frameworks integrating LMAs. Furthermore, it emphasizes the necessity for standardized evaluation methods and discusses real-world applications and future directions. Figure 1

Figure 1: Representative research papers from top AI conferences on LLM-powered multimodal agents, categorized by model names.

Core Components of LMAs

The survey identifies four core components integral to LMAs: perception, planning, action, and memory. These components are essential for enabling LMAs to function effectively in complex and dynamic environments; a minimal code sketch of how they fit together follows the list below.

  1. Perception: This involves processing multimodal information from the environment. Perception techniques have evolved from simple methods that convert multimodal inputs into text to more sophisticated approaches that invoke sub-task tools for specific data types; advanced methods, for example, extract a visual vocabulary and refine it to support environmental understanding.
  2. Planning: Central to LMAs, planners utilize LLMs for reasoning and formulating plans. Planning strategies range from static, where a plan is fixed once set, to dynamic, where plans can be re-evaluated and revised based on feedback. Different models and methods are employed depending on task complexity.
  3. Action: Actions execute the formulated plans and take the form of tool use, virtual actions, or embodied actions. Approaches range from prompting the planner to produce executable actions directly to learning from action-related data to enhance the planner's capabilities.
  4. Memory: While early LMAs relied only on short-term memory, modern LMAs incorporate long-term memory, which is crucial for handling complex tasks. Memory storage often involves converting multimodal inputs into a format that can be easily retrieved for future planning.

Figure 2: Illustrations of four types of LMAs.
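
The survey describes these components conceptually rather than prescribing an implementation. As a rough illustration only, the Python sketch below shows one way the four components might be wired into a single decision step; the class and method names (`LMAgent`, `Memory`, `perceive`, the `captioner` tool) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Memory:
    """Hypothetical long-term store for past observations, plans, and results."""
    entries: List[dict] = field(default_factory=list)

    def write(self, record: dict) -> None:
        self.entries.append(record)

    def retrieve(self, query: str, k: int = 3) -> List[dict]:
        # Naive substring match stands in for embedding-based retrieval.
        hits = [e for e in self.entries if query.lower() in str(e).lower()]
        return hits[-k:]


class LMAgent:
    """Minimal loop over the four components the survey identifies:
    perception -> planning -> action -> memory."""

    def __init__(self, planner: Callable[..., dict],
                 tools: Dict[str, Callable], memory: Memory) -> None:
        self.planner = planner  # e.g. a wrapper around an LLM call that returns a plan
        self.tools = tools      # tool name -> callable, used for perception and action
        self.memory = memory

    def perceive(self, observation: Any) -> str:
        # Perception: convert a multimodal observation (image, audio, ...) to text.
        return self.tools["captioner"](observation)

    def step(self, observation: Any, goal: str) -> Any:
        context = self.perceive(observation)        # perception
        recalled = self.memory.retrieve(goal)       # memory read
        plan = self.planner(goal=goal, context=context, memory=recalled)  # planning
        result = self.tools[plan["tool"]](**plan["args"])                 # action
        self.memory.write({"goal": goal, "plan": plan, "result": result})  # memory write
        return result
```

In practice, the planner would be an LLM prompted or fine-tuned to emit a structured plan, and the tool set would pair perception models (captioners, detectors) with action tools (APIs, controllers, generators).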

Taxonomy of LMAs

The paper classifies existing LMAs into four types:

  1. Type I: Utilizes closed-source LLMs with prompt techniques, lacking long-term memory.
  2. Type II: Involves fine-tuning open-source models for planning without long-term memory.
  3. Type III: Integrates planners with indirect access to long-term memory via tools.
  4. Type IV: Features planners with direct long-term memory access, bypassing tool mediation.

Each type represents an evolution in handling increasingly complex tasks and environments, from simple settings to dynamic, open-world scenarios.
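
To make the distinctions between the four types concrete, the sketch below restates them as data. It is an interpretation for illustration only; the field values paraphrase the survey's taxonomy and are not verbatim from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LMAProfile:
    planner: str           # how the planning component is obtained
    long_term_memory: str  # "none", "tool-mediated", or "direct"


TAXONOMY = {
    "Type I":   LMAProfile("closed-source LLM prompted for planning", "none"),
    "Type II":  LMAProfile("open-source model fine-tuned for planning", "none"),
    "Type III": LMAProfile("planner that reaches memory through tools", "tool-mediated"),
    "Type IV":  LMAProfile("planner with native access to memory", "direct"),
}
```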

Multi-agent Collaboration

Collaboration among multiple LMAs is crucial for complex task completion. Frameworks featuring multiple agents distribute responsibilities, enhancing task efficiency and performance. These systems facilitate cooperative strategies, reducing the burden on individual agents, and inherently incorporate memory capabilities for storing collaborative experiences.

Figure 3: Illustrations of two types of multi-agent frameworks.
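
As a rough sketch of the coordination pattern, assuming a simple round-robin schedule and a flat shared-memory list (the surveyed frameworks are considerably richer), the function below lets several agent callables take turns on a task while logging their contributions:

```python
from typing import Any, Callable, Dict, List


def run_collaboration(agents: Dict[str, Callable[..., Any]],
                      task: str,
                      shared_memory: List[dict],
                      rounds: int = 2) -> Any:
    """Round-robin collaboration sketch: each agent sees the shared memory,
    contributes an output, and the final contribution is returned."""
    result = None
    for _ in range(rounds):
        for name, agent in agents.items():
            result = agent(task=task, history=list(shared_memory))
            shared_memory.append({"agent": name, "output": result})
    return result
```

Real frameworks typically add role specialization, for example a dedicated planner agent coordinating several tool-using agents, and structured memory in place of the flat list used here.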

Evaluation

Evaluation remains a challenge, with a need for standardized measures. Existing studies utilize task-specific metrics, but the development of universal benchmarks is essential for comparative evaluations. Subjective assessments involve human evaluations, focusing on versatility, user-friendliness, and value. Objective evaluations rely on well-defined metrics and benchmarks to establish performance standards.
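
On the objective side, an evaluation harness reduces to running an agent over a benchmark and aggregating a task-level metric. The sketch below assumes a benchmark of records with "input" and "reference" fields and a caller-supplied metric such as success rate; it is illustrative only, not a benchmark or protocol defined in the survey.

```python
from typing import Any, Callable, Dict, List


def evaluate(agent: Callable[[Any], Any],
             benchmark: List[Dict[str, Any]],
             metric: Callable[[Any, Any], float]) -> float:
    """Run the agent on every benchmark task and average the metric scores."""
    scores = []
    for task in benchmark:
        prediction = agent(task["input"])
        scores.append(metric(prediction, task["reference"]))
    return sum(scores) / len(scores) if scores else 0.0


# Example metric: exact-match success rate.
success = lambda pred, ref: float(pred == ref)
```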

Applications

LMAs exhibit significant versatility across numerous domains:

  • GUI Automation: Simulating human interface interactions to streamline workflows.
  • Robotics and Embodied AI: Enhancing physical interaction capabilities in dynamic environments.
  • Game Development: Creating intelligent, interactive virtual agents.
  • Autonomous Driving: Advancing vehicles' ability to perceive and adapt to complex environments.
  • Video Understanding: Facilitating advanced multimedia content analysis.
  • Visual Generation and Editing: Enabling creative visual projects through automation.
  • Complex Visual Reasoning Tasks: Expanding cognitive capacity for nuanced tasks involving multimodal data.
  • Audio Editing and Generation: Efficiently managing multimedia content for creative purposes.

Figure 4: A variety of applications of LMAs.

Conclusion

The survey concludes by highlighting the need for more unified frameworks and systematic evaluation methods to propel LMAs further toward real-world applicability. The potential for LMAs in diverse applications exemplifies the transformative capability of integrating advanced LLM-driven models across various multimodal domains. Continued research is encouraged to refine these systems, enhance their adaptability, and explore new application avenues in human-computer interaction and beyond.
