An Embodied Generalist Agent in 3D World

(arXiv:2311.12871)

Published Nov 18, 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract

Leveraging massive knowledge from LLMs, recent machine learning models show notable successes in general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images and exhibit a limited capacity for 3D input; (ii) these models rarely explore tasks inherently defined in the 3D world, e.g., 3D grounding, embodied reasoning, and acting. We argue these limitations significantly hinder current models from performing real-world tasks and approaching general intelligence. To this end, we introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. LEO is trained with a unified task interface, model architecture, and objective in two stages: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning. We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world. Moreover, we meticulously design an LLM-assisted pipeline to produce high-quality 3D VL data. Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation, and manipulation. Our ablative studies and scaling analyses further provide valuable insights for developing future embodied generalist agents. Code and data are available on the project page.

Figure: The embodied generalist agent LEO processes 2D images, 3D point clouds, and text for 3D task predictions.

Overview

  • The paper introduces LEO, a multi-modal, multi-task generalist agent designed for comprehensive interaction in 3D environments, bridging current gaps in real-world 3D task performance.

  • The training methodology for LEO involves a two-stage process: 3D Vision-Language Alignment and 3D Vision-Language-Action Instruction Tuning, utilizing datasets from Objaverse, ScanNet, and 3RScan.

  • LEO demonstrates state-of-the-art performance in various tasks such as 3D captioning, question answering, and robotic manipulation, showcasing its robust understanding and reasoning capabilities in diverse domains.

Overview of "An Embodied Generalist Agent in 3D World"

The paper presents a multi-modal and multi-task generalist agent, LEO, designed for comprehensive understanding and interaction in 3D environments. This work aims to bridge the gap in capabilities between existing general-purpose models and the requirements for real-world 3D task performance. LEO is introduced as a solution to the challenge of creating models that are proficient in perceiving, grounding, reasoning, planning, and acting within a 3D world.

Training Methodology

LEO's training process is divided into two stages:

  1. 3D Vision-Language Alignment (LEO-align): This stage focuses on aligning 3D scene representations with natural language. It involves training the model on tasks like object-level captioning, object referring in scenes, and scene-level captioning. A curated dataset from Objaverse, ScanNet, and 3RScan is utilized, encapsulating various object and scene details.

  2. 3D Vision-Language-Action Instruction Tuning (LEO-instruct): The second stage endows LEO with generalist capabilities for a variety of 3D tasks, such as 3D captioning, question answering, dialogue, task planning, navigation, and robotic manipulation. The training dataset is significantly expanded through meticulous curation and LLM-assisted data generation, particularly leveraging scene graphs and refinement processes to ensure quality.
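
To make the scene-graph-driven generation concrete, here is a minimal sketch of the idea. The toy scene-graph format, the prompt wording, and the helpers `query_llm` and `is_grounded` are hypothetical stand-ins for the paper's actual LLM call and refinement step, not its real pipeline.

```python
# Sketch of LLM-assisted 3D VL data generation from a scene graph.
# All names and formats below are illustrative assumptions.

def graph_to_text(graph: dict) -> str:
    """Serialize a scene graph into a plain-text scene description."""
    objs = [f"a {' '.join(o['attributes'])} {o['label']} (id {o['id']})"
            for o in graph["objects"]]
    rels = [f"object {s} is {r} object {t}" for s, r, t in graph["relations"]]
    return "; ".join(objs + rels)

def query_llm(prompt: str) -> dict:
    """Stub standing in for a real LLM API call that returns a QA pair."""
    return {"question": "What is in front of the red sofa?",
            "answer": "A wooden table."}

def is_grounded(pair: dict, graph: dict) -> bool:
    """Toy refinement step: keep pairs whose answer mentions a known object."""
    labels = {o["label"] for o in graph["objects"]}
    return any(label in pair["answer"].lower() for label in labels)

scene_graph = {
    "objects": [{"id": 0, "label": "sofa", "attributes": ["red"]},
                {"id": 1, "label": "table", "attributes": ["wooden"]}],
    "relations": [(1, "in front of", 0)],
}

prompt = ("Given this 3D scene: " + graph_to_text(scene_graph) +
          ". Write one question-answer pair grounded in the scene.")
pair = query_llm(prompt)
if is_grounded(pair, scene_graph):
    print(pair)  # accepted as a training example
```

The key design point this illustrates is that the scene graph gives the generator verifiable structure, so hallucinated QA pairs can be filtered out in a refinement pass before they enter the training set.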

Model Architecture

LEO is built around a decoder-only LLM that consumes a unified token sequence combining embeddings of egocentric 2D images, object-centric 3D point clouds, and text. A spatial transformer encodes the 3D object features so that inter-object spatial relations are captured, and the LLM is fine-tuned parameter-efficiently with LoRA. Every task is thereby cast as task-agnostic autoregressive sequence prediction.
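
A minimal PyTorch sketch of this interface is given below. The module names, feature dimensions, and the small two-layer encoder standing in for the spatial transformer are illustrative assumptions, not the paper's actual implementation, which uses pretrained 2D/3D encoders and a LoRA-tuned LLM.

```python
import torch
import torch.nn as nn

D = 512  # shared LLM embedding width (illustrative)

class MultimodalPrefix(nn.Module):
    """Projects 2D image and 3D object features into a token prefix for the LLM."""
    def __init__(self, img_dim=768, obj_dim=256, d=D):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d)  # egocentric 2D image features
        self.obj_proj = nn.Linear(obj_dim, d)  # object-centric 3D features
        # stand-in for the spatial transformer that mixes 3D object relations
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
            num_layers=2)

    def forward(self, img_feats, obj_feats):
        img_tok = self.img_proj(img_feats)                # (B, N_img, D)
        obj_tok = self.spatial(self.obj_proj(obj_feats))  # (B, N_obj, D)
        return torch.cat([img_tok, obj_tok], dim=1)       # prefix for the LLM

# The prefix tokens are prepended to the instruction's text embeddings and fed
# to the decoder-only LLM; with LoRA, most LLM weights stay frozen and training
# uses the standard next-token cross-entropy on the response tokens.
prefix = MultimodalPrefix()(torch.randn(1, 4, 768), torch.randn(1, 8, 256))
print(prefix.shape)  # torch.Size([1, 12, 512])
```

Casting every modality into one token sequence is what lets a single objective cover captioning, QA, and action prediction: the output is always just the next token.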

Evaluation and Results

LEO was rigorously tested on:

  • 3D Captioning (e.g., Scan2Cap)
  • 3D Question Answering (e.g., ScanQA)
  • Embodied Reasoning (e.g., SQA3D)
  • Scene-aware Dialogue and Planning
  • Embodied Navigation (on Habitat)
  • Robotic Manipulation (on CLIPort tasks)

The model demonstrated state-of-the-art performance across these tasks, indicating its proficiency in diverse domains. For instance, in dense 3D captioning tasks, LEO outperformed existing state-of-the-art models. In 3D question answering, it achieved significant accuracy improvements, reflecting its robust understanding and reasoning capabilities.

Implications and Future Directions

The development of LEO marks a critical step towards embodied generalist agents capable of integrating advanced perception, language processing, and action planning into a cohesive system. The implications of this work are substantial, as such agents could facilitate various real-world applications, from autonomous robotics to advanced human-computer interaction systems.

Theoretical Implications: The work supports the hypothesis that a unified model can effectively handle multi-modal, multi-task learning by integrating various forms of visual and textual data. This challenges the need for task-specific architectures, promoting a more generalist approach in model design.

Practical Implications: Practically, LEO’s capabilities could be extended to real-world robotics, enhancing autonomous systems in complex environments. Furthermore, the approach to dataset generation and refinement provides a framework that can be replicated for other domains requiring comprehensive multi-modal understanding.

Future Work: Future research could focus on scaling the model to incorporate more diverse and larger-scale 3D datasets. Additionally, exploring the integration of more sophisticated policy architectures for embodied tasks, such as recurrent models for navigation, could enhance performance further. Investigating safety and alignment issues within the context of embodied AI is another crucial area, especially as these models become more integral to real-world applications.

Conclusion

The introduction of LEO, an embodied generalist agent with advanced 3D world interaction capacities, marks a significant advancement in AI. This paper provides comprehensive insights into training methodologies, model architecture, and the extensive evaluation of LEO, establishing new benchmarks and opening pathways for future research in embodied AI. The findings underscore the potential of such agents to transform how AI interfaces with complex real-world environments.
