SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning (2307.06135v2)

Published 12 Jul 2023 in cs.RO and cs.AI

Abstract: LLMs have demonstrated impressive results in developing generalist planning agents for diverse tasks. However, grounding these plans in expansive, multi-floor, and multi-room environments presents a significant challenge for robotics. We introduce SayPlan, a scalable approach to LLM-based, large-scale task planning for robotics using 3D scene graph (3DSG) representations. To ensure the scalability of our approach, we: (1) exploit the hierarchical nature of 3DSGs to allow LLMs to conduct a 'semantic search' for task-relevant subgraphs from a smaller, collapsed representation of the full graph; (2) reduce the planning horizon for the LLM by integrating a classical path planner and (3) introduce an 'iterative replanning' pipeline that refines the initial plan using feedback from a scene graph simulator, correcting infeasible actions and avoiding planning failures. We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects and show that our approach is capable of grounding large-scale, long-horizon task plans from abstract, and natural language instruction for a mobile manipulator robot to execute. We provide real robot video demonstrations on our project page https://sayplan.github.io.

Summary

The paper introduces SayPlan, which grounds LLM-generated plans in 3D scene graphs to enable scalable robot task planning in complex environments.
It leverages hierarchical scene graphs and semantic subgraph search to minimize token usage while ensuring plans adhere to physical constraints.
An iterative replanning pipeline, combined with a classical path planner, improves plan executability; in the semantic search evaluation, the system aligned with human reasoning in about 86.7% of simple search tasks.

Overview of SayPlan: Grounding LLMs using 3D Scene Graphs for Scalable Robot Task Planning

This paper presents SayPlan, a framework for robot task planning in large multi-floor, multi-room environments that grounds LLMs in 3D scene graphs (3DSGs). The primary aim is to scale LLM-based planning to environments that are too large to pass to the model in full, given its token limits. The core emphasis is on ensuring that the plans generated by LLMs are both feasible and grounded in the given physical environment.

Innovations in SayPlan

The authors introduce several key approaches to enhance scalability and efficacy:

Hierarchical 3D Scene Graphs: SayPlan uses the hierarchy of a 3DSG to abstract the environment, so the LLM can reason about the semantics of spatial components while the graph representation stays within the model's token limit.
Semantic Search for Subgraphs: A novel semantic search mechanism is employed, which permits the LLM to explore task-relevant subgraphs from a collapsed 3DSG representation. This strategy not only maintains a low token footprint but also focuses the model's attention on smaller segments of the graph necessary for task completion.
Integration with Classical Path Planners: To shorten the planning horizon and limit hallucinated or infeasible navigation sequences, the framework delegates the navigational parts of a plan to a classical path planner, leaving the LLM to generate the remaining action-level steps.
Iterative Replanning Pipeline: SayPlan implements an iterative cycle of planning, verification, and replanning. Feedback from a scene graph simulator is used to iteratively refine and verify the plans, ensuring high executability and adherence to environmental constraints.
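
The four ideas above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the toy graph, the room layout, the action names ("goto", "pick") and the helpers collapsed_view, shortest_path, verify, replan, and stub_planner are all hypothetical, and a hard-coded stub stands in for the LLM. It only shows how a collapsed graph view, a classical path search, and simulator feedback might fit together.

```python
from collections import deque

# Hypothetical toy scene graph (floor -> room -> objects); names are illustrative.
SCENE = {
    "floor_1": {
        "kitchen": ["fridge", "mug"],
        "office": ["desk", "stapler"],
        "hallway": [],
    }
}
# Assumed room connectivity, used only by the classical path planner.
DOORS = {"kitchen": ["hallway"], "hallway": ["kitchen", "office"], "office": ["hallway"]}
# Flat lookup of room contents, used by the simulator check.
ROOM_OBJECTS = {room: objs for rooms in SCENE.values() for room, objs in rooms.items()}


def collapsed_view(expanded):
    """Show every room, but list objects only for rooms in `expanded`."""
    return {
        floor: {room: (list(objs) if room in expanded else None)
                for room, objs in rooms.items()}
        for floor, rooms in SCENE.items()
    }


def shortest_path(start, goal):
    """Breadth-first search over room connectivity; returns a room list or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in DOORS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def verify(plan, start):
    """Simulate the plan against the graph; return (ok, feedback message)."""
    location = start
    for action, target in plan:
        if action == "goto":
            if target not in ROOM_OBJECTS or shortest_path(location, target) is None:
                return False, f"cannot reach {target} from {location}"
            location = target
        elif action == "pick":
            if target not in ROOM_OBJECTS[location]:
                return False, f"{target} is not in {location}"
        else:
            return False, f"unknown action {action}"
    return True, "ok"


def replan(propose, start, max_iters=3):
    """Ask the planner for a plan, feeding back simulator errors until it verifies."""
    feedback = None
    for _ in range(max_iters):
        plan = propose(feedback)
        ok, feedback = verify(plan, start)
        if ok:
            return plan
    return None


def stub_planner(feedback):
    """Stand-in for the LLM: a wrong first guess, corrected once feedback arrives."""
    if feedback is None:
        return [("goto", "office"), ("pick", "mug")]
    return [("goto", "kitchen"), ("pick", "mug")]


if __name__ == "__main__":
    print(collapsed_view({"kitchen"}))
    print(shortest_path("kitchen", "office"))
    print(replan(stub_planner, "hallway"))
```

In this sketch the planner never emits room-to-room routes itself; "goto" steps are resolved by the breadth-first search, which is one simple way to keep the plan the LLM must produce short.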

Experimental Validation

The approach was validated on two complex environments: an office floor with 37 rooms and a multi-story house. The evaluation covered 90 distinct tasks designed to test semantic search capacity and causal planning competence. In the semantic search evaluation, GPT-4 showed a significant performance advantage over GPT-3.5, and the system aligned with human reasoning processes in about 86.7% of simple search tasks. Furthermore, SayPlan's iterative replanning process substantially increased the executability of plans in long-horizon tasks.

Implications and Future Directions

SayPlan suggests a promising direction for robot task planning, particularly in its efficient use of LLMs and semantic graphs to manage extensive and varied environments. The approach lays groundwork for integrating ongoing research in 3D scene graph representations and LLM-enhanced planning. However, challenges remain, including dynamic object interaction, real-time updates to scene graphs, and extending the approach beyond static environments. Future studies may benefit from addressing these challenges, exploring more sophisticated graph reasoning capabilities, or incorporating online scene graph generation.

Overall, SayPlan provides a framework for improving large-scale robotic planning, with potential utility across real-world applications such as home automation, healthcare robotics, and collaborative team-based environments.