Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with LLMs so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.
Introduces a method called SayCan that grounds LLMs in the physical world, enabling robots to follow complex instructions given in natural language.
SayCan combines an LLM's semantic understanding with robotic affordances through pretrained skills, so the robot proposes and executes only actions that are feasible given its capabilities and environmental context.
The approach was validated on 101 real-world robotic tasks, showing a significant improvement in task completion rates over non-grounded baselines.
It outlines future research directions aimed at enhancing robotic skills, refining grounding techniques, and exploring bidirectional learning between robots and LLMs.
LLMs have shown remarkable capabilities in understanding and generating natural language. However, their application to robotic tasks poses significant challenges due to their lack of understanding of the physical world and the actions that can be executed within it. The paper introduces a novel approach to bridging this gap by grounding LLMs in the physical world through the use of pretrained skills. This method, referred to as SayCan, enables robots to follow high-level, abstract instructions in natural language by combining the semantic understanding of LLMs with the real-world interaction capabilities of robots.
SayCan leverages the semantic knowledge encoded in LLMs and grounds it in the affordances of the physical actions available to a robot. The process involves two key components:

1. "Say": the LLM scores each pretrained skill's natural language description by how useful that skill would be as the next step toward completing the instruction, given the steps taken so far.
2. "Can": a value function associated with each skill estimates the probability that the skill can be successfully executed from the robot's current state, supplying the real-world grounding.
By combining these components, SayCan allows a robot to interpret complex instructions, decide on a sequence of actions that can achieve the given task, and execute these actions in the real world.
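The combination described above can be sketched as a greedy loop that, at each step, multiplies the LLM's usefulness score by the affordance value and executes the best skill. This is a minimal illustration, not the authors' implementation: `llm_score` and `affordance_value` are hypothetical toy stand-ins (a word-overlap heuristic and a lookup table) for the real LLM likelihoods and learned value functions.

```python
# Hedged sketch of SayCan's skill-selection loop (toy stand-ins, not the paper's code).

def llm_score(instruction, plan, skill):
    """Stand-in for the "Say" component: how useful the LLM judges `skill`
    to be as the next step. Here: a toy word-overlap heuristic that also
    discourages repeating an already-executed skill."""
    if skill in plan and skill != "done":
        return 0.01
    overlap = len(set(skill.split()) & set(instruction.split()))
    return (1 + overlap) / (1 + len(skill.split()))

def affordance_value(state, skill):
    """Stand-in for the "Can" component: a value function estimating the
    probability that `skill` succeeds from the current state. Here: a lookup."""
    return state.get(skill, 0.0)

def saycan_plan(instruction, skills, state, max_steps=10):
    """Greedily build a plan: at each step pick the skill maximizing
    LLM usefulness x affordance feasibility (the SayCan product)."""
    plan = []
    for _ in range(max_steps):
        scores = {s: llm_score(instruction, plan, s) * affordance_value(state, s)
                  for s in skills}
        best = max(scores, key=scores.get)
        plan.append(best)
        if best == "done":
            break
    return plan

skills = ["find a sponge", "pick up the sponge", "go to the spill", "done"]
# Hypothetical affordance values for the robot's current state.
state = {"find a sponge": 0.9, "pick up the sponge": 0.2,
         "go to the spill": 0.1, "done": 0.05}
print(saycan_plan("clean up the spill with a sponge", skills, state))
```

In the real system the affordance values change as the robot acts and the LLM is re-queried after each executed skill; the toy repetition penalty above merely mimics that state progression.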
The approach was evaluated on a set of 101 real-world robotic tasks, demonstrating its ability to execute long-horizon, abstract instructions with a high degree of success. The evaluation showed a significant improvement in task completion rates over non-grounded baselines, confirming that grounding language in the robot's physical capabilities and environment is necessary for successful task execution.
SayCan presents significant advancements in integrating the semantic knowledge of LLMs with the physical execution capabilities of robots. The approach raises important considerations for future research in robotics and AI, particularly in improving the interaction between high-level language understanding and low-level action execution. Future work may explore expanding the repertoire and robustness of the robot's low-level skills, refining the grounding techniques that connect language to those skills, and enabling bidirectional learning in which the robot's real-world experience in turn improves the language model.
SayCan represents a promising direction in leveraging the vast semantic knowledge of LLMs for robotic task execution. By grounding language in the affordances of the physical world, this approach enables robots to perform complex, temporally extended tasks based solely on high-level natural language instructions. This research paves the way for more intuitive and effective human-robot interaction, where communicating complex tasks can be as simple as speaking naturally.