Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with LLMs so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.
Introduces a method called SayCan that grounds LLMs in the physical world, enabling robots to follow complex instructions given in natural language.
SayCan combines an LLM's semantic understanding with robotic affordances through pretrained skills, so the robot proposes and executes only actions that are feasible given its capabilities and environmental context.
The approach was validated on 101 real-world robotic tasks, showing a significant improvement in task completion rates over non-grounded baselines.
It outlines future research directions aimed at enhancing robotic skills, refining grounding techniques, and exploring bidirectional learning between robots and LLMs.
LLMs have shown remarkable capabilities in understanding and generating natural language. However, their application to robotic tasks poses significant challenges due to their lack of understanding of the physical world and the actions that can be executed within it. The paper introduces a novel approach to bridging this gap by grounding LLMs in the physical world through the use of pretrained skills. This method, referred to as SayCan, enables robots to follow high-level, abstract instructions in natural language by combining the semantic understanding of LLMs with the real-world interaction capabilities of robots.
SayCan leverages the semantic knowledge encoded in LLMs and grounds it in the affordances of the physical actions available to a robot. The process involves two key components:

1. "Say": the LLM scores each pretrained skill's natural language description by how useful that skill would be as the next step toward completing the instruction, given the steps taken so far.
2. "Can": a value function associated with each skill estimates the probability that the skill can be successfully executed from the robot's current state, supplying the real-world grounding.
By combining these components, SayCan allows a robot to interpret complex instructions, decide on a sequence of actions that can achieve the given task, and execute these actions in the real world.
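The combination described above can be sketched as a greedy loop that, at each step, multiplies the LLM's usefulness score by the affordance value and executes the best skill. This is a minimal illustration, not the authors' implementation: `llm_score` and `affordance_value` are hypothetical toy stand-ins (a word-overlap heuristic and a lookup table) for the real LLM likelihoods and learned value functions.

```python
# Hedged sketch of SayCan's skill-selection loop (toy stand-ins, not the paper's code).

def llm_score(instruction, plan, skill):
    """Stand-in for the "Say" component: how useful the LLM judges `skill`
    to be as the next step. Here: a toy word-overlap heuristic that also
    discourages repeating an already-executed skill."""
    if skill in plan and skill != "done":
        return 0.01
    overlap = len(set(skill.split()) & set(instruction.split()))
    return (1 + overlap) / (1 + len(skill.split()))

def affordance_value(state, skill):
    """Stand-in for the "Can" component: a value function estimating the
    probability that `skill` succeeds from the current state. Here: a lookup."""
    return state.get(skill, 0.0)

def saycan_plan(instruction, skills, state, max_steps=10):
    """Greedily build a plan: at each step pick the skill maximizing
    LLM usefulness x affordance feasibility (the SayCan product)."""
    plan = []
    for _ in range(max_steps):
        scores = {s: llm_score(instruction, plan, s) * affordance_value(state, s)
                  for s in skills}
        best = max(scores, key=scores.get)
        plan.append(best)
        if best == "done":
            break
    return plan

skills = ["find a sponge", "pick up the sponge", "go to the spill", "done"]
# Hypothetical affordance values for the robot's current state.
state = {"find a sponge": 0.9, "pick up the sponge": 0.2,
         "go to the spill": 0.1, "done": 0.05}
print(saycan_plan("clean up the spill with a sponge", skills, state))
```

In the real system the affordance values change as the robot acts and the LLM is re-queried after each executed skill; the toy repetition penalty above merely mimics that state progression.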
The approach was evaluated on a set of 101 real-world robotic tasks, demonstrating its ability to execute long-horizon, abstract instructions with a high degree of success. The evaluation showed a significant improvement in task completion rates over non-grounded baselines, confirming that grounding language in the robot's physical capabilities and environment is necessary for successful task execution.
SayCan presents significant advancements in integrating the semantic knowledge of LLMs with the physical execution capabilities of robots. The approach raises important considerations for future research in robotics and AI, particularly in improving the interaction between high-level language understanding and low-level action execution. Future work may explore expanding the repertoire and robustness of the robot's low-level skills, refining the grounding techniques that connect language to those skills, and enabling bidirectional learning in which the robot's real-world experience in turn improves the language model.
SayCan represents a promising direction in leveraging the vast semantic knowledge of LLMs for robotic task execution. By grounding language in the affordances of the physical world, this approach enables robots to perform complex, temporally extended tasks based solely on high-level natural language instructions. This research paves the way for more intuitive and effective human-robot interaction, where communicating complex tasks can be as simple as speaking naturally.