AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

(arXiv:2401.12963)
Published Jan 23, 2024 in cs.RO, cs.AI, cs.CL, cs.CV, and cs.LG

Abstract

Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses LLMs for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction following data collection robots that can align to human preferences.

Figure: AutoRT system diagram. Robots explore and sample navigation targets; a VLM and LLM generate tasks without prior knowledge of the scene layout.

Overview

  • AutoRT is designed to autonomously orchestrate multiple robots for large-scale data acquisition in diverse environments.

  • The system builds upon the integration of LLMs and VLMs for task execution and uses open vocabulary object detectors for scene comprehension.

  • Robots operate with varying degrees of autonomy, from full independence to teleoperation, guided by the Robot Constitution (see the prompt sketch after this list).

  • Experimental results demonstrate that AutoRT allows one human to supervise up to five robots, increasing data diversity and deployment capabilities.

  • While effective, AutoRT's limitations include real-world interaction complexity and the need for future research on policy improvement and sparse data handling.
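
The Robot Constitution referenced above is described as a set of foundational rules, safety measures, and embodiment limitations that steer the LLM's task proposals. The sketch below shows one plausible way such rules could be encoded as a prompt preamble; the rule text, the ROBOT_CONSTITUTION constant, and build_task_proposal_prompt are illustrative assumptions, not AutoRT's actual prompt or API.

```python
# Hypothetical sketch: encoding a "Robot Constitution" as an LLM prompt preamble.
# The three rule categories come from the summary (foundational, safety,
# embodiment); the specific rule wording below is illustrative only.

ROBOT_CONSTITUTION = """\
Foundational rules:
1. A robot may not harm a human or allow a human to come to harm.
2. A robot must follow human instructions unless they conflict with rule 1.

Safety rules:
3. Do not propose tasks involving humans, animals, or sharp or fragile objects.
4. Do not interact with electrical outlets or powered appliances.

Embodiment rules:
5. The robot has a single arm and cannot lift heavy objects.
6. The robot cannot open locked doors or operate elevators.
"""


def build_task_proposal_prompt(scene_description: str, high_level_goal: str) -> str:
    """Combine the constitution, the VLM scene description, and the operator's
    high-level goal into a single prompt for the task-proposal LLM."""
    return (
        f"{ROBOT_CONSTITUTION}\n"
        f"Scene: {scene_description}\n"
        f"Goal: {high_level_goal}\n"
        "Propose a numbered list of diverse manipulation tasks the robot could "
        "attempt in this scene while respecting every rule above."
    )


if __name__ == "__main__":
    print(build_task_proposal_prompt(
        scene_description="a kitchen counter with a sponge, a mug, and an apple",
        high_level_goal="collect diverse tabletop manipulation data",
    ))
```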

Introduction

Autonomous robotics research is moving towards robotic agents that can perform a broad range of tasks without human intervention. Despite the strides made with robot learning methods and the integration of advanced LLMs and VLMs, an autonomous system's ability to infer and perform open-ended tasks across varied environments remains a significant hurdle. The crux of this challenge is data scarcity: operating in the physical world requires extensive real-world experience that goes well beyond what can be captured in controlled lab environments. This paper introduces AutoRT to address that obstacle, a system that autonomously orchestrates a fleet of robots for large-scale data acquisition.

Related Work

AutoRT builds upon prior work in autonomous data collection and recent advances in language and vision models. Autonomous data collection has traditionally been confined to lab-bounded tasks, but more varied and less structured environments are now being considered. Teleoperated data collection remains valuable because of the diversity it yields, yet it is constrained by the availability of human operators, which motivates a hybrid approach. Using LLMs to drive agent behavior has been explored in other contexts; this work extends that line by having LLM-driven robots execute self-proposed goals in the real world.

Problem Statement

The work frames the problem of large-scale, "in-the-wild" robotic data collection: multiple robots operating across different environments must gather data that is both diverse and operationally efficient. Because human supervision is limited, each robot must interpret the state of its surroundings and execute tasks through a spectrum of policies ranging from full autonomy to teleoperation, enabling wide-ranging data collection while managing the tradeoff between independence and safety.

AutoRT: Exploring and Executing in the Wild

AutoRT orchestrates robots by combining user directives with environmental observations, generating candidate tasks, and then carrying out the tasks it deems appropriate. Core components include an open-vocabulary object detector and VLM for scene comprehension, an LLM that proposes tasks in line with high-level goals, and an LLM that determines how each task should be executed. A standout feature is the Robot Constitution, a set of operational guideposts covering foundational rules, safety measures, and embodiment limitations that constrain task proposal and selection. Experiments show that the data AutoRT collects is significantly more diverse and that the system can align robot actions with human preferences. In deployment, a single human could supervise up to five robots, significantly amplifying both the scale and the diversity of autonomously collected data. A minimal sketch of the per-robot loop appears below.
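
The following is a minimal, hypothetical sketch of the per-robot loop just described, assuming simple stand-in interfaces for the camera, VLM, LLM, policy, and teleoperation queue; none of these function names come from the paper. It shows the flow of one data-collection step: capture an image, describe the scene, propose tasks, route each task through a constitution check, and execute or skip it.

```python
# Minimal AutoRT-style per-robot loop (assumed interfaces, not AutoRT's actual API).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoutedTask:
    instruction: str
    mode: str  # "autonomous", "teleop", or "reject"


def collection_step(
    robot_camera: Callable[[], bytes],
    vlm_describe: Callable[[bytes], str],
    llm_propose: Callable[[str], List[str]],
    llm_route: Callable[[str], str],
    run_policy: Callable[[str], None],
    request_teleop: Callable[[str], None],
) -> List[RoutedTask]:
    """One data-collection step for a single robot in the fleet."""
    image = robot_camera()                      # robot drives to a sampled target and looks around
    scene = vlm_describe(image)                 # open-vocabulary scene description
    candidates = llm_propose(scene)             # diverse task suggestions grounded in the scene
    routed = [RoutedTask(t, llm_route(t)) for t in candidates]  # constitution check + mode choice
    for task in routed:
        if task.mode == "autonomous":
            run_policy(task.instruction)        # e.g. a learned language-conditioned policy
        elif task.mode == "teleop":
            request_teleop(task.instruction)    # queue for the shared human operator
        # "reject": task violates a rule or is infeasible for this embodiment; skip it
    return routed
```

Routing each task to "autonomous", "teleop", or "reject" mirrors the autonomy-versus-safety tradeoff the summary highlights: only tasks judged safe and feasible for the robot's embodiment run without a human in the loop, while the rest are queued for teleoperation or dropped.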

Conclusion and Future Work

AutoRT represents a step towards autonomous, large-scale, "in-the-wild" data acquisition by robotic systems, and the orchestrated approach has yielded a substantial corpus of diverse, real-world robot interactions. Despite its demonstrated efficacy, AutoRT has limitations rooted in the complexity of real-world interactions, the diversity of the collected data, and the continued need for human oversight. Going forward, integrating robotic data collection more tightly with policy improvement and handling sparse data are pivotal directions for further research. The paper raises compelling questions about how robots can autonomously interface with our environments and points towards a future in which robotic data collection could parallel the extensive scope of foundation models.
