AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

(arXiv:2401.12963)
Published Jan 23, 2024 in cs.RO, cs.AI, cs.CL, cs.CV, and cs.LG

Abstract

Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses LLMs for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such "in-the-wild" data collected by AutoRT is significantly more diverse, and that AutoRT's use of LLMs allows for instruction following data collection robots that can align to human preferences.

Figure: AutoRT system diagram. Robots explore and sample navigation targets; a VLM and LLM generate tasks without prior knowledge of the scene layout.

Overview

  • AutoRT is designed to autonomously orchestrate multiple robots for large-scale data acquisition in diverse environments.

  • The system builds upon the integration of LLMs and VLMs for task execution and uses open vocabulary object detectors for scene comprehension.

  • Robots operate with varying degrees of autonomy, from full independence to teleoperation, guided by the Robot Constitution (see the prompt sketch after this list).

  • Experimental results demonstrate that AutoRT allows one human to supervise up to five robots, increasing data diversity and deployment capabilities.

  • While effective, AutoRT's limitations include real-world interaction complexity and the need for future research on policy improvement and sparse data handling.
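
The Robot Constitution referenced above is described as a set of foundational rules, safety measures, and embodiment limitations that steer the LLM's task proposals. The sketch below shows one plausible way such rules could be encoded as a prompt preamble; the rule text, the ROBOT_CONSTITUTION constant, and build_task_proposal_prompt are illustrative assumptions, not AutoRT's actual prompt or API.

```python
# Hypothetical sketch: encoding a "Robot Constitution" as an LLM prompt preamble.
# The three rule categories come from the summary (foundational, safety,
# embodiment); the specific rule wording below is illustrative only.

ROBOT_CONSTITUTION = """\
Foundational rules:
1. A robot may not harm a human or allow a human to come to harm.
2. A robot must follow human instructions unless they conflict with rule 1.

Safety rules:
3. Do not propose tasks involving humans, animals, or sharp or fragile objects.
4. Do not interact with electrical outlets or powered appliances.

Embodiment rules:
5. The robot has a single arm and cannot lift heavy objects.
6. The robot cannot open locked doors or operate elevators.
"""


def build_task_proposal_prompt(scene_description: str, high_level_goal: str) -> str:
    """Combine the constitution, the VLM scene description, and the operator's
    high-level goal into a single prompt for the task-proposal LLM."""
    return (
        f"{ROBOT_CONSTITUTION}\n"
        f"Scene: {scene_description}\n"
        f"Goal: {high_level_goal}\n"
        "Propose a numbered list of diverse manipulation tasks the robot could "
        "attempt in this scene while respecting every rule above."
    )


if __name__ == "__main__":
    print(build_task_proposal_prompt(
        scene_description="a kitchen counter with a sponge, a mug, and an apple",
        high_level_goal="collect diverse tabletop manipulation data",
    ))
```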

Introduction

Autonomous robotics research is moving towards robotic agents that can perform a broad range of tasks without human intervention. Despite the strides made with robot learning methods and the integration of advanced LLMs and VLMs, an autonomous system's ability to infer and perform open-ended tasks across varied environments remains a significant hurdle. The crux of this challenge is data scarcity: operating in the physical world requires extensive real-world experience that goes well beyond what can be captured in controlled lab environments. This paper introduces AutoRT to address that obstacle, a system that autonomously orchestrates a fleet of robots for large-scale data acquisition.

Related Work

AutoRT builds upon prior work in autonomous data collection and recent advances in language and vision models. Autonomous data collection has traditionally been confined to lab-bounded tasks, but more varied and less structured environments are now being considered. Teleoperated data collection remains valuable because of the diversity it yields, yet it is constrained by the availability of human operators, which motivates a hybrid approach. Using LLMs to drive agent behavior has been explored in other contexts; this work extends that line by having LLM-driven robots execute self-proposed goals in the real world.

Problem Statement

The work frames the problem of large-scale, "in-the-wild" robotic data collection: multiple robots operating across different environments must gather data that is both diverse and operationally efficient. Because human supervision is limited, each robot must interpret the state of its surroundings and execute tasks through a spectrum of policies ranging from full autonomy to teleoperation, enabling wide-ranging data collection while managing the tradeoff between independence and safety.

AutoRT: Exploring and Executing in the Wild

AutoRT orchestrates robots by combining user directives with environmental observations, generating candidate tasks, and then carrying out the tasks it deems appropriate. Core components include an open-vocabulary object detector and VLM for scene comprehension, an LLM that proposes tasks in line with high-level goals, and an LLM that determines how each task should be executed. A standout feature is the Robot Constitution, a set of operational guideposts covering foundational rules, safety measures, and embodiment limitations that constrain task proposal and selection. Experiments show that the data AutoRT collects is significantly more diverse and that the system can align robot actions with human preferences. In deployment, a single human could supervise up to five robots, significantly amplifying both the scale and the diversity of autonomously collected data. A minimal sketch of the per-robot loop appears below.
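
The following is a minimal, hypothetical sketch of the per-robot loop just described, assuming simple stand-in interfaces for the camera, VLM, LLM, policy, and teleoperation queue; none of these function names come from the paper. It shows the flow of one data-collection step: capture an image, describe the scene, propose tasks, route each task through a constitution check, and execute or skip it.

```python
# Minimal AutoRT-style per-robot loop (assumed interfaces, not AutoRT's actual API).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoutedTask:
    instruction: str
    mode: str  # "autonomous", "teleop", or "reject"


def collection_step(
    robot_camera: Callable[[], bytes],
    vlm_describe: Callable[[bytes], str],
    llm_propose: Callable[[str], List[str]],
    llm_route: Callable[[str], str],
    run_policy: Callable[[str], None],
    request_teleop: Callable[[str], None],
) -> List[RoutedTask]:
    """One data-collection step for a single robot in the fleet."""
    image = robot_camera()                      # robot drives to a sampled target and looks around
    scene = vlm_describe(image)                 # open-vocabulary scene description
    candidates = llm_propose(scene)             # diverse task suggestions grounded in the scene
    routed = [RoutedTask(t, llm_route(t)) for t in candidates]  # constitution check + mode choice
    for task in routed:
        if task.mode == "autonomous":
            run_policy(task.instruction)        # e.g. a learned language-conditioned policy
        elif task.mode == "teleop":
            request_teleop(task.instruction)    # queue for the shared human operator
        # "reject": task violates a rule or is infeasible for this embodiment; skip it
    return routed
```

Routing each task to "autonomous", "teleop", or "reject" mirrors the autonomy-versus-safety tradeoff the summary highlights: only tasks judged safe and feasible for the robot's embodiment run without a human in the loop, while the rest are queued for teleoperation or dropped.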

Conclusion and Future Work

AutoRT represents a step towards autonomous, large-scale, "in-the-wild" data acquisition by robotic systems, and the orchestrated approach has yielded a substantial corpus of diverse, real-world robot interactions. Despite its demonstrated efficacy, AutoRT has limitations rooted in the complexity of real-world interactions, the diversity of the collected data, and the continued need for human oversight. Going forward, integrating robotic data collection more tightly with policy improvement and handling sparse data are pivotal directions for further research. The paper raises compelling questions about how robots can autonomously interface with our environments and points towards a future in which robotic data collection could parallel the extensive scope of foundation models.
