Emergent Mind

Octo: An Open-Source Generalist Robot Policy

(2405.12213)
Published May 20, 2024 in cs.RO and cs.LG

Abstract

Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.

Octo's architecture: tokenization of tasks and observations, transformer processing, and dynamic input/output finetuning.

Overview

  • Octo is an open-source, transformer-based robot policy pre-trained on the largest robot manipulation dataset, designed to handle diverse robotic tasks with versatility, flexibility, and scalability.

  • Octo's architecture includes task and observation tokenizers, a transformer backbone, and readout heads, enabling it to map various inputs to actions and adapt to different sensors and robot configurations without large-scale retraining.

  • Experimental results show that Octo significantly outperforms previous state-of-the-art models in zero-shot and finetuning performance, demonstrating notable success rates in a variety of robotic tasks, and highlighting its efficacy and potential for reducing data collection needs.


Introduction

Robotics is an expanding field, and the hope of versatile robots that can perform a multitude of tasks out of the box is becoming more realistic. The Octo paper makes strides toward this vision by introducing a transformer-based policy pre-trained on a massive, diverse dataset. This open-source policy is designed to be both adaptable and robust, making it practical across a range of robotic learning scenarios. Let's break down what makes this interesting, and what the implications could be.

What is Octo?

Octo is a large, transformer-based policy designed for robot manipulation. It is pre-trained on 800,000 robot trajectories from the largest robot manipulation dataset to date, the Open X-Embodiment dataset. The policy can handle diverse inputs and outputs, and it's flexible: it can be adapted to various robots with different sensory inputs and action spaces.

Key Features:

  1. Versatility: Designed to work with multiple types of robots and sensors.
  2. Flexibility: Can be fine-tuned for new tasks and environments quickly.
  3. Scalability: Built using a transformer architecture, it scales well with data.
  4. Open Source: Fully open source, including weights, scripts, and dataset.

Architectural Overview

At its core, Octo uses a transformer to map inputs (like language instructions or goal images) and observations (like camera streams) to actions. The architecture is divided into three main parts:

  1. Task and Observation Tokenizers: These convert task descriptions and observations into tokens.
  2. Transformer Backbone: Processes these tokens to produce embeddings.
  3. Readout Heads: Convert embeddings into actions.

The key here is flexibility: by using a sequence of tokenized inputs, Octo can be adapted to different sensors and robot configurations without retraining large parts of the model.
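To make the three-part pipeline concrete, here is a deliberately tiny sketch of the tokenize-process-readout flow. Every function below is a stand-in written for illustration: the real Octo model uses learned language and image encoders, a full multi-layer transformer, and trained readout heads, none of which are reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED = 32  # toy embedding size; the real model is far larger

def tokenize_task(instruction: str) -> np.ndarray:
    """Stand-in language tokenizer: one random embedding per word."""
    return rng.standard_normal((len(instruction.split()), EMBED))

def tokenize_observation(image: np.ndarray, patch: int = 8) -> np.ndarray:
    """Stand-in image tokenizer: mean-pool patches, project to EMBED dims."""
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    pooled = patches.mean(axis=(1, 3)).reshape(-1, c)  # (n_patches, c)
    projection = rng.standard_normal((c, EMBED)) / np.sqrt(c)
    return pooled @ projection

def transformer_backbone(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the transformer: a single self-attention pass."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return attn @ tokens

def readout_head(embeddings: np.ndarray, action_dim: int = 7) -> np.ndarray:
    """Stand-in readout head: pool embeddings, map to an action vector."""
    w = rng.standard_normal((EMBED, action_dim)) / np.sqrt(EMBED)
    return embeddings.mean(axis=0) @ w

# Forward pass: task + observation tokens -> backbone -> action.
task_tokens = tokenize_task("pick up the red block")
obs_tokens = tokenize_observation(rng.standard_normal((64, 64, 3)))
embeddings = transformer_backbone(np.concatenate([task_tokens, obs_tokens]))
action = readout_head(embeddings)
print(action.shape)  # e.g. a 7-dim end-effector delta plus gripper command
```

The point of the structure is the one the paper emphasizes: because everything entering the backbone is just a token sequence, swapping in a new camera or a new task modality only means adding or replacing a tokenizer, not retraining the backbone.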

Training Details

Let's talk training. Octo was trained on a curated mix of 25 datasets from the Open X-Embodiment collection, weighted to balance dataset size and diversity. The model uses a diffusion-based action head, which lets it model continuous, multimodal action distributions and produce more precise actions than discretizing the action space would.

Hyperparameters and Specs:

  • Transformer Type: Similar to ViT (Vision Transformer)
  • Training Dataset: 800k trajectories
  • Batch Size: 2048
  • Training Time: 14 hours on a TPU v4-128 pod
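The diffusion-based action prediction mentioned above works by starting from Gaussian noise and iteratively denoising it into an action, conditioned on the model's observation embedding. The sketch below shows only that reverse-sampling loop shape; the denoiser, step count, and update rule are toy placeholders, not the schedule or network the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM, STEPS = 7, 20  # illustrative; not the paper's exact schedule

def denoiser(noisy_action, obs_embedding, t):
    """Stand-in for the learned denoising network eps(a_t, obs, t).
    Here it just predicts noise pulling the sample toward a fixed target
    (the timestep t is unused in this toy version)."""
    target = np.linspace(-0.5, 0.5, ACTION_DIM)  # pretend "true" action
    return (noisy_action - target) + 0.01 * obs_embedding[:ACTION_DIM]

def sample_action(obs_embedding):
    """DDPM-style reverse process: start from Gaussian noise and
    iteratively denoise, conditioned on the observation embedding."""
    a = rng.standard_normal(ACTION_DIM)
    for t in reversed(range(STEPS)):
        eps = denoiser(a, obs_embedding, t)
        a = a - (1.0 / STEPS) * eps  # simplified deterministic update
        if t > 0:
            a += 0.01 * rng.standard_normal(ACTION_DIM)  # small noise term
    return a

action = sample_action(rng.standard_normal(32))
print(action.shape)  # (7,)
```

Because sampling can land in different modes on different runs, a head like this can represent genuinely multimodal demonstrations (e.g. grasping an object from the left or the right) instead of averaging them into an invalid action.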

Experimental Results

Zero-Shot Performance

Octo was tested on several tasks across different robot setups immediately after pre-training, without any additional task-specific training:

  • Success Rate: On average, Octo had a 29% higher success rate compared to RT-1-X, a previous state-of-the-art openly available generalist policy.
  • Task Examples: Tasks varied from tabletop picking and placing to more complex tasks like opening drawers.

Octo also performed comparably to RT-2-X, a 55-billion-parameter generalist policy, despite being far more lightweight, underscoring the efficiency of its architecture.

Finetuning Performance

A major capability of Octo is how readily it can be finetuned to new setups. Octo was finetuned for different robotic tasks using an average of around 100 target demonstrations per domain:

  • Finetuning Time: Less than 5 hours on an NVIDIA A5000 GPU.
  • Success Rates: On average, Octo outperformed baseline methods (both from scratch and using pre-trained visual representations) by 52%.

Tasks included, but were not limited to:

  • Precision Handling: Tasks like peg insertion requiring force/torque inputs.
  • Novel Robot Control: Adapting to new robots not included in pre-training.
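The finetuning recipe implied by these results is to keep the pretrained backbone and attach fresh input tokenizers and action heads for the new sensors and action space. Here is a minimal sketch of that idea, assuming a toy linear model throughout; the parameter names, shapes, and training loop are all illustrative and do not reflect the actual Octo codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED = 32

# Pretend these weights were loaded from a pre-training checkpoint.
pretrained = {"backbone": rng.standard_normal((EMBED, EMBED)) / np.sqrt(EMBED)}

def finetune(demos, new_obs_dim, new_action_dim, lr=1e-2, epochs=50):
    """Attach fresh input/output layers around the pretrained backbone
    and fit the new embodiment from a handful of demonstrations."""
    params = {
        "tokenizer": rng.standard_normal((new_obs_dim, EMBED)) * 0.1,  # new
        "backbone": pretrained["backbone"].copy(),                     # reused
        "head": rng.standard_normal((EMBED, new_action_dim)) * 0.1,    # new
    }
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for obs, act in demos:
            h = np.tanh(obs @ params["tokenizer"]) @ params["backbone"]
            err = h @ params["head"] - act
            epoch_loss += float(np.mean(err ** 2))
            # Gradient step on the new head only (simplest variant);
            # full finetuning would also update tokenizer and backbone.
            params["head"] -= lr * np.outer(h, err)
        losses.append(epoch_loss / len(demos))
    return params, losses

# Toy "demonstrations": observation -> action pairs for a new robot setup
# with a 10-dim observation (e.g. including force/torque) and 4-dim action.
demos = [(rng.standard_normal(10), rng.standard_normal(4) * 0.1)
         for _ in range(20)]
params, losses = finetune(demos, new_obs_dim=10, new_action_dim=4)
print(losses[0], losses[-1])  # training loss should decrease
```

The design choice this illustrates is that only the small new layers must be learned from the roughly 100 target demonstrations, which is why finetuning can finish in a few hours on a single consumer GPU.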

Implications and Speculations

The introduction of Octo presents several potential impacts on both practical and theoretical fronts:

  1. Practical Applications:

    • Reduced Data Collection Needs: Fine-tuning large pre-trained models can significantly cut down the amount of new data needed for training.
    • Efficient Multitasking: Versatile models like Octo could be deployed in scenarios requiring a multitude of tasks without needing extensive reconfiguration or retraining.
  2. Theoretical Insights:

    • Scalable Training: The success of transformer architectures in robotic policies could inspire re-evaluation of traditional policy architectures.
    • Generalization: This work demonstrates how large-scale pre-training can help in generalizing to new tasks and setups, an area to be further explored.

Future Directions

While Octo is a substantial step forward, there are areas ripe for further development:

  • Enhanced Modalities: Improving wrist camera and proprioceptive input integration.
  • Larger Data Sets: Incorporating more diverse and larger datasets could potentially yield even more robust policies.
  • Broader Robot Varieties: Expanding beyond single and dual-arm manipulators to encompass mobile robots and other configurations.

Conclusion

Octo represents a significant advancement in creating versatile, adaptable robot policies. With its open-source nature, it provides a valuable resource for the robotics community to build upon, fostering further innovation and practical applications.

For more details and to access the Octo model and resources, you can visit their website.
