
RVT-2: Learning Precise Manipulation from Few Demonstrations

(2406.08545)
Published Jun 12, 2024 in cs.RO, cs.AI, and cs.CV

Abstract

In this work, we study how to build a robotic system that can solve multiple 3D manipulation tasks given language instructions. To be useful in industrial and household domains, such a system should be capable of learning new tasks with few demonstrations and solving them precisely. Prior works, like PerAct and RVT, have studied this problem; however, they often struggle with tasks requiring high precision. We study how to make them more effective, precise, and fast. Using a combination of architectural and system-level improvements, we propose RVT-2, a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT. RVT-2 achieves a new state-of-the-art on RLBench, improving the success rate from 65% to 82%. RVT-2 is also effective in the real world, where it can learn tasks requiring high precision, like picking up and inserting plugs, with just 10 demonstrations. Visual results, code, and trained model are provided at: https://robotic-view-transformer-2.github.io/.

RVT-2 model performing millimeter-level precision 3D manipulation tasks based on language instructions.

Overview

  • RVT-2 introduces a multitask 3D manipulation model which achieves high precision with minimal training data, using approximately ten demonstrations per task and a single third-person RGB-D camera.

  • Key architectural features of RVT-2 include a two-stage design for precise manipulation, convex upsampling for high-resolution heatmaps, and optimized parameters for GPU efficiency.

  • The system-level enhancements comprise a custom CUDA-based projection renderer, mixed precision training, and efficient GPU implementations of attention layers, leading to faster training and robust performance in both simulation and real-world settings.


The paper titled RVT-2: Learning Precise Manipulation from Few Demonstrations by Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox presents significant advancements in the domain of robotic manipulation. It introduces RVT-2, a multitask 3D manipulation model designed to achieve high precision with minimal training data, specifically leveraging approximately ten demonstrations per task and using a single third-person RGB-D camera.

Introduction

The core objective of the study is to develop a robotic system capable of solving various 3D manipulation tasks guided by language instructions. The system is expected to exhibit high precision and learn new tasks with limited demonstrations. Previous models like PerAct and RVT have made strides in this direction but fall short on tasks demanding high precision. RVT-2 addresses these shortcomings by improving upon RVT's architecture and introducing several system-level enhancements.

Architectural Improvements

The RVT-2 model is characterized by several architectural innovations:

  1. Two-Stage Design: RVT-2 employs a coarse-to-fine, two-stage approach for precise manipulation. It first predicts the region of interest from coarse views, then re-renders zoomed-in views of that region for fine-grained pose prediction, thereby enhancing precision.
  2. Convex Upsampling: To optimize memory usage and maintain high performance, RVT-2 replaces traditional transposed convolutions with convex upsampling, which directly predicts high-resolution heatmaps from token-level features.
  3. GPU-Friendly Parameters: Image and patch sizes are chosen to map efficiently onto GPU hardware, yielding faster processing without degrading task performance.
  4. Location-Conditioned Rotation Prediction: RVT-2 improves upon RVT by using local feature conditioning for predicting end-effector rotations, facilitating more accurate manipulation in complex scenes with multiple valid end-effector locations.
  5. Reduced Virtual Views: In the two-stage setup, three virtual views (front, top, and right) suffice, reducing computational load while maintaining performance.
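The coarse-to-fine heatmap idea behind convex upsampling can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (which operates on transformer token features in PyTorch); the function name, shapes, and 3x3 neighborhood convention here are illustrative assumptions. The core idea is that each high-resolution pixel is a convex combination (softmax-normalized weights) of its 3x3 coarse-resolution neighborhood, with the weights predicted per fine pixel:

```python
import numpy as np

def convex_upsample(coarse, weights, k):
    """Upsample a coarse heatmap by factor k.

    Each fine pixel is a convex combination of its 3x3 coarse neighborhood.
    coarse:  (H, W) coarse heatmap
    weights: (H, W, k, k, 9) raw logits, softmaxed over the 9 neighbors
    returns: (H*k, W*k) upsampled heatmap
    """
    H, W = coarse.shape
    # Softmax over the 9 neighbors so each fine pixel gets convex weights.
    w = np.exp(weights - weights.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    padded = np.pad(coarse, 1)  # zero-pad so border pixels have 9 neighbors
    out = np.zeros((H * k, W * k))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + 3, j:j + 3].reshape(9)  # the 9 coarse neighbors
            # (k, k, 9) @ (9,) -> (k, k) block of fine pixels
            out[i * k:(i + 1) * k, j * k:(j + 1) * k] = w[i, j] @ patch
    return out
```

Because the weights are convex, the upsampled heatmap stays within the range of the coarse values (plus the zero padding), which is one reason this is a memory-friendly alternative to transposed convolutions for producing high-resolution heatmaps from token-level features.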

System-Related Enhancements

In addition to architectural improvements, RVT-2 benefits from several system-level optimizations:

  1. Custom Point-Renderer: A tailored CUDA-based projection renderer replaces the generic PyTorch3D renderer, enhancing rendering efficiency and memory usage.
  2. Optimized Training Pipeline: The adoption of mixed precision training, 8-bit LAMB optimizer, and efficient GPU implementations of attention layers contribute to a significantly faster training process.
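To give a concrete sense of what a point-projection renderer does, here is a toy NumPy sketch of orthographic point splatting with a depth buffer. The paper's renderer is a custom CUDA kernel; this pure-Python version, with assumed names and a single top-down view, only illustrates the basic operation: project each 3D point onto a pixel grid and keep the color of the point nearest the virtual camera.

```python
import numpy as np

def render_topdown(points, colors, res, bounds):
    """Orthographic top-down splat of a colored point cloud.

    points: (N, 3) xyz coordinates
    colors: (N, 3) per-point RGB
    res:    output image is res x res
    bounds: (min_xyz, max_xyz) of the workspace
    Keeps the highest point (largest z) per pixel via a depth buffer.
    """
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    # Map x, y into pixel coordinates within the workspace bounds.
    uv = (points[:, :2] - lo[:2]) / (hi[:2] - lo[:2]) * res
    uv = np.clip(uv.astype(int), 0, res - 1)
    image = np.zeros((res, res, 3))
    zbuf = np.full((res, res), -np.inf)  # depth buffer: keep max z per pixel
    for (u, v), z, c in zip(uv, points[:, 2], colors):
        if z > zbuf[v, u]:
            zbuf[v, u] = z
            image[v, u] = c
    return image
```

A renderer of this form has no mesh rasterization stage at all, which is why a specialized implementation can be much lighter on compute and memory than a general-purpose renderer such as PyTorch3D's.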

Experimental Results

RVT-2's performance was robustly evaluated in both simulation and real-world settings. In simulation, RVT-2 exhibited state-of-the-art performance on the RLBench benchmark, attaining an average success rate of 81.4%, surpassing prior methods. The model demonstrated its efficiency gains by training 6X faster than its predecessor, RVT, and operating at 20.6 fps during inference.

Real-World Evaluation:

  • Tasks included stacking, peg insertion, and plug insertion.
  • RVT-2 trained with only ten demonstrations per task achieved superior accuracy, showcasing its potential for precise manipulation with minimal training.

Implications

Practical Implications: RVT-2 presents considerable advancements for industrial and household robotic applications where precise manipulation is critical. Its ability to learn from a few demonstrations makes it highly adaptable for environments with varying task requirements.

Theoretical Implications: The architectural and system-level improvements provide a comprehensive framework for developing efficient and precise multitask manipulation models, potentially influencing future research in robotic manipulation and control architectures.

Future Directions

RVT-2 achieves substantial progress but also identifies areas for further research:

  • Enhancing generalization to unseen object instances.
  • Incorporating additional sensory data (e.g., force feedback) for fine-grained adjustments.
  • Addressing the instability observed in multi-task optimization to maintain consistent task performance throughout training.

Conclusion

The advancements encapsulated in RVT-2 significantly propel the capabilities of robotic manipulation systems. It merges architectural sophistication with system efficiency, setting a new benchmark in multitask 3D manipulation, particularly for tasks requiring high precision and few-shot learning. The findings and methodologies presented in this work will likely inspire subsequent innovations in robotic manipulation research.
