
OpenVLA: An Open-Source Vision-Language-Action Model

(2406.09246)
Published Jun 13, 2024 in cs.RO and cs.LG

Abstract

Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.

OpenVLA predicts 7-D robot control actions from an image observation and a natural-language instruction, using a fused vision encoder together with a Llama 2 language backbone.
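To make this interface concrete, below is a minimal inference sketch in PyTorch. The checkpoint name, prompt format, and `predict_action` helper follow the publicly released HuggingFace integration, but they are stated here as assumptions; consult the official repository for the exact interface.

```python
# Minimal OpenVLA inference sketch (checkpoint name and predict_action helper
# assumed from the public release; verify against the official repo).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("observation.png")  # current third-person camera frame
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-D continuous action: (dx, dy, dz, droll, dpitch, dyaw, gripper).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# robot.act(action)  # hypothetical controller call
```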

Overview

  • The paper introduces OpenVLA, a 7B-parameter vision-language-action model for robotic control, trained on 970,000 real-world robot demonstrations to serve as a generalist robot manipulation policy.

  • OpenVLA combines a Llama 2 language model with a fused DINOv2/SigLIP visual encoder, outperforms the much larger RT-2-X model, and can be fine-tuned on consumer-grade GPUs using low-rank adaptation (LoRA) and served efficiently with quantization.

  • Evaluations across visual, motion, physical, and semantic generalization, conducted on the BridgeData V2 WidowX robot and a Google mobile manipulation robot, confirm OpenVLA’s robustness and efficiency, and its open-source release enables broader research and application.

Insightful Overview of "OpenVLA: An Open-Source Vision-Language-Action Model"

The paper "OpenVLA: An Open-Source Vision-Language-Action Model" introduces OpenVLA, a 7B-parameter vision-language-action (VLA) model for robotic control. Trained on a substantial dataset of 970,000 robot demonstrations (referred to as the Open X-Embodiment dataset), OpenVLA represents a significant step towards versatile, generalist robot manipulation policies. The paper methodically tackles two main challenges: the need for open accessibility and the efficient fine-tuning of VLA models for new tasks. The results demonstrate that OpenVLA achieves superior performance over existing models while being an order of magnitude more efficient in terms of parameters.

Key Contributions and Results

Model Architecture and Training: OpenVLA integrates a Llama 2 language model with visual encoders from DINOv2 and SigLIP, capturing visual features at multiple granularities. A core component of OpenVLA's success is its extensive training on diverse robotic manipulation trajectories across multiple robot embodiments. This diverse dataset includes tasks and environments that promote generalization to new object appearances, arrangements, and instructions.
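To make the "fused" encoder concrete, here is a minimal PyTorch sketch of the idea: patch features from DINOv2 and SigLIP are extracted for the same image, concatenated channel-wise, and projected into the language model's embedding space. The backbone modules and dimensions below are placeholders for illustration, not the exact OpenVLA implementation.

```python
import torch
import torch.nn as nn

class FusedVisionEncoder(nn.Module):
    """Channel-wise fusion of two pretrained vision backbones (sketch).

    `dino` and `siglip` are assumed to map an image batch (B, 3, H, W)
    to patch features of shape (B, N, d_dino) and (B, N, d_siglip).
    """

    def __init__(self, dino: nn.Module, siglip: nn.Module,
                 d_dino: int, d_siglip: int, d_llm: int):
        super().__init__()
        self.dino, self.siglip = dino, siglip
        # Small MLP projector into the Llama 2 token-embedding space.
        self.projector = nn.Sequential(
            nn.Linear(d_dino + d_siglip, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats_dino = self.dino(images)      # (B, N, d_dino): spatial / low-level cues
        feats_siglip = self.siglip(images)  # (B, N, d_siglip): semantic cues
        fused = torch.cat([feats_dino, feats_siglip], dim=-1)  # channel-wise concat
        return self.projector(fused)        # (B, N, d_llm): visual tokens for the LLM
```

The projected patch tokens are prepended to the tokenized instruction and passed to the Llama 2 backbone, which autoregressively predicts discretized action tokens that are mapped back to continuous 7-D controls.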

Performance Metrics: OpenVLA outperforms the 55B-parameter RT-2-X model by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, despite using roughly seven times fewer parameters. When fine-tuned to new settings, it also outperforms from-scratch imitation learning methods such as Diffusion Policy by 20.4% in multi-task environments requiring language grounding and manipulation of multiple objects. Notably, OpenVLA can be fine-tuned on consumer-grade GPUs via low-rank adaptation (LoRA) and served with quantization without compromising task success.
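As an illustration of the parameter-efficient route, the sketch below attaches LoRA adapters to a loaded VLA using the HuggingFace peft library. The rank, target modules, and checkpoint name are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Load the pretrained VLA (checkpoint name assumed for illustration).
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# LoRA: freeze the 7B backbone and train small low-rank adapters instead.
lora_cfg = LoraConfig(
    r=32,                         # adapter rank (assumed value)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # attach adapters to every linear layer
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()  # only a small fraction of weights are trainable

# A standard imitation-learning loop (cross-entropy on action tokens) then
# optimizes just the adapter weights, which is what makes single-GPU
# fine-tuning feasible.
```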

Evaluation on Multiple Axes

The evaluations span several dimensions of generalization, including visual (unseen object appearances and backgrounds), motion (novel object positions and orientations), physical (different object shapes and sizes), and semantic (unseen task instructions and concepts). These evaluations were conducted on two key platforms: the BridgeData V2 WidowX robot and the Google mobile manipulation robot.

BridgeData V2 Evaluations: OpenVLA's performance on the BridgeData V2 tasks emphasizes its robustness in handling visual and semantic generalizations with multiple objects and complex task dynamics. The model significantly surpasses RT-2-X, RT-1-X, and Octo in these evaluations, demonstrating the benefits of the diverse training dataset and the fused vision encoder components.

Practical Efficiency: OpenVLA demonstrates practical efficiency not only in terms of parameter count and resource requirements for training and inference but also in its ability to be adapted rapidly to new setups. The integration of low-bit quantization techniques and LoRA fine-tuning allows OpenVLA to function effectively on consumer-grade hardware. This represents a considerable advantage for real-world deployment scenarios where access to high-end computational resources may be restricted.
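For serving on memory-constrained hardware, the sketch below loads the model with 4-bit weight quantization via bitsandbytes through the transformers API; the checkpoint name and exact settings are assumptions for illustration.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit weight quantization (bitsandbytes) to fit the 7B model on a consumer GPU.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=bnb_cfg,
    trust_remote_code=True,
    device_map="auto",
)
# Inference then proceeds as in the earlier sketch; the paper reports that
# 4-bit serving preserves task success rate while substantially reducing
# GPU memory relative to bfloat16.
```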

Implications and Future Directions

OpenVLA sets a new standard in the field of robotic manipulation by combining the strengths of large-scale internet-pretrained vision and language models with robust, diverse robot demonstration datasets. Its open-source nature paves the way for broader research and application, potentially accelerating advancements in multi-robot coordination, fine-tuning strategies, and the deployment of sophisticated robotics systems in varying environments.

Future Research Directions:

  • Extending Sensory Inputs: Future iterations might include multiple sensory modalities such as proprioceptive data, extending beyond single-image observations to encompass more comprehensive state representations.
  • Higher Frequency Control: Improvements in the inference speed of OpenVLA are critical for adapting it to high-frequency control systems, enabling more complex and precise manipulation tasks.
  • Model Architecture and Dataset Diversification: Investigations into the impact of larger base VLMs, co-training strategies on mixed datasets, and exploring different visual features could further enhance the model’s robustness and versatility.

In conclusion, OpenVLA makes a significant contribution to the field by addressing key limitations of previous models, offering superior performance and practical efficiency, and paving the way for future advances in robotic control through open-source collaboration and innovation.
