
OpenVLA: An Open-Source Vision-Language-Action Model

(2406.09246)
Published Jun 13, 2024 in cs.RO and cs.LG

Abstract

Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.

OpenVLA predicts 7-D robot control actions from an image observation and a natural-language instruction, using a fused vision encoder together with a Llama 2 language backbone.
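To make this interface concrete, below is a minimal inference sketch in PyTorch. The checkpoint name, prompt format, and `predict_action` helper follow the publicly released HuggingFace integration, but they are stated here as assumptions; consult the official repository for the exact interface.

```python
# Minimal OpenVLA inference sketch (checkpoint name and predict_action helper
# assumed from the public release; verify against the official repo).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("observation.png")  # current third-person camera frame
prompt = "In: What action should the robot take to pick up the cup?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# Returns a 7-D continuous action: (dx, dy, dz, droll, dpitch, dyaw, gripper).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
# robot.act(action)  # hypothetical controller call
```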

Overview

  • The paper introduces OpenVLA, a 7B-parameter vision-language-action model for robotic control, trained on 970,000 real-world robot demonstrations to serve as a generalist robot manipulation policy.

  • OpenVLA combines a Llama 2 language model with a fused DINOv2/SigLIP visual encoder, outperforms the much larger RT-2-X model, and can be fine-tuned on consumer-grade GPUs using low-rank adaptation (LoRA) and served efficiently with quantization.

  • Evaluations across visual, motion, physical, and semantic generalization, conducted on the BridgeData V2 WidowX robot and a Google mobile manipulation robot, confirm OpenVLA’s robustness and efficiency, and its open-source release enables broader research and application.

Insightful Overview of "OpenVLA: An Open-Source Vision-Language-Action Model"

The paper "OpenVLA: An Open-Source Vision-Language-Action Model" introduces OpenVLA, a 7B-parameter vision-language-action (VLA) model for robotic control. Trained on a substantial dataset of 970,000 robot demonstrations (referred to as the Open X-Embodiment dataset), OpenVLA represents a significant step towards versatile, generalist robot manipulation policies. The paper methodically tackles two main challenges: the need for open accessibility and the efficient fine-tuning of VLA models for new tasks. The results demonstrate that OpenVLA achieves superior performance over existing models while being an order of magnitude more efficient in terms of parameters.

Key Contributions and Results

Model Architecture and Training: OpenVLA integrates a Llama 2 language model with visual encoders from DINOv2 and SigLIP, capturing visual features at multiple granularities. A core component of OpenVLA's success is its extensive training on diverse robotic manipulation trajectories across multiple robot embodiments. This diverse dataset includes tasks and environments that promote generalization to new object appearances, arrangements, and instructions.
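To make the "fused" encoder concrete, here is a minimal PyTorch sketch of the idea: patch features from DINOv2 and SigLIP are extracted for the same image, concatenated channel-wise, and projected into the language model's embedding space. The backbone modules and dimensions below are placeholders for illustration, not the exact OpenVLA implementation.

```python
import torch
import torch.nn as nn

class FusedVisionEncoder(nn.Module):
    """Channel-wise fusion of two pretrained vision backbones (sketch).

    `dino` and `siglip` are assumed to map an image batch (B, 3, H, W)
    to patch features of shape (B, N, d_dino) and (B, N, d_siglip).
    """

    def __init__(self, dino: nn.Module, siglip: nn.Module,
                 d_dino: int, d_siglip: int, d_llm: int):
        super().__init__()
        self.dino, self.siglip = dino, siglip
        # Small MLP projector into the Llama 2 token-embedding space.
        self.projector = nn.Sequential(
            nn.Linear(d_dino + d_siglip, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats_dino = self.dino(images)      # (B, N, d_dino): spatial / low-level cues
        feats_siglip = self.siglip(images)  # (B, N, d_siglip): semantic cues
        fused = torch.cat([feats_dino, feats_siglip], dim=-1)  # channel-wise concat
        return self.projector(fused)        # (B, N, d_llm): visual tokens for the LLM
```

The projected patch tokens are prepended to the tokenized instruction and passed to the Llama 2 backbone, which autoregressively predicts discretized action tokens that are mapped back to continuous 7-D controls.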

Performance Metrics: OpenVLA outperforms the 55B-parameter RT-2-X model by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, despite using roughly seven times fewer parameters. When fine-tuned to new settings, it also outperforms from-scratch imitation learning methods such as Diffusion Policy by 20.4% in multi-task environments requiring language grounding and manipulation of multiple objects. Notably, OpenVLA can be fine-tuned on consumer-grade GPUs via low-rank adaptation (LoRA) and served with quantization without compromising task success.
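As an illustration of the parameter-efficient route, the sketch below attaches LoRA adapters to a loaded VLA using the HuggingFace peft library. The rank, target modules, and checkpoint name are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Load the pretrained VLA (checkpoint name assumed for illustration).
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# LoRA: freeze the 7B backbone and train small low-rank adapters instead.
lora_cfg = LoraConfig(
    r=32,                         # adapter rank (assumed value)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # attach adapters to every linear layer
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()  # only a small fraction of weights are trainable

# A standard imitation-learning loop (cross-entropy on action tokens) then
# optimizes just the adapter weights, which is what makes single-GPU
# fine-tuning feasible.
```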

Evaluation on Multiple Axes

The evaluations span several dimensions of generalization, including visual (unseen object appearances and backgrounds), motion (novel object positions and orientations), physical (different object shapes and sizes), and semantic (unseen task instructions and concepts). These evaluations were conducted on two key platforms: the BridgeData V2 WidowX robot and the Google mobile manipulation robot.

BridgeData V2 Evaluations: OpenVLA's performance on the BridgeData V2 tasks emphasizes its robustness in handling visual and semantic generalizations with multiple objects and complex task dynamics. The model significantly surpasses RT-2-X, RT-1-X, and Octo in these evaluations, demonstrating the benefits of the diverse training dataset and the fused vision encoder components.

Practical Efficiency: OpenVLA demonstrates practical efficiency not only in terms of parameter count and resource requirements for training and inference but also in its ability to be adapted rapidly to new setups. The integration of low-bit quantization techniques and LoRA fine-tuning allows OpenVLA to function effectively on consumer-grade hardware. This represents a considerable advantage for real-world deployment scenarios where access to high-end computational resources may be restricted.
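For serving on memory-constrained hardware, the sketch below loads the model with 4-bit weight quantization via bitsandbytes through the transformers API; the checkpoint name and exact settings are assumptions for illustration.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit weight quantization (bitsandbytes) to fit the 7B model on a consumer GPU.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=bnb_cfg,
    trust_remote_code=True,
    device_map="auto",
)
# Inference then proceeds as in the earlier sketch; the paper reports that
# 4-bit serving preserves task success rate while substantially reducing
# GPU memory relative to bfloat16.
```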

Implications and Future Directions

OpenVLA sets a new standard in the field of robotic manipulation by combining the strengths of large-scale internet-pretrained vision and language models with robust, diverse robot demonstration datasets. Its open-source nature paves the way for broader research and application, potentially accelerating advancements in multi-robot coordination, fine-tuning strategies, and the deployment of sophisticated robotics systems in varying environments.

Future Research Directions:

  • Extending Sensory Inputs: Future iterations might include multiple sensory modalities such as proprioceptive data, extending beyond single-image observations to encompass more comprehensive state representations.
  • Higher Frequency Control: Improvements in the inference speed of OpenVLA are critical for adapting it to high-frequency control systems, enabling more complex and precise manipulation tasks.
  • Model Architecture and Dataset Diversification: Investigations into the impact of larger base VLMs, co-training strategies on mixed datasets, and exploring different visual features could further enhance the model’s robustness and versatility.

In conclusion, OpenVLA makes a significant contribution to the field by addressing key limitations of previous models, offering superior performance and practical efficiency, and paving the way for future advances in robotic control through open-source collaboration and innovation.
