DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (2403.12945v2)

Published 19 Mar 2024 in cs.RO

Abstract: The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 86 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.


Summary

  • The paper introduces DROID, a robot manipulation dataset spanning 564 scenes and 86 tasks that improves policies by providing diverse, high-quality training data.
  • The methodology employs a standardized platform, a Franka Emika Panda arm with three synchronized Zed stereo cameras, replicated across 13 institutions to capture comprehensive interaction data.
  • Experimental evaluations reveal that DROID-trained policies achieve success-rate improvements of up to 22% in distribution and up to 17% out of distribution, emphasizing the benefits of scene diversity.

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Overview

The paper "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset" introduces a new, extensive dataset aimed at improving robotic manipulation policies by leveraging varied and high-quality training data. Unlike prior datasets constrained by limited environments, DROID spans a significant diversity of scenes, tasks, and interactions, collected across multiple continents and institutions to push the boundaries of policy generalization in robotic manipulation.

Methodology

Data Collection Platform

DROID's robust data collection is powered by a standardized hardware setup used across 13 institutions worldwide:

  • Robot and Sensors: The platform uses a 7-DoF Franka Emika Panda arm equipped with a Robotiq 2F-85 gripper. Visual input is captured by three synchronized Zed stereo cameras, two at third-person viewpoints and one mounted on the wrist, ensuring a comprehensive view of the robotic interaction space. A hedged data-loading sketch follows the figure caption below.

    Figure 1: The DROID robot platform employed for uniform data collection across diverse institutions.
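
DROID is distributed in RLDS format, so episodes can be streamed with TensorFlow Datasets. The snippet below is a minimal, hedged loading sketch: the dataset name, bucket path, and feature keys reflect the public release conventions as commonly documented, and should be verified against the official DROID documentation.

```python
# Minimal sketch of streaming DROID episodes (RLDS format) via TFDS.
# The dataset name, GCS path, and feature keys are assumptions based on
# the public release; verify against the DROID documentation.
import tensorflow_datasets as tfds

ds = tfds.load("droid", data_dir="gs://gresearch/robotics", split="train")

for episode in ds.take(1):                  # one demonstration trajectory
    for step in episode["steps"]:           # one control timestep
        obs = step["observation"]
        side_view = obs["exterior_image_1_left"]    # third-person Zed camera
        wrist_view = obs["wrist_image_left"]        # wrist-mounted Zed camera
        instruction = step["language_instruction"]  # natural-language task label
        action = step["action"]                     # commanded robot action
```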

Scene and Task Diversity

The dataset is exceptionally broad in scope, encompassing:

  • 564 unique scenes across 52 buildings, reflecting real-world environments such as homes and offices.
  • 86 distinct tasks, labeled with natural language instructions, broadening task representation and supporting generalization (a rough counting sketch follows this list).
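
As a rough, hypothetical continuation of the loading sketch above, instruction-level task diversity can be probed by tallying distinct language annotations over a sample of episodes; the `language_instruction` field name is again an assumption from the release format.

```python
# Illustrative continuation of the loading sketch above: counting distinct
# language instructions in a sample of episodes as a crude proxy for task
# diversity. Assumes `ds` from the previous snippet.
from collections import Counter

instruction_counts = Counter()
for episode in ds.take(500):                           # sample of demonstrations
    first_step = next(iter(episode["steps"].take(1)))  # peek at the first step
    text = first_step["language_instruction"].numpy().decode("utf-8")
    instruction_counts[text.lower()] += 1

print(len(instruction_counts), "distinct instructions in the sample")
```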

DROID's collection protocol prioritizes diverse scene selection and frequent environmental adjustments, enhancing the dataset's applicability to varied robotic learning tasks.

Figure 2: Distribution of verbs and objects in DROID, highlighting diverse task behaviors.

Dataset Characteristics

Figure 2 conveys the extensive range of verbs and manipulated objects in DROID. This diversity is further detailed by analyzing the marginal and joint distributions of verbs and objects, underscoring the dataset's potential to broaden the scope of policy learning. A sketch of how such verb-object statistics could be computed follows.
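
As an illustration of how such verb-object statistics might be reproduced, the snippet below extracts (verb, direct-object) pairs from instruction strings using spaCy's dependency parser. This is a plausible method, not necessarily the paper's exact annotation pipeline.

```python
# Illustrative sketch: tallying (verb, object) pairs from instruction
# strings with spaCy. Not necessarily the paper's annotation pipeline.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def verb_object_pairs(instructions):
    """Count (verb lemma, direct-object lemma) pairs across instructions."""
    pairs = Counter()
    for doc in nlp.pipe(instructions):
        for token in doc:
            # A direct object attached to a verb, e.g. "close the drawer".
            if token.dep_ == "dobj" and token.head.pos_ == "VERB":
                pairs[(token.head.lemma_, token.lemma_)] += 1
    return pairs

print(verb_object_pairs(["Close the drawer.", "Pick up the mug from the table."]))
# e.g. Counter({('close', 'drawer'): 1, ('pick', 'mug'): 1})
```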

DROID addresses several key diversity axes: task diversity, interaction-point diversity, and scene diversity. This deliberate breadth supports robot policy training with better generalization capabilities. One plausible way to extract the interaction points visualized in Figure 3 is sketched after the caption below.

Figure 3: Visualization of 3D interaction points, emphasizing DROID's extensive workspace coverage.
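
A plausible way to compute interaction points like those visualized in Figure 3 is to log the end-effector position whenever the gripper transitions from open to closed. The sketch below assumes hypothetical `cartesian_position` and `gripper_position` observation fields; the actual field names depend on the release format.

```python
# Hedged sketch: extracting 3D "interaction points" as end-effector
# positions at open -> closed gripper transitions. The observation field
# names are assumptions about the DROID step format.
import numpy as np

def interaction_points(steps, close_threshold=0.5):
    """Return an (N, 3) array of grasp-onset end-effector positions."""
    points, prev_closed = [], False
    for step in steps:
        obs = step["observation"]
        closed = float(obs["gripper_position"]) > close_threshold
        if closed and not prev_closed:  # grasp onset
            points.append(np.asarray(obs["cartesian_position"])[:3])
        prev_closed = closed
    return np.stack(points) if points else np.empty((0, 3))
```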

Experimental Evaluation

Policy Performance and Robustness

Rigorous experiments confirm that policies trained with DROID data outperform those trained on traditional datasets:

  • Success Metrics: DROID-trained policies demonstrate up to a 22% improvement in success rate within distribution and up to 17% in out-of-distribution (OOD) scenarios.

    Figure 4: Performance comparison highlights DROID's superiority in enhancing both in-distribution and OOD success rates.

Impact of Scene Diversity

Further evaluations illustrate that even a subset of DROID with varied scenes surpasses less diverse counterparts, reinforcing scene diversity as pivotal for effective policy development.

Figure 5: Evaluating scene diversity importance, demonstrating superior OOD performance with diverse training scenes.

Conclusion

DROID's introduction marks a significant contribution to the field of robotic manipulation datasets, aiming for generalizable and robust policy learning. The dataset's extensive diversity in tasks, objects, and scenes presents new opportunities for building adaptable robotic systems capable of thriving in real-world settings. As an open-source resource, DROID promises to foster widespread advancements in robotic policy learning and application.
