
RH20T-P: A Primitive-Level Robotic Dataset Towards Composable Generalization Agents (2403.19622v2)

Published 28 Mar 2024 in cs.RO and cs.CV

Abstract: Achieving generalization to out-of-distribution tasks is one of the ultimate goals of learning robotic manipulation. Recent progress in Vision-Language Models (VLMs) has shown that VLM-based task planners can alleviate the difficulty of solving novel tasks by decomposing compound tasks into plans of sequentially executed primitive-level skills that have already been mastered. It is also promising for robotic manipulation to adopt such composable generalization ability, in the form of composable generalization agents (CGAs). However, the community lacks a reliable design of primitive skills and a sufficient amount of primitive-level data annotations. Therefore, we propose RH20T-P, a primitive-level robotic manipulation dataset containing about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios. Each clip is manually annotated according to a set of meticulously designed primitive skills that are common in robotic manipulation. Furthermore, we standardize a plan-execute CGA paradigm and implement an exemplar baseline called RA-P on RH20T-P, whose strong performance on unseen tasks validates that the proposed dataset can offer composable generalization ability to robotic manipulation agents.
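
The plan-execute CGA paradigm described in the abstract can be pictured as a small loop: a VLM-based planner decomposes a compound instruction into a sequence of primitive skills, and per-skill low-level controllers execute them in order. The sketch below is a minimal illustration under assumed names (`Primitive`, `plan_with_vlm`, `PlanExecuteAgent`, and the skill set `move_to`/`grasp`/`release` are all hypothetical); it is not the RA-P implementation or the actual RH20T-P skill taxonomy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Primitive:
    """One primitive-level skill in a plan (illustrative structure)."""
    name: str   # e.g. "move_to", "grasp", "release"
    args: dict  # skill parameters, e.g. a target object or pose


def plan_with_vlm(instruction: str, observation) -> List[Primitive]:
    """Stub for a VLM-based planner that decomposes a compound task
    into primitive skills. A real system would query a vision-language
    model with the instruction and current observation; here a fixed
    plan is returned purely for illustration."""
    return [
        Primitive("move_to", {"target": "cup"}),
        Primitive("grasp", {}),
        Primitive("move_to", {"target": "shelf"}),
        Primitive("release", {}),
    ]


class PlanExecuteAgent:
    """Minimal plan-execute loop: plan once, then run each mastered
    primitive with its corresponding low-level controller."""

    def __init__(self, controllers: Dict[str, Callable[[dict], bool]]):
        self.controllers = controllers  # maps skill name -> executor

    def run(self, instruction: str, observation) -> bool:
        plan = plan_with_vlm(instruction, observation)
        for step in plan:
            ok = self.controllers[step.name](step.args)
            if not ok:  # a failed primitive aborts the whole task
                return False
        return True


if __name__ == "__main__":
    # Dummy controllers that always succeed, standing in for learned skills.
    controllers = {
        name: (lambda args: True)
        for name in ("move_to", "grasp", "release")
    }
    agent = PlanExecuteAgent(controllers)
    print(agent.run("put the cup on the shelf", observation=None))
```

Because the planner emits only primitives the agent already masters, generalization to unseen tasks comes from recomposing known skills rather than learning new low-level behaviors.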
