
NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation (2403.08355v1)

Published 13 Mar 2024 in cs.RO and cs.CV

Abstract: Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simple, task-oriented instructions, e.g., "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning, and without fine-grained human instructions they become extremely difficult for robot manipulation. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks and over 4500 episodes meticulously annotated with fine-grained language instructions. We split each long-horizon task into several steps, each paired with a natural-language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step by step according to the fine-grained instructions. Specifically, we first identify the instruction to execute, taking into account visual observations and the end-effector's current state. Subsequently, our approach facilitates explicit learning through action-prompts and perception-prompts to promote manipulation-aware cross-modality alignment. Leveraging both visual observations and linguistic guidance, our model outputs a sequence of actionable predictions for manipulation, including contact points and end-effector poses. We evaluate our method and baselines on the proposed benchmark NrVLM. The experimental results demonstrate the effectiveness of our approach. For additional details, please refer to https://sites.google.com/view/naturalvlm.
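
The abstract describes a step-by-step pipeline: pick the fine-grained instruction to execute from visual observations and the end-effector's current state, then predict a contact point and end-effector pose conditioned on vision and language. Below is a minimal sketch of that control flow, assuming a point-cloud observation and stubbed model components; every function name, signature, and piece of placeholder logic is a hypothetical illustration, not the authors' implementation or the benchmark's API.

```python
# Hypothetical sketch of the step-by-step inference loop described in the abstract.
# The real instruction-identification module and the action-/perception-prompt
# alignment are not reproduced here; stubs illustrate the control flow only.
from dataclasses import dataclass
import numpy as np


@dataclass
class EndEffectorState:
    position: np.ndarray      # (3,) xyz in the world frame
    quaternion: np.ndarray    # (4,) orientation
    gripper_open: bool


@dataclass
class ActionPrediction:
    contact_point: np.ndarray  # (3,) point on the object to contact
    target_pose: np.ndarray    # (7,) xyz + quaternion for the end-effector
    gripper_open: bool


def identify_current_instruction(instructions, observation, ee_state, step_idx):
    """Stub for instruction identification: choose which fine-grained
    instruction to execute given the observation and end-effector state.
    Here we simply advance through the instructions sequentially."""
    return instructions[min(step_idx, len(instructions) - 1)]


def predict_action(observation, instruction, ee_state):
    """Stub for the vision-and-language-conditioned policy. A real model
    would fuse point-cloud features with the instruction via action- and
    perception-prompts; here we return a placeholder prediction."""
    centroid = observation.mean(axis=0)  # (N, 3) point cloud -> (3,) centroid
    return ActionPrediction(
        contact_point=centroid,
        target_pose=np.concatenate([centroid, np.array([0.0, 0.0, 0.0, 1.0])]),
        gripper_open=False,
    )


def run_episode(instructions, get_observation, execute, max_steps=15):
    """Execute a multi-step task by following fine-grained instructions,
    one predicted action per step."""
    ee_state = EndEffectorState(np.zeros(3), np.array([0.0, 0.0, 0.0, 1.0]), True)
    for step_idx in range(max_steps):
        obs = get_observation()                       # e.g. an (N, 3) point cloud
        instr = identify_current_instruction(instructions, obs, ee_state, step_idx)
        action = predict_action(obs, instr, ee_state)
        ee_state = execute(action)                    # simulator or robot applies the action
    return ee_state
```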

