MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting (2403.03174v3)
Abstract: Open-world generalization requires robotic systems to have a profound understanding of the physical world and the user command to solve diverse and complex tasks. While the recent advancement in vision-language models (VLMs) has offered unprecedented opportunities to solve open-world problems, how to leverage their capabilities to control robots remains a grand challenge. In this paper, we introduce Marking Open-world Keypoint Affordances (MOKA), an approach that employs VLMs to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. By prompting the pre-trained VLM, our approach utilizes the VLM's commonsense knowledge and concept understanding acquired from broad data sources to predict affordances and generate motions. To facilitate the VLM's reasoning in both zero-shot and few-shot settings, we propose a visual prompting technique that annotates marks on images, converting affordance reasoning into a series of visual question-answering problems that are solvable by the VLM. We further explore methods to enhance performance with robot experiences collected by MOKA through in-context learning and policy distillation. We evaluate and analyze MOKA's performance on various table-top manipulation tasks, including tool use, deformable body manipulation, and object rearrangement.
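The mark-based visual prompting idea described in the abstract can be pictured with a short sketch: candidate keypoints are drawn as numbered marks on the observed image, the VLM is asked a multiple-choice visual question about the task, and the selected mark indices are mapped back to pixel coordinates that parameterize the robot's motion. The code below is a minimal illustration under assumed names; `annotate_marks`, `build_question`, `select_keypoints`, the `query_vlm` callable, and the answer format are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of mark-based visual prompting (hypothetical names, not MOKA's code).
from dataclasses import dataclass
from typing import Callable, List, Tuple

from PIL import Image, ImageDraw


@dataclass
class MarkedImage:
    image: Image.Image
    marks: List[Tuple[int, int]]  # pixel coordinates of each numbered mark


def annotate_marks(image: Image.Image, candidates: List[Tuple[int, int]]) -> MarkedImage:
    """Draw numbered marks at candidate keypoints so the VLM can refer to them by index."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for idx, (x, y) in enumerate(candidates):
        draw.ellipse((x - 6, y - 6, x + 6, y + 6), outline="red", width=2)
        draw.text((x + 8, y - 8), str(idx), fill="red")
    return MarkedImage(annotated, candidates)


def build_question(instruction: str, num_marks: int) -> str:
    """Phrase affordance reasoning as a visual multiple-choice question over the marks."""
    return (
        f"The image shows {num_marks} numbered candidate points. "
        f"For the task '{instruction}', which mark should the robot grasp, "
        "and which mark is the target location? Answer as 'grasp=<i>, target=<j>'."
    )


def select_keypoints(
    marked: MarkedImage,
    instruction: str,
    query_vlm: Callable[[Image.Image, str], str],  # any image+text VLM API, injected by the caller
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Query the VLM and map the chosen mark indices back to pixel coordinates."""
    answer = query_vlm(marked.image, build_question(instruction, len(marked.marks)))
    # Expect an answer like "grasp=2, target=5"; parsing is intentionally simplistic here.
    fields = dict(part.split("=") for part in answer.replace(" ", "").split(","))
    grasp = marked.marks[int(fields["grasp"])]
    target = marked.marks[int(fields["target"])]
    return grasp, target
```

In a full pipeline, the selected pixel coordinates would still need to be lifted to 3D (e.g., with a depth camera) to parameterize the grasp and motion; that grounding step is what the paper's point-based affordance representation is designed to handle.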