OVAL-Prompt: Open-Vocabulary Affordance Localization for Robot Manipulation through LLM Affordance-Grounding (arXiv:2404.11000v2)
Abstract: In order for robots to interact with objects effectively, they must understand the form and function of each object they encounter. Essentially, robots need to understand which actions each object affords, and where those affordances can be acted on. Robots are ultimately expected to operate in unstructured human environments, where the set of objects and affordances is not known to the robot before deployment (i.e., the open-vocabulary setting). In this work, we introduce OVAL-Prompt, a prompt-based approach for open-vocabulary affordance localization in RGB-D images. By leveraging a Vision Language Model (VLM) for open-vocabulary object part segmentation and a Large Language Model (LLM) to ground each part segment to an affordance, OVAL-Prompt generalizes to novel object instances, categories, and affordances without domain-specific finetuning. Quantitative experiments demonstrate that, without any finetuning, OVAL-Prompt achieves localization accuracy competitive with supervised baseline models. Moreover, qualitative experiments show that OVAL-Prompt enables affordance-based robot manipulation of open-vocabulary object instances and categories. Project Page: https://ekjt.github.io/OVAL-Prompt/
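The abstract describes a two-stage pipeline: a VLM proposes open-vocabulary part segments, and an LLM grounds the queried affordance to one of those parts. Below is a minimal Python sketch of that flow, assuming hypothetical `segment_parts` and `query_llm` helpers; the concrete VLM/LLM backends, interfaces, and prompt wording here are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of a two-stage open-vocabulary affordance localization
# pipeline, as described in the OVAL-Prompt abstract. `segment_parts` and
# `query_llm` are hypothetical stand-ins for a VLM-based open-vocabulary
# part segmenter and an LLM client; they are assumptions, not the authors'
# actual interfaces.

from dataclasses import dataclass

import numpy as np


@dataclass
class PartSegment:
    label: str        # open-vocabulary part name, e.g. "handle"
    mask: np.ndarray  # boolean pixel mask over the RGB-D image


def segment_parts(rgbd_image: np.ndarray, object_name: str) -> list[PartSegment]:
    """Hypothetical VLM call: open-vocabulary object part segmentation."""
    raise NotImplementedError("would be backed by a part-segmentation VLM")


def query_llm(prompt: str) -> str:
    """Hypothetical LLM call returning a single part label."""
    raise NotImplementedError("would be backed by an LLM such as GPT-4")


def localize_affordance(rgbd_image: np.ndarray, object_name: str,
                        affordance: str) -> np.ndarray:
    """Return a pixel mask for the part of `object_name` affording `affordance`."""
    # Stage 1: open-vocabulary part segmentation with the VLM.
    parts = segment_parts(rgbd_image, object_name)
    part_labels = [p.label for p in parts]

    # Stage 2: prompt the LLM to ground the affordance to one part label.
    prompt = (
        f"An image of a {object_name} contains these parts: "
        f"{', '.join(part_labels)}. Which single part should a robot use "
        f"to '{affordance}'? Answer with one part name from the list."
    )
    chosen = query_llm(prompt).strip().lower()

    # Return the mask of the matching segment; fall back to the first
    # segment if the LLM's answer does not match any proposed label.
    for part in parts:
        if part.label.lower() == chosen:
            return part.mask
    return parts[0].mask
```

A consequence of keeping the grounding step as a plain text prompt is that the approach stays open-vocabulary: supporting a new affordance or object category changes only the prompt string, not any trained weights, which is consistent with the abstract's claim of generalization without domain-specific finetuning.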
Authors: Edmond Tong, Anthony Opipari, Stanley Lewis, Zhen Zeng, Odest Chadwicke Jenkins