QwenGrasp: A Usage of Large Vision-Language Model for Target-Oriented Grasping (2309.16426v3)
Abstract: Target-oriented grasping in unstructured scenes under language control is essential for intelligent robot arm manipulation. Enabling a robot arm to understand human language and execute the corresponding grasping action is a pivotal challenge. In this paper, we propose QwenGrasp, a combined model that couples a large vision-language model with a 6-DoF grasp neural network. QwenGrasp performs 6-DoF grasping of a target object specified by a textual instruction. We design a comprehensive experiment covering six dimensions of instructions to test QwenGrasp across different cases. The results show that QwenGrasp has a superior ability to comprehend human intention: even given vague instructions containing descriptive words or directional information, it grasps the target object accurately. When QwenGrasp receives an instruction that is infeasible or irrelevant to the grasping task, it suspends task execution and provides appropriate feedback to the human, improving safety. In conclusion, powered by a large vision-language model, QwenGrasp can be applied in open-language environments to conduct target-oriented grasping with freely worded instructions.
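To make the described pipeline concrete, below is a minimal Python sketch of a QwenGrasp-style control flow, written under stated assumptions: every name in it (VLMReply, query_vlm, crop_point_cloud, grasp_net.predict, robot.execute) is a hypothetical placeholder for illustration, not the paper's actual API. The vision-language model interprets the instruction, judges feasibility, and grounds the target; a 6-DoF grasp network then proposes grasp poses on the grounded region, and infeasible or irrelevant instructions suspend execution with feedback to the human.

```python
# Minimal sketch of a VLM + 6-DoF-grasp-network pipeline.
# All names below are hypothetical stubs, not the paper's actual interface.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class VLMReply:
    feasible: bool                                     # is the instruction a valid grasp request?
    target_bbox: Optional[Tuple[int, int, int, int]]   # target region in the image, if grounded
    feedback: str                                      # natural-language reply for the human

def query_vlm(rgb_image, instruction: str) -> VLMReply:
    """Prompt a large vision-language model (e.g., Qwen-VL) to interpret
    the instruction, judge its feasibility, and localize the target."""
    raise NotImplementedError  # stub: call the VLM and parse a structured reply

def crop_point_cloud(cloud, bbox):
    """Keep only the scene points that project into the target bounding box."""
    raise NotImplementedError  # stub: back-project the 2D box into the cloud

def run_grasp(rgb_image, cloud, instruction: str, grasp_net, robot) -> str:
    reply = query_vlm(rgb_image, instruction)
    if not reply.feasible:
        # Infeasible or irrelevant instruction: suspend execution and
        # return the model's explanation to the human (safety behavior).
        return reply.feedback
    # Restrict the scene to the grounded target, then let the 6-DoF
    # grasp network propose ranked grasp poses on that region.
    target_cloud = crop_point_cloud(cloud, reply.target_bbox)
    grasps = grasp_net.predict(target_cloud)  # ranked 6-DoF grasp poses
    robot.execute(grasps[0])                  # execute the top-ranked grasp
    return "grasp executed"
```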