
Grounding Classical Task Planners via Vision-Language Models (2304.08587v3)

Published 17 Apr 2023 in cs.RO

Abstract: Classical planning systems have shown great advances in utilizing rule-based human knowledge to compute accurate plans for service robots, but they face challenges due to the strong assumptions of perfect perception and action executions. To tackle these challenges, one solution is to connect the symbolic states and actions generated by classical planners to the robot's sensory observations, thus closing the perception-action loop. This research proposes a visually-grounded planning framework, named TPVQA, which leverages Vision-Language Models (VLMs) to detect action failures and verify action affordances towards enabling successful plan execution. Results from quantitative experiments show that TPVQA surpasses competitive baselines from previous studies in task completion rate.
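
The sketch below illustrates, in broad strokes, how a VLM can close the perception-action loop around a classical planner as the abstract describes: before each symbolic action the VLM is asked whether the action is currently afforded, and after execution it is asked whether the intended effect holds, with replanning on failure. This is a minimal illustration, not the authors' TPVQA implementation; `vlm_answer`, `capture_image`, `execute`, and `replan` are hypothetical stubs standing in for a real VQA model, camera, controller, and planner.

```python
# Minimal sketch (assumptions, not the paper's code) of VLM-based
# affordance checking and failure detection around a symbolic plan.
from typing import List, Tuple

Action = Tuple[str, str]  # (verb, object), e.g. ("grasp", "mug")

def vlm_answer(image, question: str) -> str:
    """Hypothetical VQA call; a real system would query a vision-language model."""
    return "yes"  # stub: always answers affirmatively

def capture_image():
    """Hypothetical camera capture returning the robot's current observation."""
    return None

def execute(action: Action) -> None:
    """Hypothetical low-level executor for one symbolic action."""
    pass

def replan(failed_action: Action) -> List[Action]:
    """Hypothetical call back into the classical planner after a failure."""
    return []

def run_plan(plan: List[Action]) -> bool:
    queue = list(plan)
    while queue:
        verb, obj = queue[0]
        # Affordance verification: is the next action applicable right now?
        if vlm_answer(capture_image(), f"Can the robot {verb} the {obj}?") != "yes":
            queue = replan((verb, obj)) + queue[1:]
            continue
        execute((verb, obj))
        # Failure detection: did the action achieve its intended effect?
        if vlm_answer(capture_image(), f"Did the robot {verb} the {obj}?") != "yes":
            queue = replan((verb, obj)) + queue[1:]
            continue
        queue.pop(0)  # action succeeded; proceed to the next step
    return True

if __name__ == "__main__":
    print(run_plan([("grasp", "mug"), ("place", "mug")]))
```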
