Verifiably Following Complex Robot Instructions with Foundation Models (2402.11498v2)
Abstract: Enabling mobile robots to follow complex natural language instructions is an important yet challenging problem. People want to flexibly express constraints, refer to arbitrary landmarks, and verify behavior when instructing robots. Conversely, robots must disambiguate human instructions into specifications and ground instruction referents in the real world. We propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow expressive and complex open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot's alignment with an instructor's intended motives and affords the synthesis of robot behaviors that are correct-by-construction. We perform a large-scale evaluation, demonstrating our approach on 150 instructions across five real-world environments, which shows its generality and ease of deployment in novel, unstructured domains. In our experiments, LIMP performs comparably with state-of-the-art LLM task planners and LLM code-writing planners on standard open-vocabulary tasks, and additionally achieves a 79% success rate on complex spatiotemporal instructions, whereas LLM and code-writing planners both achieve 38%. See supplementary materials and demo videos at https://robotlimp.github.io
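The core idea the abstract describes, turning a natural-language instruction into a temporal-logic specification whose satisfaction can be checked against the robot's behavior, can be illustrated with a minimal sketch. The formula, proposition names, trace, and evaluator below are hypothetical and written purely for illustration; they are not LIMP's actual representation or implementation.

```python
# Illustrative sketch (not LIMP's actual pipeline): checking a robot's
# execution trace against a finite-trace temporal-logic specification.
# An instruction such as "go to the kitchen, but avoid the hallway until
# you have visited the office" might be translated (e.g., by an LLM) into
# the formula  F(kitchen) & (!hallway U office).  All proposition names
# below are hypothetical.

from dataclasses import dataclass
from typing import Callable, List, Set

Trace = List[Set[str]]  # each step: the set of propositions true at that step


@dataclass
class Formula:
    """Finite-trace temporal formula, evaluated at position i of a trace."""
    eval: Callable[[Trace, int], bool]


def atom(p: str) -> Formula:
    return Formula(lambda t, i: p in t[i])

def neg(f: Formula) -> Formula:
    return Formula(lambda t, i: not f.eval(t, i))

def conj(f: Formula, g: Formula) -> Formula:
    return Formula(lambda t, i: f.eval(t, i) and g.eval(t, i))

def eventually(f: Formula) -> Formula:          # F f
    return Formula(lambda t, i: any(f.eval(t, j) for j in range(i, len(t))))

def until(f: Formula, g: Formula) -> Formula:   # f U g
    return Formula(lambda t, i: any(
        g.eval(t, j) and all(f.eval(t, k) for k in range(i, j))
        for j in range(i, len(t))))


# Specification: F(kitchen) & (!hallway U office)
spec = conj(eventually(atom("kitchen")),
            until(neg(atom("hallway")), atom("office")))

# A hypothetical grounded trace recorded while executing the plan.
trace: Trace = [{"start"}, {"office"}, {"hallway"}, {"kitchen"}]
print(spec.eval(trace, 0))  # True: the office is reached before any hallway visit
```

This sketch uses finite-trace semantics (evaluating the formula over a completed execution), which is one natural way a recorded robot run could be checked against a temporal specification; the paper's correct-by-construction synthesis instead builds behavior that satisfies the specification by design.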
Authors: Benedict Quartey, Eric Rosen, Stefanie Tellex, George Konidaris