
Verifiably Following Complex Robot Instructions with Foundation Models (2402.11498v2)

Published 18 Feb 2024 in cs.RO and cs.AI

Abstract: Enabling mobile robots to follow complex natural language instructions is an important yet challenging problem. People want to flexibly express constraints, refer to arbitrary landmarks and verify behavior when instructing robots. Conversely, robots must disambiguate human instructions into specifications and ground instruction referents in the real world. We propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow expressive and complex open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot's alignment with an instructor's intended motives and affords the synthesis of robot behaviors that are correct-by-construction. We perform a large scale evaluation and demonstrate our approach on 150 instructions in five real-world environments showing the generality of our approach and the ease of deployment in novel unstructured domains. In our experiments, LIMP performs comparably with state-of-the-art LLM task planners and LLM code-writing planners on standard open vocabulary tasks and additionally achieves 79% success rate on complex spatiotemporal instructions while LLM and Code-writing planners both achieve 38%. See supplementary materials and demo videos at https://robotlimp.github.io

Authors (4)
  1. Benedict Quartey
  2. Eric Rosen
  3. Stefanie Tellex
  4. George Konidaris
Citations (7)

Summary

  • The paper introduces LIMP, a system that translates natural language instructions into enriched temporal logic via a two-stage prompting method.
  • It employs dynamic semantic mapping with visual language models to create Referent Semantic Maps for precise object localization.
  • The Progressive Motion Planner integrates finite-state automata with task and motion planning, achieving 90% navigation and 71% manipulation success rates.

Verifiably Following Complex Robot Instructions with Foundation Models

The paper "Verifiably Following Complex Robot Instructions with Foundation Models" introduces Language Instruction grounding for Motion Planning (LIMP), a system designed to enable robots to interpret and execute complex natural language instructions. The approach combines foundation models with temporal logic to handle instructions involving spatiotemporal constraints and open-vocabulary referents.

Key Contributions

  1. Instruction Translation into Temporal Logic: LIMP translates natural language instructions into temporal logic specifications using LLMs. A two-stage prompting technique first maps instructions into traditional linear temporal logic (LTL) formulas and then transforms them into a syntax enriched with Composable Referent Descriptors (CRDs). These CRDs encode descriptive spatial relationships, enabling nuanced referent disambiguation (see the translation sketch after this list).
  2. Dynamic Semantic Mapping: The system generates Referent Semantic Maps (RSMs) to localize specific object instances based on the spatial relationships resolved in the translated instructions. It leverages vision-language models (VLMs) to detect candidate objects and applies spatial reasoning to filter these detections down to the intended referents (see the grounding sketch after this list).
  3. Task and Motion Planning (TAMP): The paper proposes a Progressive Motion Planner that compiles the temporal logic specification into a finite-state automaton and decomposes it into actionable subtasks. This planner coordinates navigation and manipulation skills dynamically, restructuring the environment map into Task Progression Semantic Maps (TPSMs) for real-time path planning. The approach guarantees correct-by-construction behavior through goal-directed and constraint-aware navigation (see the planning sketch after this list).
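
To make the two-stage translation concrete, here is a minimal sketch in Python. The prompt templates, the `llm` callable, and the descriptor syntax (e.g. `near(door::is_left_of(whiteboard))`) are illustrative assumptions, not the paper's actual prompts or grammar.

```python
from typing import Callable

# Stage 1: instruction -> LTL skeleton over abstract propositions (assumed prompt).
STAGE1_PROMPT = (
    "Translate the instruction into a linear temporal logic (LTL) formula "
    "over abstract propositions p1, p2, ...\n"
    "Instruction: {instruction}\nLTL:"
)

# Stage 2: replace each abstract proposition with a composable referent descriptor.
STAGE2_PROMPT = (
    "Rewrite each abstract proposition in the LTL formula as a composable "
    "referent descriptor, e.g. near(mug::is_on_top_of(table)).\n"
    "Instruction: {instruction}\nLTL: {ltl}\nEnriched LTL:"
)

def translate_instruction(instruction: str, llm: Callable[[str], str]) -> str:
    """Two-stage prompting: plain LTL skeleton first, then referent enrichment."""
    ltl = llm(STAGE1_PROMPT.format(instruction=instruction)).strip()
    return llm(STAGE2_PROMPT.format(instruction=instruction, ltl=ltl)).strip()

# Stubbed LLM for illustration; a real system would call a foundation model API.
def fake_llm(prompt: str) -> str:
    if "abstract propositions" in prompt:
        return "F(p1 & F(p2))"
    return "F(near(door::is_left_of(whiteboard)) & F(near(kitchen)))"

print(translate_instruction(
    "Go to the door left of the whiteboard, then go to the kitchen", fake_llm))
```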
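
The next sketch shows the flavor of grounding a spatially qualified referent against open-vocabulary detections, as a Referent Semantic Map might. The detection format and the distance-based `near` scoring are simplifying assumptions rather than LIMP's actual mapping pipeline.

```python
from dataclasses import dataclass
from math import dist

@dataclass
class Detection:
    label: str          # open-vocabulary class name from the VLM detector
    position: tuple     # (x, y) in the metric map frame
    confidence: float   # detector score

def resolve_near(target: str, anchor: str, detections: list) -> Detection:
    """Return the target instance closest to any anchor instance."""
    targets = [d for d in detections if d.label == target]
    anchors = [d for d in detections if d.label == anchor]
    if not targets or not anchors:
        raise ValueError(f"missing detections for {target!r} or {anchor!r}")

    # Score each candidate by distance to its nearest anchor, weighted by confidence.
    def score(d: Detection) -> float:
        return min(dist(d.position, a.position) for a in anchors) / max(d.confidence, 1e-6)

    return min(targets, key=score)

# Example: ground "the mug near the sink" in a toy detection set.
dets = [
    Detection("mug", (1.0, 2.0), 0.9),
    Detection("mug", (6.5, 0.5), 0.8),
    Detection("sink", (1.2, 2.4), 0.95),
]
print(resolve_near("mug", "sink", dets).position)   # -> (1.0, 2.0)
```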
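
Finally, a toy sketch of automaton-guided progressive planning: the hand-written automaton below stands in for the standard LTL-to-automaton translation used in approaches like LIMP, and `navigate_to` abstracts the underlying motion planner.

```python
# Automaton for "F(goal_a & F goal_b)" (reach goal_a, then goal_b).
# States: 0 = need goal_a, 1 = need goal_b, 2 = accepting (task complete).
TRANSITIONS = {
    (0, "goal_a"): 1,
    (1, "goal_b"): 2,
}
SUBGOAL_FOR_STATE = {0: "goal_a", 1: "goal_b"}

def progressive_plan(navigate_to, state: int = 0) -> None:
    """Advance the automaton by navigating to the proposition that progresses it."""
    while state != 2:
        subgoal = SUBGOAL_FOR_STATE[state]
        navigate_to(subgoal)                   # motion planner drives to the region
        state = TRANSITIONS[(state, subgoal)]  # take the automaton transition

# Example with a stubbed motion planner:
progressive_plan(lambda g: print(f"navigating to {g}"))
```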

Strong Numerical Results

The system was tested on 35 complex real-world instructions, achieving a 90% success rate in object-goal navigation and 71% in mobile manipulation tasks. The two-stage prompting approach, which selects semantically similar in-context examples, outperformed single-stage and random-example-selection baselines on several metrics, including referent resolution accuracy and temporal alignment accuracy.

Implications and Future Work

The theoretical and practical implications of LIMP are noteworthy. Practically, it provides a robust framework for robots to interpret and act upon human instructions in diverse, unstructured environments, without requiring pre-established semantic maps. Theoretically, it underscores the potential of interfacing foundation models with traditional planning frameworks, enhancing the explainability and alignment of robot behaviors.

Future work could address limitations such as non-reactivity in dynamic environments and extend capabilities to handle non-finite instruction sequences. Furthermore, refining the optimality of the planning process and exploring the integration of more complex manipulation strategies would continue to enhance the system’s robustness and applicability.

LIMP represents a meaningful step toward verifiable, reliable robotic systems capable of nuanced understanding and execution of human instructions in real-world scenarios. The paper effectively demonstrates the benefits of combining modern foundation models with classical planning methodologies, offering a promising avenue for advances in robotic autonomy.
