RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks (2311.15649v3)
Abstract: Robotic agents must master common sense and long-term sequential decision-making to solve daily tasks given as natural language instructions. Advances in large language models (LLMs) for natural language processing have inspired efforts to use LLMs for complex robot planning. Although LLMs generalize well and comprehend instruction tasks, LLM-generated task plans sometimes lack feasibility and correctness. To address this problem, we propose the RoboGPT agent (our code and dataset will be released soon) for making embodied long-term decisions for daily tasks, with two modules: 1) LLM-based planning with re-planning, which breaks the task into multiple sub-goals; 2) RoboSkill, designed individually for sub-goals to learn better navigation and manipulation skills. The LLM-based planner, called RoboGPT, is enhanced with a new robotic dataset and a Re-Plan module. The new robotic dataset of 67k daily instruction tasks is used to fine-tune the Llama model and obtain RoboGPT. The RoboGPT planner generalizes strongly and can plan hundreds of daily instruction tasks. Additionally, a low-computational Re-Plan module allows plans to adapt flexibly to the environment, thereby addressing the nomenclature diversity challenge. The proposed RoboGPT agent outperforms SOTA methods on the ALFRED daily tasks. Moreover, the RoboGPT planner exceeds SOTA LLM-based planners such as ChatGPT in task-planning rationality on hundreds of unseen daily tasks, and even on tasks from other domains, while retaining the large model's original broad applicability and generality.
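To make the re-planning idea concrete, the following is a minimal Python sketch of one way a plan's object names could be reconciled with the names the environment actually uses. It is not the authors' implementation: the `replan` function, the `ALIASES` table, and the (action, target) sub-goal format are all hypothetical, and plain string similarity from the standard library stands in for whatever matching the Re-Plan module performs.

```python
from difflib import get_close_matches

# Hypothetical alias table (an assumption for illustration only).
ALIASES = {"couch": "sofa", "trash can": "garbagecan", "cup": "mug"}


def replan(plan, observed):
    """Rewrite each (action, target) sub-goal so its target names an observed object.

    plan:     list of (action, target) tuples produced by a planner
    observed: iterable of object names present in the environment
    """
    # Map lowercase names back to the environment's own spelling.
    observed_lower = {name.lower(): name for name in observed}
    new_plan = []
    for action, target in plan:
        key = target.lower()
        # 1) Exact match, ignoring case.
        if key in observed_lower:
            new_plan.append((action, observed_lower[key]))
            continue
        # 2) Known alias of an observed object.
        alias = ALIASES.get(key)
        if alias in observed_lower:
            new_plan.append((action, observed_lower[alias]))
            continue
        # 3) Closest string match; keep the original target if nothing is close.
        close = get_close_matches(key, list(observed_lower), n=1, cutoff=0.6)
        new_plan.append((action, observed_lower[close[0]] if close else target))
    return new_plan


plan = [("find", "couch"), ("pickup", "Mug"), ("put", "trash can")]
observed = ["Sofa", "Mug", "GarbageCan"]
print(replan(plan, observed))
```

In this sketch the planner's output is left untouched when every target already exists in the scene, so the remapping adds only a dictionary lookup per sub-goal in the common case.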