SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks (2305.17390v2)
Abstract: We introduce SwiftSage, a novel agent framework inspired by the dual-process theory of human cognition, designed to excel in action planning for complex interactive reasoning tasks. SwiftSage integrates the strengths of behavior cloning and prompting LLMs to enhance task completion performance. The framework comprises two primary modules: the Swift module, representing fast and intuitive thinking, and the Sage module, emulating deliberate thought processes. The Swift module is a small encoder-decoder LM fine-tuned on the oracle agent's action trajectories, while the Sage module employs LLMs such as GPT-4 for subgoal planning and grounding. We develop a heuristic method to harmoniously integrate the two modules, resulting in a more efficient and robust problem-solving process. In 30 tasks from the ScienceWorld benchmark, SwiftSage significantly outperforms other methods such as SayCan, ReAct, and Reflexion, demonstrating its effectiveness in solving complex interactive tasks.
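The abstract describes a control scheme that interleaves a fast Swift module with a slower, LLM-driven Sage module. The sketch below illustrates one way such a fast/slow loop could be wired together; it is a minimal illustration, not the paper's implementation, and every name in it (SwiftModel, SageModel, should_defer_to_sage, the confidence threshold, and the environment API) is an assumption made for the example.

```python
# A minimal sketch (not the paper's implementation) of a fast/slow control loop
# in the spirit of SwiftSage. All names below -- SwiftModel, SageModel,
# should_defer_to_sage, the confidence threshold, and the environment API --
# are illustrative assumptions.

from dataclasses import dataclass
from typing import List


@dataclass
class SwiftOutput:
    action: str
    confidence: float  # e.g., a normalized decoding score from the small LM


class SwiftModel:
    """Stand-in for the small encoder-decoder LM (fast, intuitive thinking)."""

    def propose(self, observation: str, history: List[str]) -> SwiftOutput:
        # A fine-tuned seq2seq model would decode the next action here.
        return SwiftOutput(action="look around", confidence=0.9)


class SageModel:
    """Stand-in for the LLM-based subgoal planner and grounder (slow thinking)."""

    def plan(self, observation: str, history: List[str]) -> List[str]:
        # An LLM prompt for subgoal planning and action grounding would go here;
        # it returns a short buffer of grounded actions to execute.
        return ["open door to kitchen", "go to kitchen"]


def should_defer_to_sage(out: SwiftOutput, last_feedback: str,
                         conf_threshold: float = 0.5) -> bool:
    """Hypothetical switching rule: fall back to slow thinking when the fast
    module is unsure or the environment rejected the previous action."""
    return out.confidence < conf_threshold or "No known action" in last_feedback


def run_episode(env, swift: SwiftModel, sage: SageModel, max_steps: int = 50) -> float:
    """Interleave fast and slow thinking; assumes a text-game-style env whose
    step() returns (observation, reward, done, feedback)."""
    observation, history, feedback = env.reset(), [], ""
    action_buffer: List[str] = []  # actions queued by the Sage planner
    total_reward = 0.0

    for _ in range(max_steps):
        if action_buffer:  # finish the buffered slow-thinking plan first
            action = action_buffer.pop(0)
        else:
            proposal = swift.propose(observation, history)
            if should_defer_to_sage(proposal, feedback):
                action_buffer = sage.plan(observation, history)
                action = action_buffer.pop(0)
            else:
                action = proposal.action

        observation, reward, done, feedback = env.step(action)
        history.append(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

The paper's actual heuristic relies on environment feedback signals rather than a single confidence threshold; the threshold and feedback string check above are placeholders meant only to show where such a switching decision sits in the loop.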
- Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning, 2022.
- Graph constrained reinforcement learning for natural language action spaces. In International Conference on Learning Representations, 2020.
- How to motivate your dragon: Teaching goal-driven agents to speak and act in fantasy worlds. In North American Chapter of the Association for Computational Linguistics, 2020.
- Leveraging linguistic structure for open domain information extraction. In Annual Meeting of the Association for Computational Linguistics, 2015.
- Thinking fast and slow with deep learning and tree search. ArXiv, abs/1705.08439, 2017.
- Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- Sparks of artificial general intelligence: Early experiments with GPT-4. ArXiv, abs/2303.12712, 2023.
- Deep reasoning networks: Thinking fast and slow. ArXiv, abs/1906.00855, 2019.
- Decision transformer: Reinforcement learning via sequence modeling. In Neural Information Processing Systems, 2021.
- Scaling instruction-finetuned language models. ArXiv, abs/2210.11416, 2022.
- TextWorld: A learning environment for text-based games. In CGW@IJCAI, 2018.
- Thinking fast and slow in AI: The role of metacognition. In International Conference on Machine Learning, Optimization, and Data Science, 2021.
- OpenAGI: When LLM meets domain experts. ArXiv, 2023.
- Deep reinforcement learning with a natural language action space. ArXiv, 2015.
- Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. ArXiv, abs/2201.07207, 2022.
- Daniel Kahneman. Thinking, Fast and Slow. 2011.
- AI2-THOR: An interactive 3D environment for visual AI. ArXiv, 2017.
- On grounded planning for embodied tasks with language models. ArXiv, abs/2209.00465, 2022.
- Chameleon: Plug-and-play compositional reasoning with large language models. ArXiv, abs/2304.09842, 2023.
- Thinking fast and slow: Efficient text-to-visual retrieval with transformers. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9821–9831, 2021.
- Text-based RL agents with commonsense knowledge: New challenges, environments and baselines. In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.
- Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling. In International Conference on Machine Learning (ICML), 2023.
- Improving coherence and consistency in neural sequence models with dual-system, neuro-symbolic reasoning. In Neural Information Processing Systems, 2021.
- Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155, 2022.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- A generalist agent. ArXiv, abs/2205.06175, 2022.
- Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, 2019. Association for Computational Linguistics.
- Toolformer: Language models can teach themselves to use tools. ArXiv, abs/2302.04761, 2023.
- HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. ArXiv, abs/2303.17580, 2023.
- Reflexion: an autonomous agent with dynamic memory and self-reflection. ArXiv, abs/2303.11366, 2023.
- ALFWorld: Aligning text and embodied environments for interactive learning. ArXiv, abs/2010.03768, 2020.
- LLM-Planner: Few-shot grounded planning for embodied agents with large language models. ArXiv, abs/2212.04088, 2022.
- Sequence to sequence learning with neural networks. ArXiv, abs/1409.3215, 2014.
- Behavioral cloning from observation. ArXiv, abs/1805.01954, 2018.
- Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
- ScienceWorld: Is your agent smarter than a 5th grader? In Conference on Empirical Methods in Natural Language Processing, 2022.
- Interactive natural language processing. ArXiv, 2023.
- Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. ArXiv, abs/2302.01560, 2023.
- Dual processes in reasoning? Cognition, 3(2):141–154, 1974.
- Keep calm and explore: Language models for action generation in text-based games. ArXiv, abs/2010.02903, 2020.
- ReAct: Synergizing reasoning and acting in language models. ArXiv, abs/2210.03629, 2022.