Scaling Instructable Agents Across Many Simulated Worlds (2404.10179v3)
Abstract: Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Accomplishing this goal requires learning to ground language in perception and embodied actions, in order to accomplish complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real time using a generic, human-like interface: the inputs are image observations and language instructions, and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
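The generic interface described in the abstract (image observations and language instructions in, keyboard-and-mouse actions out) can be illustrated with a minimal sketch. This is not the SIMA implementation: the names `Observation`, `Action`, `Agent`, `ForwardAgent`, and `run_episode`, the nested-list frame format, and the action encoding are all hypothetical and chosen only to make the input/output contract concrete.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Observation:
    """One step of input: a screen image plus the language instruction."""
    image: list[list[tuple[int, int, int]]]  # hypothetical H x W grid of RGB pixels
    instruction: str


@dataclass(frozen=True)
class Action:
    """A keyboard-and-mouse action (hypothetical encoding, not from the paper)."""
    keys: frozenset[str] = frozenset()
    mouse_dx: int = 0
    mouse_dy: int = 0
    mouse_buttons: frozenset[str] = frozenset()


class Agent(Protocol):
    """Anything that maps an observation to an action fits the interface."""
    def act(self, obs: Observation) -> Action: ...


class ForwardAgent:
    """Placeholder policy: presses 'w' when the instruction mentions 'forward'."""
    def act(self, obs: Observation) -> Action:
        if "forward" in obs.instruction.lower():
            return Action(keys=frozenset({"w"}))
        return Action()  # otherwise do nothing


def run_episode(agent: Agent, frames, instruction: str) -> list[Action]:
    """Feed each frame with the instruction to the agent and collect its actions."""
    return [agent.act(Observation(image=f, instruction=instruction)) for f in frames]
```

Because the agent only ever sees pixels and text and only ever emits keyboard and mouse events, the same `Agent` contract could in principle be pointed at a new environment without an environment-specific API, which is the property the abstract emphasizes.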