iVideoGPT: Interactive VideoGPTs are Scalable World Models (2405.15223v3)
Abstract: World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, enabling agents to interact with the model via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that can be adapted to serve as an interactive world model for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.
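To make the core idea concrete, here is a minimal sketch of how visual observations, actions, and rewards could be flattened into a single token sequence for autoregressive next-token prediction. This is not the authors' implementation: the vocabulary layout, token offsets, and separator token are illustrative assumptions; in iVideoGPT the observation tokens would come from its compressive tokenizer.

```python
# Hypothetical token-id layout (assumptions for illustration only):
OBS_VOCAB = 1000        # assumed codebook size for visual observation tokens
ACT_TOKEN_BASE = 1000   # assumed offset for discretized action tokens
REW_TOKEN_BASE = 1100   # assumed offset for discretized reward tokens
SEP = 1200              # assumed frame-separator token


def interleave(obs_tokens, actions, rewards):
    """Flatten per-step (observation, action, reward) triples into one
    token stream that an autoregressive transformer can model with
    plain next-token prediction."""
    seq = []
    for o, a, r in zip(obs_tokens, actions, rewards):
        seq.extend(o)                   # tokenized visual observation
        seq.append(ACT_TOKEN_BASE + a)  # discretized action
        seq.append(REW_TOKEN_BASE + r)  # discretized reward
        seq.append(SEP)                 # frame boundary marker
    return seq


# Example: a 2-step trajectory, each observation compressed to 4 tokens
seq = interleave(
    obs_tokens=[[1, 2, 3, 4], [5, 6, 7, 8]],
    actions=[0, 1],
    rewards=[0, 0],
)
```

Interactivity then follows naturally: given the tokens of the current observation plus a candidate action token, the model rolls out the next observation and reward tokens one at a time.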