iVideoGPT: Interactive VideoGPTs are Scalable World Models (2405.15223v3)
Abstract: World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models to build world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a single sequence of tokens, enabling agents to interact with the model through next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that can be adapted to serve as an interactive world model for a wide range of downstream tasks, including action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance against state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.
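The abstract describes integrating observations, actions, and rewards into one token sequence consumed autoregressively. The following is a minimal sketch of what such interleaving could look like; the separator token ids, the per-step layout, and the helper names are illustrative assumptions, not the paper's actual vocabulary or implementation.

```python
# Hypothetical special tokens marking the start of each modality segment.
# Real systems would reserve ids outside the visual codebook range.
OBS_SEP = 1000
ACT_SEP = 1001
REW_SEP = 1002

def interleave(trajectory):
    """Flatten a trajectory of (obs_tokens, action_token, reward_token)
    steps into a single token sequence for next-token prediction."""
    seq = []
    for obs_tokens, action_tok, reward_tok in trajectory:
        seq.append(OBS_SEP)
        seq.extend(obs_tokens)   # discrete codes from a visual tokenizer
        seq.append(ACT_SEP)
        seq.append(action_tok)   # discretized action
        seq.append(REW_SEP)
        seq.append(reward_tok)   # discretized reward
    return seq

def next_token_pairs(seq):
    """(context, target) pairs for autoregressive training."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

# Two illustrative steps: 3 observation tokens, one action, one reward each.
traj = [([5, 7, 9], 42, 0), ([6, 7, 9], 43, 1)]
seq = interleave(traj)
pairs = next_token_pairs(seq)
```

Conditioning an agent's rollout then amounts to appending its chosen action token and sampling the next observation tokens from the model, which is what makes the sequence format interactive.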