WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens (2401.09985v1)
Abstract: World models play a crucial role in understanding and predicting the dynamics of the world, which is essential for video generation. However, existing world models are confined to specific scenarios such as gaming or driving, limiting their ability to capture the complexity of general, dynamic world environments. We therefore introduce WorldDreamer, a pioneering world model that fosters a comprehensive understanding of general world physics and motion, significantly enhancing the capabilities of video generation. Drawing inspiration from the success of LLMs, WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge: visual inputs are mapped to discrete tokens, and the masked ones are predicted. During this process, multi-modal prompts are incorporated to facilitate interaction within the world model. Our experiments show that WorldDreamer excels at generating videos across different scenarios, including natural scenes and driving environments, and is versatile across tasks such as text-to-video conversion, image-to-video synthesis, and video editing. These results underscore WorldDreamer's effectiveness in capturing dynamic elements within diverse general world environments.
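The core recipe described in the abstract (tokenize visual inputs into discrete ids, mask a subset, and train a transformer to recover the masked positions) can be made concrete with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than WorldDreamer's actual implementation: the module names (`MaskedTokenPredictor`, `mask_tokens`), codebook size, mask ratio, and model dimensions are invented for the example, and multi-modal prompt conditioning is omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical hyperparameters (not from the paper).
VOCAB_SIZE = 8192      # size of the discrete visual codebook (e.g. from a VQ tokenizer)
MASK_ID = VOCAB_SIZE   # reserved id for the [MASK] token
SEQ_LEN = 256          # number of visual tokens per clip, flattened
D_MODEL = 512

class MaskedTokenPredictor(nn.Module):
    """Transformer that predicts the original ids of masked visual tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE + 1, D_MODEL)  # +1 for [MASK]
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, D_MODEL))
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        h = self.embed(tokens) + self.pos
        return self.head(self.encoder(h))  # logits over the visual codebook

def mask_tokens(tokens, mask_ratio=0.5):
    """Randomly replace a fraction of tokens with [MASK]; return inputs and mask."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    inputs = tokens.clone()
    inputs[mask] = MASK_ID
    return inputs, mask

# One training step on random stand-in tokens. A real pipeline would feed ids
# produced by a visual tokenizer (e.g. a VQ-GAN-style encoder) instead.
model = MaskedTokenPredictor()
tokens = torch.randint(0, VOCAB_SIZE, (2, SEQ_LEN))
inputs, mask = mask_tokens(tokens)
logits = model(inputs)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # loss on masked positions only
loss.backward()
```

The key design point the sketch illustrates is that the loss is computed only over masked positions, so the model learns to infer missing visual content from the unmasked context, which is what enables tasks like image-to-video synthesis and video editing to be cast as different masking patterns.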