Photorealistic Video Generation with Diffusion Models (2312.06662v1)
Abstract: We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach is built on two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
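To make the window-attention design concrete, below is a minimal PyTorch sketch, not the authors' implementation: the `WindowAttentionBlock` class, the window size, and the `(B, T, H, W, C)` tensor layout are illustrative assumptions. The idea it demonstrates is the one the abstract names: spatial windows attend within a single frame, so the same layer applies unchanged to images and videos, while spatiotemporal windows attend across all frames within a small spatial patch, keeping attention cost bounded.

```python
import torch
import torch.nn as nn


class WindowAttentionBlock(nn.Module):
    """Self-attention restricted to a window of the (T, H, W) latent grid.

    spatial=True  -> each window covers one frame (1, H, W), so attention is
                     computed independently per frame; the layer can be trained
                     on both image and video latents.
    spatial=False -> each window spans all T frames over a small w x w spatial
                     patch, modeling motion at modest memory cost.
    """

    def __init__(self, dim: int, heads: int, spatial: bool, window: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.spatial = spatial
        self.window = window  # spatial extent of spatiotemporal windows (assumed value)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) latent tokens, e.g. from a causal video encoder.
        B, T, H, W, C = x.shape
        if self.spatial:
            # One window per frame: (B*T, H*W, C).
            win = x.reshape(B * T, H * W, C)
        else:
            # Spatiotemporal windows: full time extent over w x w patches.
            w = self.window
            win = (x.reshape(B, T, H // w, w, W // w, w, C)
                     .permute(0, 2, 4, 1, 3, 5, 6)
                     .reshape(-1, T * w * w, C))
        h = self.norm(win)
        win = win + self.attn(h, h, h, need_weights=False)[0]  # pre-norm residual
        if self.spatial:
            return win.reshape(B, T, H, W, C)
        w = self.window
        return (win.reshape(B, H // w, W // w, T, w, w, C)
                   .permute(0, 3, 1, 4, 2, 5, 6)
                   .reshape(B, T, H, W, C))


if __name__ == "__main__":
    # Toy forward pass alternating the two window types.
    x = torch.randn(2, 8, 16, 16, 64)  # (B, T, H, W, C) toy latents
    block_s = WindowAttentionBlock(64, 8, spatial=True)
    block_st = WindowAttentionBlock(64, 8, spatial=False, window=4)
    y = block_st(block_s(x))
    print(y.shape)  # torch.Size([2, 8, 16, 16, 64])
```

Because every attention call is over at most `H*W` or `T*w*w` tokens rather than the full `T*H*W` grid, memory grows linearly with video length, which is the efficiency motivation the abstract gives for the window design.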