
Lumiere: A Space-Time Diffusion Model for Video Generation (2401.12945v2)

Published 23 Jan 2024 in cs.CV

Abstract: We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.


Summary

  • The paper introduces a Space-Time U-Net architecture that generates the entire temporal duration of a video, at full frame rate and with consistent motion, in a single pass.
  • It employs MultiDiffusion for spatial super-resolution, ensuring smooth and artifact-free upscaling across temporal segments.
  • User studies show that Lumiere is preferred over baseline T2V models for temporal consistency and motion quality, and its design supports a wide range of video editing applications.

Lumiere: A Space-Time Diffusion Model for Video Generation

Introduction

"Lumiere: A Space-Time Diffusion Model for Video Generation" introduces a novel approach to synthesizing videos that depict realistic, diverse, and coherent motion using a text-to-video diffusion model. It addresses the fundamental challenge of achieving global temporal consistency in video synthesis by proposing a Space-Time U-Net architecture capable of generating the entire temporal duration of a video in a single model pass.

Space-Time U-Net Architecture

The Lumiere model employs a Space-Time U-Net (STUNet) architecture to handle video generation. The architecture processes the spatial and temporal dimensions jointly, enabling the generation of videos with consistent motion (Figure 1).

Figure 1: The STUNet inflates a pre-trained T2I U-Net into a Space-Time U-Net by incorporating both spatial and temporal down- and up-sampling modules.

Unlike existing models that synthesize sparse keyframes and then apply cascaded temporal super-resolution, an approach that makes global temporal consistency difficult to maintain, STUNet generates full-frame-rate videos end-to-end. This design choice, largely overlooked by previous methods, allows coherent motion to be synthesized over clips of up to 5 seconds at 16 fps.
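To make the idea of joint spatial and temporal down- and up-sampling concrete, here is a minimal, hypothetical sketch of a factorized space-time block in PyTorch. The layer sizes, activation, and pooling choices are illustrative assumptions rather than the paper's exact STUNet configuration; the sketch only shows how a per-frame spatial convolution, a per-pixel temporal convolution, and temporal downsampling compose so that a clip is processed at a coarser time scale.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """Illustrative factorized space-time block: a per-frame spatial convolution
    followed by a per-pixel temporal convolution, the kind of inflation used to
    turn a pre-trained T2I U-Net block into a space-time block. Channel counts,
    kernel sizes, and pooling are illustrative, not the paper's configuration."""

    def __init__(self, channels: int):
        super().__init__()
        # Spatial conv acts only on (H, W), applied independently to each frame.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal conv acts only on T, applied independently at each pixel.
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Temporal downsampling halves the number of frames, letting the network
        # reason about the clip at a coarser time scale deeper in the U-Net.
        self.time_down = nn.AvgPool3d(kernel_size=(2, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        x = torch.relu(self.spatial(x))
        x = torch.relu(self.temporal(x))
        return self.time_down(x)

# Example: an 8-frame 64x64 feature map is reduced to 4 frames.
feats = torch.randn(1, 32, 8, 64, 64)
print(FactorizedSpaceTimeBlock(32)(feats).shape)  # torch.Size([1, 32, 4, 64, 64])
```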

Video Generation Pipeline

Lumiere's video generation pipeline consists of two main components: a base model that generates low-resolution video clips at the full frame rate, and a spatial super-resolution (SSR) model that upscales these clips to high resolution (Figure 2).

Figure 2: The Lumiere pipeline, contrasting the common approach of cascaded temporal super-resolution (TSR) models with Lumiere's approach of processing all frames at once.

Because the SSR network operates on short temporal segments, Lumiere applies MultiDiffusion along the temporal axis: the predictions of overlapping windows are averaged so that the upscaled output remains smooth and free of boundary artifacts, even under complex motion.
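As an illustration of this temporal blending, the sketch below applies a stand-in SSR step to overlapping temporal windows and averages the results wherever windows overlap, in the spirit of MultiDiffusion along the time axis. The window and stride values and the `ssr_step` hook are assumptions made for the example, not the paper's actual settings.

```python
import torch

def blend_overlapping_windows(video: torch.Tensor, window: int = 8, stride: int = 4, ssr_step=None):
    """Illustrative MultiDiffusion-style blending along the time axis: a
    per-segment model is applied to overlapping temporal windows and the
    per-frame results are averaged where windows overlap, keeping neighbouring
    segments consistent. `ssr_step` stands in for one SSR denoising step."""
    t = video.shape[0]                      # video: (frames, C, H, W)
    out = torch.zeros_like(video)
    weight = torch.zeros(t, 1, 1, 1)
    for start in range(0, max(t - window, 0) + 1, stride):
        seg = video[start:start + window]
        seg = ssr_step(seg) if ssr_step is not None else seg
        out[start:start + seg.shape[0]] += seg
        weight[start:start + seg.shape[0]] += 1.0
    return out / weight.clamp(min=1.0)

# Example: 16 frames blended from overlapping 8-frame windows.
frames = torch.randn(16, 3, 128, 128)
print(blend_overlapping_windows(frames).shape)  # torch.Size([16, 3, 128, 128])
```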

Applications

Lumiere facilitates a range of video editing applications due to its versatile architecture, including:

  • Style-Driven Generation: By linearly interpolating between the pre-trained T2I weights and style fine-tuned weights, Lumiere produces videos in various artistic styles without compromising motion quality (Figure 3); a minimal sketch of this weight interpolation appears after the figure captions below.

    Figure 3: Stylized video generation showing Lumiere's ability to adapt to both vector-art and realistic styles.

  • Conditional Generation: Supports video generation based on conditions such as image input or masks, allowing for customized motion within specified regions.
  • Inpainting and Cinemagraphs: Supports video inpainting, animating masked regions while keeping them consistent with the surrounding content, as well as cinemagraphs, which animate a user-marked region while the rest of the frame stays static (Figures 4 and 5).

Figure 4: Video inpainting with Lumiere animates masked areas effectively while maintaining natural transitions.

Figure 5: Cinemagraphs animate specific marked areas while keeping the rest static.
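The style-driven generation bullet above relies on linearly interpolating model weights. Below is a minimal sketch of that mechanism, assuming both checkpoints are state dicts with matching keys; the interpolation coefficient `alpha` and the toy tensors are purely illustrative.

```python
import torch

def interpolate_weights(pretrained: dict, finetuned: dict, alpha: float) -> dict:
    """Illustrative linear interpolation between a pre-trained checkpoint and a
    style fine-tuned one. alpha=0 recovers the original weights, alpha=1 the
    fully stylized ones; intermediate values trade off style strength against
    the prior captured by the original weights."""
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# Example with toy tensors standing in for model parameters.
base = {"conv.weight": torch.zeros(4, 4)}
styled = {"conv.weight": torch.ones(4, 4)}
blended = interpolate_weights(base, styled, alpha=0.5)
print(blended["conv.weight"].mean())  # tensor(0.5000)
```

In practice, sweeping `alpha` gives a continuum between the fine-tuned style and the behavior of the original weights, which is what lets stylization be applied without retraining the video model.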

Evaluation and Comparisons

The model was trained on 30 million video-caption pairs and evaluated on a diverse set of text prompts using both zero-shot metrics and user studies. Lumiere demonstrated competitive performance against prominent T2V diffusion models, maintaining temporal consistency while generating larger motion magnitudes (Figure 6).

Figure 6: User study results showing that Lumiere is preferred over baseline methods for both text-to-video and image-to-video generation.

Conclusion

Lumiere sets a new direction in text-to-video generation by offering a framework capable of generating globally coherent motion without relying on cascaded temporal super-resolution models. The Space-Time U-Net architecture allows for more efficient handling of temporal data, providing promising results for various video content creation and editing tasks. These contributions advance the development of video generation models and present opportunities for further research in the domain of scalable video synthesis, including latent video diffusion models.

Given the model's ability to support a variety of downstream applications seamlessly, it represents a valuable tool for creative content generation, though ethical considerations around synthesizing realistic video remain crucial.
