Lumiere: A Space-Time Diffusion Model for Video Generation

(arXiv:2401.12945)
Published Jan 23, 2024 in cs.CV

Abstract

We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.

Figure: In contrast to common cascaded pipelines, Lumiere processes all frames of a video simultaneously, producing globally coherent motion.

Overview

  • Introduces Lumiere, a space-time diffusion model for video generation from text.

  • Utilizes a Space-Time U-Net (STUNet) architecture to generate full video sequences in a single pass.

  • Eliminates the need for temporal super-resolution by directly generating low-resolution, full-frame-rate videos, followed by spatial super-resolution.

  • Extends to applications such as image-to-video generation and video inpainting, and achieves competitive results on the UCF101 benchmark.

  • Improves temporal coherence in generated videos and paves the way for accessible content creation.

Introduction

The paper introduces Lumiere, a diffusion model for generating videos from textual descriptions. The model addresses a central challenge in video synthesis: producing videos that are not only photorealistic but also exhibit diverse, coherent motion over time. In contrast to prior models that render sparse, distant keyframes and subsequently fill in the gaps with temporal super-resolution, Lumiere employs a novel Space-Time U-Net (STUNet) architecture that generates an entire video sequence in a single network pass by down- and up-sampling in both space and time.

Architectural Overview

Lumiere's U-Net-like architecture is distinctive in that it down- and up-samples the signal in both space and time, so the full temporal duration of the video is handled within a single pass of the model. This design yields more globally coherent motion than prior cascaded approaches, which down-sample only spatially and therefore rely on separate temporal super-resolution stages. The absence of cascaded temporal super-resolution models from Lumiere's pipeline is what most clearly differentiates it from its contemporaries.
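
To make the space-time down- and up-sampling concrete, here is a minimal PyTorch sketch of the idea. The module names, channel counts, and use of plain 3D convolutions are illustrative assumptions, not Lumiere's exact configuration; the actual STUNet inflates a pre-trained text-to-image U-Net and uses factorized space-time layers.

```python
import torch
import torch.nn as nn

class SpaceTimeDown(nn.Module):
    """Compress T, H, and W by 2x with a single strided 3D convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, x):  # (B, C, T, H, W) -> (B, C', T/2, H/2, W/2)
        return self.conv(x)

class SpaceTimeUp(nn.Module):
    """Restore T, H, and W by 2x with a transposed 3D convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1)

    def forward(self, x):  # (B, C, T, H, W) -> (B, C', 2T, 2H, 2W)
        return self.conv(x)

# A 16-frame, 32x32 clip passes through the network in one shot: the whole
# temporal extent is compressed and restored, rather than split into segments.
x = torch.randn(1, 64, 16, 32, 32)
h = SpaceTimeDown(64, 128)(x)   # (1, 128, 8, 16, 16)
y = SpaceTimeUp(128, 64)(h)     # (1, 64, 16, 32, 32)
```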

Technical Contributions

The authors' core technical contribution is that Lumiere sidesteps temporal super-resolution modules by directly generating low-resolution videos at the full frame rate. A spatial super-resolution model is then applied over overlapping temporal windows, and the per-window predictions are blended with MultiDiffusion so that the synthesis stays coherent across the entire clip and no seams appear between segments. Additionally, Lumiere builds on a pre-trained text-to-image diffusion model, training only the newly added temporal layers while preserving the pre-trained model's strengths.
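
The windowed blending can be pictured with the following minimal sketch of MultiDiffusion-style averaging over overlapping temporal windows. The `denoise_window` callable is a hypothetical stand-in for one denoising step of the spatial super-resolution model, and the window and stride values are illustrative.

```python
import torch

def multidiffusion_step(video, denoise_window, window=16, stride=8):
    """One blended denoising step over overlapping temporal windows.

    video: (B, C, T, H, W) noisy input at the current diffusion step,
    with T assumed aligned so the windows tile the full clip.
    """
    T = video.shape[2]
    out = torch.zeros_like(video)
    weight = torch.zeros(1, 1, T, 1, 1)
    for start in range(0, max(T - window, 0) + 1, stride):
        end = start + window
        pred = denoise_window(video[:, :, start:end])  # per-window prediction
        out[:, :, start:end] += pred
        weight[:, :, start:end] += 1.0
    # Averaging where windows overlap discourages seams between segments.
    return out / weight.clamp(min=1.0)
```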

Applications and Evaluation

Beyond text-to-video generation, Lumiere supports image-to-video generation, style-conditioned generation, video inpainting, and more. The evaluation shows that the model generates videos with substantial motion dynamics while maintaining visual quality and adhering to the guiding text prompts. Comparative studies show that Lumiere achieves competitive FVD and IS scores on the UCF101 dataset, indicating that it generates realistic videos that align closely with human perception.
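
For context, FVD (Fréchet Video Distance) compares the distributions of features extracted from real and generated videos, typically with a pre-trained I3D network; lower is better. Below is a minimal sketch of the distance computation itself, assuming the feature matrices have already been extracted.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fit to (N, D) feature matrices."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```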

Conclusion

Lumiere establishes a pioneering approach to video generation that addresses long-standing challenges in temporal coherence. Through its single-pass space-time design and strong results, it sets a new benchmark in the field and opens up numerous creative applications, making content creation more accessible and versatile for users at various skill levels.
