Lumiere: A Space-Time Diffusion Model for Video Generation

(arXiv:2401.12945)
Published Jan 23, 2024 in cs.CV

Abstract

We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.

Figure: In contrast to common cascaded pipelines, Lumiere processes all frames of a video simultaneously, producing globally coherent motion.

Overview

  • Introduces Lumiere, a space-time diffusion model for video generation from text.

  • Utilizes a Space-Time U-Net (STUNet) architecture to generate full video sequences in a single pass.

  • Eliminates the need for temporal super-resolution by directly generating low-resolution, full-frame-rate videos, followed by spatial super-resolution.

  • Extends to applications such as image-to-video generation and video inpainting, and achieves competitive results on the UCF101 benchmark.

  • Improves temporal coherence in generated videos and paves the way for accessible content creation.

Introduction

The paper introduces Lumiere, a diffusion model for generating videos from textual descriptions. The model addresses a central challenge in video synthesis: producing videos that are not only photorealistic but also exhibit diverse, coherent motion over time. In contrast to prior models that render sparse, distant keyframes and subsequently fill in the gaps with temporal super-resolution, Lumiere employs a novel Space-Time U-Net (STUNet) architecture that generates an entire video sequence in a single network pass by down- and up-sampling in both space and time.

Architectural Overview

Lumiere's U-Net-like architecture is distinctive in that it down- and up-samples the signal in both space and time, so the full temporal duration of the video is handled within a single pass of the model. This design yields more globally coherent motion than prior cascaded approaches, which down-sample only spatially and therefore rely on separate temporal super-resolution stages. The absence of cascaded temporal super-resolution models from Lumiere's pipeline is what most clearly differentiates it from its contemporaries.
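
To make the space-time down- and up-sampling concrete, here is a minimal PyTorch sketch of the idea. The module names, channel counts, and use of plain 3D convolutions are illustrative assumptions, not Lumiere's exact configuration; the actual STUNet inflates a pre-trained text-to-image U-Net and uses factorized space-time layers.

```python
import torch
import torch.nn as nn

class SpaceTimeDown(nn.Module):
    """Compress T, H, and W by 2x with a single strided 3D convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, x):  # (B, C, T, H, W) -> (B, C', T/2, H/2, W/2)
        return self.conv(x)

class SpaceTimeUp(nn.Module):
    """Restore T, H, and W by 2x with a transposed 3D convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1)

    def forward(self, x):  # (B, C, T, H, W) -> (B, C', 2T, 2H, 2W)
        return self.conv(x)

# A 16-frame, 32x32 clip passes through the network in one shot: the whole
# temporal extent is compressed and restored, rather than split into segments.
x = torch.randn(1, 64, 16, 32, 32)
h = SpaceTimeDown(64, 128)(x)   # (1, 128, 8, 16, 16)
y = SpaceTimeUp(128, 64)(h)     # (1, 64, 16, 32, 32)
```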

Technical Contributions

The authors' core technical contribution is that Lumiere sidesteps temporal super-resolution modules by directly generating low-resolution videos at the full frame rate. A spatial super-resolution model is then applied over overlapping temporal windows, and the per-window predictions are blended with MultiDiffusion so that the synthesis stays coherent across the entire clip and no seams appear between segments. Additionally, Lumiere builds on a pre-trained text-to-image diffusion model, training only the newly added temporal layers while preserving the pre-trained model's strengths.
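
The windowed blending can be pictured with the following minimal sketch of MultiDiffusion-style averaging over overlapping temporal windows. The `denoise_window` callable is a hypothetical stand-in for one denoising step of the spatial super-resolution model, and the window and stride values are illustrative.

```python
import torch

def multidiffusion_step(video, denoise_window, window=16, stride=8):
    """One blended denoising step over overlapping temporal windows.

    video: (B, C, T, H, W) noisy input at the current diffusion step,
    with T assumed aligned so the windows tile the full clip.
    """
    T = video.shape[2]
    out = torch.zeros_like(video)
    weight = torch.zeros(1, 1, T, 1, 1)
    for start in range(0, max(T - window, 0) + 1, stride):
        end = start + window
        pred = denoise_window(video[:, :, start:end])  # per-window prediction
        out[:, :, start:end] += pred
        weight[:, :, start:end] += 1.0
    # Averaging where windows overlap discourages seams between segments.
    return out / weight.clamp(min=1.0)
```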

Applications and Evaluation

Beyond text-to-video generation, Lumiere supports image-to-video generation, style-conditioned generation, video inpainting, and more. The evaluation shows that the model generates videos with substantial motion dynamics while maintaining visual quality and adhering to the guiding text prompts. Comparative studies show that Lumiere achieves competitive FVD and IS scores on the UCF101 dataset, indicating that it generates realistic videos that align closely with human perception.
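
For context, FVD (Fréchet Video Distance) compares the distributions of features extracted from real and generated videos, typically with a pre-trained I3D network; lower is better. Below is a minimal sketch of the distance computation itself, assuming the feature matrices have already been extracted.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fit to (N, D) feature matrices."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```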

Conclusion

Lumiere establishes a pioneering approach to video generation that addresses long-standing challenges in temporal coherence. Through its single-pass space-time design and strong results, it sets a new benchmark in the field and opens up numerous creative applications, making content creation more accessible and versatile for users at various skill levels.
