
Abstract

Image-to-video generation, which aims to generate a video starting from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models into image-guided video generation models. Nevertheless, these methods often suffer from either low fidelity or flickering over time due to their shallow image guidance and poor temporal consistency. To tackle these problems, we propose DreamVideo, a high-fidelity image-to-video generation method that devises a frame retention branch on top of a pre-trained video diffusion model. Instead of integrating the reference image into the diffusion process at a semantic level, DreamVideo perceives the reference image via convolution layers and concatenates the resulting features with the noisy latents as model input. In this way, the details of the reference image can be preserved to the greatest extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed toward videos of different actions by providing varying text prompts. This has significant implications for controllable video generation and holds broad application prospects. We conduct comprehensive experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. In particular, our model has powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models. Precise control can also be achieved by giving different text prompts. Further details and comprehensive results are presented at https://anonymous0769.github.io/DreamVideo/.

Overview

  • DreamVideo is an advanced generative model that transforms still images into high-fidelity video clips while retaining image details.

  • The model builds on a pre-trained video diffusion model and adds a frame retention branch that preserves the reference image's details.

  • Double-condition classifier-free guidance lets users direct the generated action with text prompts.

  • Quantitative and qualitative assessments show DreamVideo's superiority over other models in image retention and content control.

  • The technology is relevant for applications in digital art, entertainment, and instructional video generation.

Introduction to DreamVideo

Generative modeling has made significant progress in creating realistic videos from still images, a task known as image-to-video generation. DreamVideo is an advanced model in this domain that aims to preserve the fidelity of a reference image when generating a video clip. Notably, the model offers not only high-fidelity transformations but also lets users direct the action within the video through textual descriptions.

Underlying Technology

DreamVideo is built on pre-existing video diffusion models, a class of probabilistic generative models. Diffusion models work by incrementally adding noise to data and then learning to reverse this process, essentially denoising, to generate new samples. DreamVideo preserves the details of the static input image throughout generation by adding a dedicated frame retention branch to its architecture. This branch passes the reference image through convolution layers and concatenates the resulting features with the noisy latents as model input, which helps preserve the original image details during video generation.
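To make this concrete, here is a minimal PyTorch sketch of how such a frame retention branch could be wired up. The class name, layer sizes, and the assumption that the encoder downsamples the image to the latent resolution are all illustrative; the paper's exact architecture may differ.

    import torch
    import torch.nn as nn

    class FrameRetentionBranch(nn.Module):
        """Illustrative sketch of a frame retention branch: encode the
        reference image with convolution layers and concatenate the
        features with the noisy latents along the channel dimension."""

        def __init__(self, image_channels=3, latent_channels=4):
            super().__init__()
            # Three stride-2 convolutions downsample a 512x512 image to
            # 64x64, matching a typical latent resolution (an assumption).
            self.encoder = nn.Sequential(
                nn.Conv2d(image_channels, 64, 3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(64, 64, 3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(64, latent_channels, 3, stride=2, padding=1),
            )

        def forward(self, reference_image, noisy_latents):
            # reference_image: (B, 3, H, W); noisy_latents: (B, T, C, h, w)
            features = self.encoder(reference_image)            # (B, C, h, w)
            t = noisy_latents.shape[1]
            features = features.unsqueeze(1).expand(-1, t, -1, -1, -1)
            # The denoiser now sees the reference image's features at
            # every frame, so fine details can survive the diffusion process.
            return torch.cat([noisy_latents, features], dim=2)  # (B, T, 2C, h, w)

In practice the video denoiser's input layer would then accept 2C channels instead of C, and the branch would be trained jointly with the diffusion model.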

Furthermore, DreamVideo incorporates double-condition classifier-free guidance. This technique gives the model finer control over the transformation, allowing a single static image to evolve into videos of different actions simply by altering the accompanying text prompt.
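A common way to implement guidance with two conditions, used for example in InstructPix2Pix, is to combine two guidance scales, one for the image and one for the text. The sketch below follows that scheme; the denoiser signature and scale values are assumptions, and DreamVideo's exact weighting may differ.

    def double_condition_cfg(denoiser, z_t, t, image_cond, text_cond,
                             s_img=2.0, s_txt=7.5):
        """Two-scale classifier-free guidance (hypothetical signature).
        denoiser(z_t, t, image_cond, text_cond) returns predicted noise;
        passing None stands in for the null condition dropped during
        training. The default scales are illustrative."""
        eps_uncond = denoiser(z_t, t, None, None)               # unconditional
        eps_img    = denoiser(z_t, t, image_cond, None)         # image only
        eps_full   = denoiser(z_t, t, image_cond, text_cond)    # image + text
        return (eps_uncond
                + s_img * (eps_img - eps_uncond)    # pull toward the reference image
                + s_txt * (eps_full - eps_img))     # pull toward the text prompt

Raising s_txt makes the prompt dominate, letting the same reference image yield different actions; raising s_img tightens image retention.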

Performance and Applications

Extensive experiments and comparisons with other state-of-the-art models demonstrate DreamVideo's capabilities. Measured with quantitative benchmarks such as Fréchet Video Distance (FVD) and with qualitative user studies, DreamVideo shows stronger image retention and finer control over video content. Its distinctive contributions are its ability to produce different videos from the same image under different text prompts and its robustness in maintaining image detail throughout the video generation process.
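For reference, FVD fits a Gaussian to feature vectors of real and of generated videos (extracted with a pretrained I3D classifier) and reports the Fréchet distance between the two distributions; lower is better. A minimal NumPy/SciPy sketch of that distance, with feature extraction omitted:

    import numpy as np
    from scipy import linalg

    def frechet_distance(feats_real, feats_gen):
        """Frechet distance between Gaussians fitted to two feature sets,
        each of shape (num_videos, feature_dim). For FVD the features come
        from a pretrained I3D network (extraction omitted here)."""
        mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
        sigma_r = np.cov(feats_real, rowvar=False)
        sigma_g = np.cov(feats_gen, rowvar=False)
        covmean = linalg.sqrtm(sigma_r @ sigma_g)
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # discard tiny imaginary parts from numerics
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))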

Conclusion

The DreamVideo model represents a notable step forward for image-to-video technology, offering flexible control over animation without sacrificing image quality. Its high fidelity and precise text-guided control broaden the possibilities for digital art, film, and entertainment, as well as for practical uses that require detailed video demonstrations from static images. DreamVideo sets a new benchmark for image-to-video generation models and signals exciting advances to come in this creative field.
