
Abstract

Image-to-video generation, which aims to generate a video starting from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models into image-guided video generation models. Nevertheless, these methods often suffer from either low fidelity or flickering over time due to their shallow image guidance and poor temporal consistency. To tackle these problems, we propose DreamVideo, a high-fidelity image-to-video generation method that devises a frame retention branch on top of a pre-trained video diffusion model. Instead of integrating the reference image into the diffusion process at a semantic level, DreamVideo perceives the reference image via convolution layers and concatenates the resulting features with the noisy latents as model input. In this way, the details of the reference image can be preserved to the greatest extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed toward videos of different actions by providing varying text prompts. This has significant implications for controllable video generation and holds broad application prospects. We conduct comprehensive experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. In particular, our model has powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models. Precise control can also be achieved by giving different text prompts. Further details and comprehensive results are presented at https://anonymous0769.github.io/DreamVideo/.

Overview

  • DreamVideo is an advanced generative model that transforms still images into high-fidelity video clips while retaining image details.

  • The model builds on a pre-trained video diffusion model and adds a frame retention branch that preserves the reference image's details.

  • Double-condition classifier-free guidance lets users direct the generated action with text prompts.

  • Quantitative and qualitative assessments show DreamVideo's superiority over other models in image retention and content control.

  • The technology is relevant for applications in digital art, entertainment, and instructional video generation.

Introduction to DreamVideo

Generative modeling has made significant progress in creating realistic videos from still images, a task known as image-to-video generation. DreamVideo is an advanced model in this domain that aims to preserve the fidelity of a reference image when generating a video clip. Notably, the model offers not only high-fidelity transformations but also lets users direct the action within the video through textual descriptions.

Underlying Technology

DreamVideo is built on pre-existing video diffusion models, a class of probabilistic generative models. Diffusion models work by incrementally adding noise to data and then learning to reverse this process, essentially denoising, to generate new samples. DreamVideo preserves the details of the static input image throughout generation by adding a dedicated frame retention branch to its architecture. This branch passes the reference image through convolution layers and concatenates the resulting features with the noisy latents as model input, which helps preserve the original image details during video generation.
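To make this concrete, here is a minimal PyTorch sketch of how such a frame retention branch could be wired up. The class name, layer sizes, and the assumption that the encoder downsamples the image to the latent resolution are all illustrative; the paper's exact architecture may differ.

    import torch
    import torch.nn as nn

    class FrameRetentionBranch(nn.Module):
        """Illustrative sketch of a frame retention branch: encode the
        reference image with convolution layers and concatenate the
        features with the noisy latents along the channel dimension."""

        def __init__(self, image_channels=3, latent_channels=4):
            super().__init__()
            # Three stride-2 convolutions downsample a 512x512 image to
            # 64x64, matching a typical latent resolution (an assumption).
            self.encoder = nn.Sequential(
                nn.Conv2d(image_channels, 64, 3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(64, 64, 3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(64, latent_channels, 3, stride=2, padding=1),
            )

        def forward(self, reference_image, noisy_latents):
            # reference_image: (B, 3, H, W); noisy_latents: (B, T, C, h, w)
            features = self.encoder(reference_image)            # (B, C, h, w)
            t = noisy_latents.shape[1]
            features = features.unsqueeze(1).expand(-1, t, -1, -1, -1)
            # The denoiser now sees the reference image's features at
            # every frame, so fine details can survive the diffusion process.
            return torch.cat([noisy_latents, features], dim=2)  # (B, T, 2C, h, w)

In practice the video denoiser's input layer would then accept 2C channels instead of C, and the branch would be trained jointly with the diffusion model.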

Furthermore, DreamVideo incorporates double-condition classifier-free guidance. This technique gives the model finer control over the transformation, allowing a single static image to evolve into videos of different actions simply by altering the accompanying text prompt.
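A common way to implement guidance with two conditions, used for example in InstructPix2Pix, is to combine two guidance scales, one for the image and one for the text. The sketch below follows that scheme; the denoiser signature and scale values are assumptions, and DreamVideo's exact weighting may differ.

    def double_condition_cfg(denoiser, z_t, t, image_cond, text_cond,
                             s_img=2.0, s_txt=7.5):
        """Two-scale classifier-free guidance (hypothetical signature).
        denoiser(z_t, t, image_cond, text_cond) returns predicted noise;
        passing None stands in for the null condition dropped during
        training. The default scales are illustrative."""
        eps_uncond = denoiser(z_t, t, None, None)               # unconditional
        eps_img    = denoiser(z_t, t, image_cond, None)         # image only
        eps_full   = denoiser(z_t, t, image_cond, text_cond)    # image + text
        return (eps_uncond
                + s_img * (eps_img - eps_uncond)    # pull toward the reference image
                + s_txt * (eps_full - eps_img))     # pull toward the text prompt

Raising s_txt makes the prompt dominate, letting the same reference image yield different actions; raising s_img tightens image retention.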

Performance and Applications

Extensive experiments and comparisons with other state-of-the-art models demonstrate DreamVideo's capabilities. Measured with quantitative benchmarks such as Fréchet Video Distance (FVD) and with qualitative user studies, DreamVideo shows stronger image retention and finer control over video content. Its distinctive contributions are its ability to produce different videos from the same image under different text prompts and its robustness in maintaining image detail throughout the video generation process.
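For reference, FVD fits a Gaussian to feature vectors of real and of generated videos (extracted with a pretrained I3D classifier) and reports the Fréchet distance between the two distributions; lower is better. A minimal NumPy/SciPy sketch of that distance, with feature extraction omitted:

    import numpy as np
    from scipy import linalg

    def frechet_distance(feats_real, feats_gen):
        """Frechet distance between Gaussians fitted to two feature sets,
        each of shape (num_videos, feature_dim). For FVD the features come
        from a pretrained I3D network (extraction omitted here)."""
        mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
        sigma_r = np.cov(feats_real, rowvar=False)
        sigma_g = np.cov(feats_gen, rowvar=False)
        covmean = linalg.sqrtm(sigma_r @ sigma_g)
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # discard tiny imaginary parts from numerics
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))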

Conclusion

The DreamVideo model represents a notable step forward for image-to-video technology, offering flexible control over animation without sacrificing image quality. Its high fidelity and precise text-guided control broaden the possibilities for digital art, film, and entertainment, as well as for practical uses that require detailed video demonstrations from static images. DreamVideo sets a new benchmark for image-to-video generation models and signals exciting advances to come in this creative field.
