
Abstract

Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community.

VideoCrafter1's framework trains a video UNet within an auto-encoder's latent space and controls motion speed via an FPS condition.
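As a rough illustration of this training pipeline, the sketch below shows one step of latent video diffusion training with text and FPS conditioning. It is a minimal sketch, not the released VideoCrafter1 code: the VAE and noise scheduler follow diffusers-style interfaces, while `text_encoder` and the `unet(..., context=..., fps=...)` signature are hypothetical stand-ins.

```python
# Minimal sketch of latent video diffusion training with FPS conditioning.
# Assumes diffusers-style vae/scheduler objects; text_encoder and the unet
# signature are hypothetical placeholders, not the released VideoCrafter1 API.
import torch
import torch.nn.functional as F

def training_step(vae, unet, text_encoder, scheduler, video, captions, fps):
    """video: (B, T, C, H, W) pixel frames; fps: (B,) frames-per-second values."""
    b, t, c, h, w = video.shape
    with torch.no_grad():
        # Encode every frame into the auto-encoder's latent space.
        latents = vae.encode(video.reshape(b * t, c, h, w)).latent_dist.sample()
        latents = latents.reshape(b, t, *latents.shape[1:]) * vae.config.scaling_factor
        text_emb = text_encoder(captions)  # assumed helper returning (B, L, D) embeddings

    # Standard denoising-diffusion objective: predict the added noise.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (b,), device=latents.device
    )
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)

    # The UNet is conditioned on the noise level, the caption, and the FPS value,
    # which acts as the handle on motion speed.
    pred = unet(noisy_latents, timesteps, context=text_emb, fps=fps)
    return F.mse_loss(pred, noise)
```

At inference time, varying the FPS condition offers a simple way to adjust how fast the generated motion appears within a fixed number of frames.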

Overview

  • VideoCrafter1 introduces Text-to-Video (T2V) and Image-to-Video (I2V) diffusion models from Tencent AI Lab for advanced video generation based on text and images.

  • The T2V model is capable of creating high-definition videos from text descriptions and is trained on large-scale image and video datasets.

  • The I2V model is the first open-source foundation model of its kind, transforming a given image into a video while preserving its content, structure, and style, filling a gap in open-source video generation resources.

  • Technical innovations include extending the Stable Diffusion architecture with temporal attention layers and using a hybrid training approach to maintain conceptual accuracy in video synthesis.

  • Despite current limitations such as the 2-second video duration, ongoing efforts aim to improve VideoCrafter1's capabilities, and the models have been open-sourced to foster community development.

Introduction to VideoCrafter1

Researchers from Tencent AI Lab and various universities have introduced two innovative diffusion models aimed at advancing high-quality video generation. These models, Text-to-Video (T2V) and Image-to-Video (I2V), bring new capabilities for video creation that can be instrumental for both academia and industry. The T2V model synthesizes videos based on textual input, while the I2V model can produce videos either from an image alone or by combining textual and visual inputs.

Diffusion Models for Video Generation

VideoCrafter1's T2V model marks a notable step forward, producing realistic, high-definition (1024×576) videos that surpass many open-source alternatives in quality. Its text-to-video synthesis is trained on a substantial corpus that includes LAION-COCO 600M, WebVid-10M, and a high-resolution video dataset of 10 million clips.

The I2V model, touted as the first of its kind in open-source platforms, can convert images into videos while strictly preserving their content and style. This development is particularly exciting as it addresses a gap in the current offerings of open-source video generation models, unlocking new avenues for technological progress within the community.

Technical Innovation

At its core, VideoCrafter1 leverages diffusion models that have proven successful in image generation. The T2V model, for instance, extends the Stable Diffusion architecture, incorporating temporal attention layers to capture consistency across video frames. It also employs a hybrid training strategy, mixing image and video data, that helps prevent the loss of conceptual accuracy.
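The sketch below shows the kind of factorized spatio-temporal attention this refers to: spatial self-attention mixes tokens within each frame, and temporal self-attention mixes the same spatial position across frames. It is a generic, minimal block under these assumptions, not the actual VideoCrafter1 layer implementation.

```python
# Illustrative factorized spatio-temporal attention block (not VideoCrafter1's code).
import torch
import torch.nn as nn
from einops import rearrange

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, HW, D) latent tokens for T frames with HW spatial positions each.
        b, t, hw, d = x.shape

        # Spatial attention: tokens within the same frame attend to each other.
        xs = rearrange(x, "b t hw d -> (b t) hw d")
        xs_n = self.norm_s(xs)
        attn_s, _ = self.spatial_attn(xs_n, xs_n, xs_n)
        x = x + rearrange(attn_s, "(b t) hw d -> b t hw d", b=b)

        # Temporal attention: each spatial position attends across frames,
        # which is what encourages frame-to-frame consistency.
        xt = rearrange(x, "b t hw d -> (b hw) t d")
        xt_n = self.norm_t(xt)
        attn_t, _ = self.temporal_attn(xt_n, xt_n, xt_n)
        x = x + rearrange(attn_t, "(b hw) t d -> b t hw d", b=b)
        return x
```

Factorizing attention this way keeps the spatial layers compatible with image-pretrained weights, while the newly added temporal layers learn cross-frame consistency.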

Moreover, the I2V model introduces a dedicated mechanism for integrating image prompts. It embeds the reference image with CLIP's image encoder and aligns the resulting embedding with the CLIP text embeddings, so that both conditions can be injected together and enhance the fidelity of the generated video content.
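A hedged sketch of this dual conditioning is given below: a global CLIP image embedding is projected into the text-embedding space and concatenated with the caption tokens to form the cross-attention context. The projector design, dimensions, and concatenation strategy are illustrative assumptions, not the paper's exact mechanism.

```python
# Illustrative dual text/image conditioning for cross-attention (assumed design).
import torch
import torch.nn as nn

class ImagePromptProjector(nn.Module):
    """Maps a global CLIP image embedding to a few pseudo text tokens (hypothetical)."""
    def __init__(self, image_dim: int = 1024, context_dim: int = 1024, num_tokens: int = 4):
        super().__init__()
        self.proj = nn.Linear(image_dim, num_tokens * context_dim)
        self.num_tokens = num_tokens
        self.context_dim = context_dim

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, image_dim) -> (B, num_tokens, context_dim)
        return self.proj(image_emb).view(-1, self.num_tokens, self.context_dim)

def build_context(text_tokens: torch.Tensor, image_emb: torch.Tensor,
                  projector: ImagePromptProjector) -> torch.Tensor:
    """text_tokens: (B, L, D) CLIP text embeddings. Returns the joint context that
    the UNet's cross-attention layers attend to."""
    image_tokens = projector(image_emb)                    # (B, num_tokens, D)
    return torch.cat([text_tokens, image_tokens], dim=1)   # (B, L + num_tokens, D)
```

Training the projector while keeping both CLIP encoders frozen is one common way such alignment is learned; the exact recipe used by VideoCrafter1 may differ.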

Implications and Future Work

By open-sourcing VideoCrafter1, the researchers have provided a foundation that could prove invaluable for further enhancements in the field of video generation. While the current models are limited to clips of about 2 seconds, ongoing efforts are expected to extend duration, improve resolution, and enhance motion quality. Continued work on temporal layers and spatial upscaling methods signals a promising trajectory for future advancements.

In conclusion, VideoCrafter1 presents remarkable progress in video generation technology. Its release not only demonstrates the capabilities of state-of-the-art AI but also invites broader participation from the research community, laying the groundwork for continuous evolution in this exciting field.
