Emergent Mind


Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community.

VideoCrafter1's framework trains a video UNet in the latent space of an auto-encoder and controls motion speed through an FPS condition.
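As a rough illustration of training in an auto-encoder's latent space, the sketch below shows one noise-prediction training step. It assumes the standard DDPM-style objective, which the summary does not confirm; `denoiser`, `alpha_bar_t`, and the way `fps` is passed in are hypothetical names chosen for this example, not the project's API. If the auto-encoder downsamples by 8x spatially, as is typical for Stable Diffusion (also an assumption here), a 1024x576 frame would correspond to a 128x72 latent.

```python
import numpy as np

def add_noise(latents, noise, alpha_bar_t):
    """Forward diffusion: blend clean latents with Gaussian noise.

    alpha_bar_t is the cumulative signal level in [0, 1] at the sampled step.
    """
    return np.sqrt(alpha_bar_t) * latents + np.sqrt(1.0 - alpha_bar_t) * noise

def training_loss(denoiser, latents, alpha_bar_t, fps, rng):
    """Mean squared error between the predicted and the true noise.

    latents: video latents from the (frozen) auto-encoder, e.g. shape
             (frames, channels, height, width).
    denoiser: hypothetical callable standing in for the video UNet; it
              receives the noisy latents, the noise level, and the FPS value.
    """
    noise = rng.standard_normal(latents.shape)
    noisy = add_noise(latents, noise, alpha_bar_t)
    predicted = denoiser(noisy, alpha_bar_t, fps)  # FPS acts as a condition
    return float(np.mean((predicted - noise) ** 2))
```

Passing `fps` alongside the noise level mirrors the idea that the network sees the frame rate during training, so a different value can be supplied at sampling time to change the apparent motion speed.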


  • VideoCrafter1 introduces Text-to-Video (T2V) and Image-to-Video (I2V) diffusion models from Tencent AI Lab for advanced video generation based on text and images.

  • The T2V model generates high-definition (1024x576) videos from text descriptions and is trained on a large combined image and video dataset.

  • The I2V model is presented as the first open-source foundation model to turn an image into a video that preserves the image's content, structure, and style, filling a gap in open-source video generation resources.

  • Technical innovations include extending the Stable Diffusion architecture with temporal attention layers and using a hybrid training approach so that learned concepts are not lost during video training.

  • Current limitations include a maximum video duration of 2 seconds. Work to extend the models is ongoing, and VideoCrafter1 has been open-sourced to foster community development.

Introduction to VideoCrafter1

Researchers from Tencent AI Lab and various universities have introduced two innovative diffusion models aimed at advancing high-quality video generation. These models, Text-to-Video (T2V) and Image-to-Video (I2V), bring new capabilities for video creation that can be instrumental for both academia and industry. The T2V model synthesizes videos based on textual input, while the I2V model can produce videos either from an image alone or by combining textual and visual inputs.

Diffusion Models for Video Generation

VideoCrafter1's T2V model produces realistic, high-definition (1024x576) videos that surpass many open-source alternatives in quality. It is trained on a large combined dataset: LAION COCO 600M, Webvid10M, and a high-resolution video dataset of 10 million clips.

The I2V model, touted as the first of its kind in open-source platforms, can convert images into videos while strictly preserving their content and style. This development is particularly exciting as it addresses a gap in the current offerings of open-source video generation models, unlocking new avenues for technological progress within the community.

Technical Innovation

At its core, VideoCrafter1 builds on diffusion models, which have already proven successful in image generation. The T2V model extends the Stable Diffusion architecture with temporal attention layers that let each frame attend to the other frames, which keeps content consistent across the video. It also employs a hybrid training strategy that helps prevent the model from losing concepts it has already learned.
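To make the idea of a temporal attention layer concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the project's implementation: it uses a single attention head, plain projection matrices (`w_q`, `w_k`, `w_v` are hypothetical parameters), and a residual connection. The key point is the reshape: every spatial location becomes its own short sequence over frames, so attention mixes information along time only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(latents, w_q, w_k, w_v):
    """Single-head self-attention across frames.

    latents: array of shape (frames, height, width, channels).
    w_q, w_k, w_v: (channels, channels) projection matrices.
    """
    t, h, w, c = latents.shape
    # One sequence of length t per spatial location: (h*w, t, c)
    tokens = latents.reshape(t, h * w, c).transpose(1, 0, 2)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Frame-to-frame attention scores per location: (h*w, t, t)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)
    attended = softmax(scores) @ v
    # Residual connection, then restore the (t, h, w, c) layout
    out = tokens + attended
    return out.transpose(1, 0, 2).reshape(t, h, w, c)
```

Because the layer only adds a residual term, setting the value projection to zero leaves the input unchanged. That property is one common reason such layers can be inserted into a pretrained image model without disturbing it at the start of training.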

The I2V model, in turn, handles image prompts alongside text prompts. It encodes the text with the CLIP text encoder and the reference image with the matching CLIP image encoder, so that the two embeddings can be aligned. This alignment improves how faithfully the generated video follows the reference image.
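One plausible way to use both embeddings is sketched below. This is an assumption made for illustration, since the summary does not spell out the mechanism: image tokens are first mapped into the text embedding space by a projection matrix, then the network features attend to text tokens and to the projected image tokens separately, and the two results are summed. The names `projection` and `image_scale` are hypothetical, and the random arrays in any usage stand in for real CLIP outputs.

```python
import numpy as np

def cross_attention(queries, context):
    """Single-head cross-attention; context serves as both keys and values."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

def condition_on_text_and_image(features, text_tokens, image_tokens,
                                projection, image_scale=1.0):
    """Inject text and image conditioning into network features.

    features:     (n, d) intermediate features of the denoising network.
    text_tokens:  (m, d) text embeddings (stand-in for CLIP text output).
    image_tokens: (p, d_img) image embeddings (stand-in for CLIP image output).
    projection:   (d_img, d) matrix aligning image tokens with the text space.
    """
    aligned_image_tokens = image_tokens @ projection
    text_term = cross_attention(features, text_tokens)
    image_term = cross_attention(features, aligned_image_tokens)
    return features + text_term + image_scale * image_term
```

Setting `image_scale` to 0 reduces this to text-only conditioning, which makes it easy to see how much the reference image contributes.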

Implications and Future Work

By open-sourcing VideoCrafter1, the researchers have provided a foundation for further work on video generation. The current models are limited to a maximum duration of 2 seconds, and ongoing efforts are expected to extend this duration, improve resolution, and enhance motion quality. Collaborations and planned improvements to the temporal layers and to spatial upscaling methods point to where future progress is likely to come from.

In conclusion, VideoCrafter1 presents remarkable progress in video generation technology. Its release not only demonstrates the capabilities of state-of-the-art AI but also invites broader participation from the research community, laying the groundwork for continuous evolution in this exciting field.
