
Abstract

Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. In contrast, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to the insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method that elevates the performance of T2V using the superior capabilities of T2I. Unlike conventional T2V sampling (i.e., joint temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses the encapsulated T2V to enhance temporal consistency, followed by inversion to the noise distribution required by T2I. Spatial quality elevating then harnesses the inflated T2I to directly predict a less noisy latent, adding more photo-realistic details. We conduct experiments over extensive prompts and various combinations of T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code is available at https://github.com/YBYBZhang/VideoElevator.

VideoElevator enhances text-to-video generation by refining temporal motion and elevating spatial quality.

Overview

  • VideoElevator is a plug-and-play method that enhances video generation quality by integrating text-to-image diffusion models to address frame quality and temporal consistency issues.

  • It decomposes each sampling step into two sub-steps: temporal motion refining, which applies a low-pass frequency filter to reduce flicker and preserve natural motion, and spatial quality elevating, which uses cross-frame attention to infuse frames with high-quality details.

  • Empirical evaluations show that VideoElevator improves various text-to-video baselines when combined with text-to-image models such as Stable Diffusion V1.5 and V2.1-base, enhancing frame quality and prompt consistency, and improving style fidelity with personalized T2I models.

  • VideoElevator sets a new standard for text-to-video generation with rich detail and temporal consistency, promising future advancements in generative AI.

Enhancing Text-to-Video Generation with VideoElevator: A Dive into Versatile Text-to-Image Integration

Introduction to VideoElevator

The integration of text-to-image (T2I) diffusion models into video generation has presented a promising avenue for addressing the lingering issue of low frame quality in text-to-video (T2V) diffusion models. Building on this concept, we introduce VideoElevator, a training-free and plug-and-play method designed to significantly enhance the quality of videos generated by existing T2V models through the superior capabilities of T2I diffusion models. VideoElevator stands out by explicitly decomposing each sampling step into two distinct components: temporal motion refining and spatial quality elevating. This decomposition integrates T2I's impressive frame quality into video generation while ensuring both temporal consistency and high-definition spatial detail.
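To make the decomposition concrete, below is a minimal sketch of the decomposed sampling loop in PyTorch. The function names, tensor shapes, and callable-based interface are our assumptions for illustration, not the authors' actual API; the two sub-steps are sketched in the sections that follow.

```python
import torch

def videoelevator_sample(init_noise, timesteps, refine_motion, elevate_quality):
    """Decomposed sampling loop: each step first refines temporal motion with
    the encapsulated T2V model, then elevates spatial quality with the inflated
    T2I model. `refine_motion` and `elevate_quality` are callables implementing
    the two sub-steps (hypothetical interface)."""
    latent = init_noise
    for t in timesteps:
        latent = refine_motion(latent, t)    # T2V: temporal consistency, then inversion
        latent = elevate_quality(latent, t)  # T2I: directly predict a less noisy latent
    return latent

# Toy usage with identity sub-steps, just to show the control flow:
video_latent = videoelevator_sample(
    torch.randn(1, 4, 16, 64, 64),  # (batch, channels, frames, height, width)
    timesteps=range(50, 0, -1),
    refine_motion=lambda z, t: z,
    elevate_quality=lambda z, t: z,
)
```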

Temporal Motion Refining

Temporal motion refining focuses on enhancing the temporal consistency of the video. The process begins by applying a Low-Pass Frequency Filter (LPFF) along the temporal dimension of the video latent to mitigate high-frequency flicker. The filtered latent is then passed through a T2V-based SDEdit process to incorporate fine-grained motion, followed by deterministic inversion to the noise distribution required by the T2I model for further processing. This step is crucial for preserving the natural motion of the video, a challenge that many existing models struggle to fully overcome.
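As a rough illustration of the filtering idea, the sketch below low-pass filters a video latent along its frame axis via an FFT. The function name, the `keep_ratio` knob, and the cutoff scheme are our assumptions; the paper's LPFF may differ, and the subsequent T2V SDEdit and deterministic (e.g., DDIM-style) inversion steps are omitted here.

```python
import torch

def temporal_low_pass(latent: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Suppress high-frequency temporal flicker by keeping only the lowest
    temporal frequencies of a video latent shaped (batch, channels, frames, H, W).
    `keep_ratio` (hypothetical knob) controls how much of the band survives."""
    freq = torch.fft.fft(latent, dim=2)  # FFT over the frame axis
    f = latent.shape[2]
    cutoff = max(1, int(f * keep_ratio))
    mask = torch.zeros(f, device=latent.device)
    mask[:cutoff] = 1.0          # DC component plus low positive frequencies
    mask[f - cutoff + 1:] = 1.0  # mirrored negative frequencies
    return torch.fft.ifft(freq * mask.view(1, 1, f, 1, 1), dim=2).real

smoothed = temporal_low_pass(torch.randn(1, 4, 16, 64, 64))
```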

Spatial Quality Elevating

Following the temporal refinement, spatial quality elevating infuses the video with high-quality details by leveraging the capabilities of T2I models. This is achieved by inflating the self-attention mechanism of the T2I model into cross-frame attention, thereby ensuring appearance consistency across video frames. The result is a series of frames that are not only temporally coherent but also rich in detail, closely matching the high-quality output characteristic of T2I-generated images.
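The "inflation" described above can be approximated by rewiring a T2I self-attention layer so that every frame attends to a shared anchor frame. The sketch below assumes the (batch * frames, tokens, dim) layout common to frame-wise U-Nets and uses frame 0 as the anchor; both choices are our assumptions, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def cross_frame_attention(q, k, v, num_frames):
    """Inflated self-attention: queries from every frame attend to the keys
    and values of an anchor frame (frame 0 here), tying appearance across
    frames. Inputs are shaped (batch * num_frames, tokens, dim)."""
    bf, tokens, dim = k.shape
    b = bf // num_frames
    k = k.view(b, num_frames, tokens, dim)
    v = v.view(b, num_frames, tokens, dim)
    # Broadcast the anchor frame's keys/values to all frames.
    k0 = k[:, :1].expand(-1, num_frames, -1, -1).reshape(bf, tokens, dim)
    v0 = v[:, :1].expand(-1, num_frames, -1, -1).reshape(bf, tokens, dim)
    return F.scaled_dot_product_attention(q, k0, v0)

# 16 frames, 256 spatial tokens, 64 channels per head:
q = torch.randn(16, 256, 64)
out = cross_frame_attention(q, q.clone(), q.clone(), num_frames=16)
```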

Achievements and Evaluation

Empirical evaluations demonstrate VideoElevator's ability to significantly improve various T2V baselines when combined with different T2I models. With foundational T2I models such as Stable Diffusion V1.5 and V2.1-base, VideoElevator enhances frame quality and consistency with user-provided prompts. With personalized T2I models, it reproduces diverse styles and aesthetic preferences more faithfully, surpassing existing alternatives such as AnimateDiff in style fidelity and detail richness.

Future Implications in AI

The introduction of VideoElevator signifies a leap forward in the quest for high-quality text-to-video generation. By meticulously breaking down the sampling step and effectively employing both T2V and T2I models, VideoElevator has set a new standard for generating videos that are not only rich in detail but also exhibit impressive temporal consistency. Looking forward, the potential for further exploration and advancement in combining these models promises exciting developments in the realm of generative AI, where the lines between reality and AI-generated content continue to blur.

In conclusion, VideoElevator heralds a new era in text-to-video generation, leveraging the strengths of text-to-image diffusion models to address critical challenges in video quality and consistency. Its successful integration of temporal motion refining and spatial quality elevating underlines the potential of methodical decomposition in enhancing generative models. As research progresses, the methodologies embedded within VideoElevator may well pave the way for future innovations in the dynamic and ever-evolving field of generative AI.
