
Abstract

Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness of diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latents across entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods on both synthetic and real-world benchmarks, as well as on AI-generated videos, showcasing impressive visual realism and temporal consistency.

Overview

  • The paper introduces 'Upscale-A-Video,' a text-guided latent diffusion framework for video super-resolution that ensures high fidelity and temporal consistency.

  • The framework features a local-global temporal strategy, fine-tuning a pretrained image upscaling model with added 3D convolutions and temporal attention layers.

  • A flow-guided recurrent latent propagation module operates bidirectionally to achieve stability over longer video sequences without additional training.

  • Text prompts and noise levels during inference provide user control over texture generation and detail refinement.

  • The method surpasses existing VSR methods in experimental benchmarks and offers practical applications for enhancing real-world video quality.

Real-World Video Super-Resolution Enhanced by Text-Guided Diffusion

Introduction

In video super-resolution (VSR), the goal is to produce temporally consistent, high-quality videos from low-quality inputs. Traditional methods struggle to generate realistic textures and details because they are typically trained on synthetic degradations or camera-specific artifacts, which generalize poorly to diverse real-world footage. Recent diffusion models hold great promise thanks to their generative capabilities, but their stochastic nature makes them hard to apply to VSR, often leading to temporal inconsistencies.

Overcoming Temporal Inconsistency

The paper presents 'Upscale-A-Video,' a novel text-guided latent diffusion framework designed to upscale videos with high fidelity and temporal consistency. The solution integrates a local-global temporal strategy tailored to video data. Locally, it finetunes a pretrained ×4 image upscaling model with additional temporal layers comprising 3D convolutions and temporal attention. Globally, it introduces a flow-guided recurrent latent propagation module to maintain consistency across longer sequences. This training-free module operates bidirectionally, propagating latents forward and backward through the sequence under optical-flow guidance and fusing them to stabilize the whole video.
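The summary is prose-only, so the following is a minimal PyTorch-style sketch of what one forward pass of such flow-guided latent propagation could look like. The function names, the `beta` fusion weight, and the precomputed flow and validity-mask inputs are illustrative assumptions, not the authors' actual interface:

```python
import torch
import torch.nn.functional as F

def warp(latent: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a latent (B, C, H, W) with an optical flow (B, 2, H, W)."""
    b, _, h, w = latent.shape
    # Build a base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=latent.device),
        torch.arange(w, device=latent.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float()            # (2, H, W)
    coords = grid.unsqueeze(0) + flow                      # displaced coordinates
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(latent, grid_norm, align_corners=True)

def propagate_forward(latents, flows, valid_masks, beta=0.5):
    """One forward pass of recurrent latent propagation.

    latents:     list of per-frame diffusion latents, each (B, C, H, W)
    flows:       flows[t] maps frame t back to frame t-1, (B, 2, H, W)
    valid_masks: valid_masks[t] in {0, 1}, 1 where the flow is trustworthy,
                 (B, 1, H, W)
    beta:        fusion weight between the current and propagated latent
    """
    out = [latents[0]]
    for t in range(1, len(latents)):
        warped = warp(out[t - 1], flows[t])  # bring the t-1 latent to frame t
        m = valid_masks[t]
        # Fuse only where the flow is valid; keep the original latent elsewhere.
        fused = m * ((1 - beta) * latents[t] + beta * warped) + (1 - m) * latents[t]
        out.append(fused)
    return out
```

A matching backward pass runs the same recurrence from the last frame to the first, and the validity mask (typically obtained from a forward-backward flow consistency check) leaves occluded or unreliably tracked regions untouched.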

Versatility and User Control

The authors further extend their approach by using text prompts and noise levels as additional conditions during inference. Text prompts steer the model toward specific textures, such as animal fur or the look of an oil painting. Adjusting the injected noise level trades restoration strength against the generation of refined details. Classifier-free guidance is also adopted, substantially amplifying the effect of both conditions and thus refining video quality.
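As a rough illustration of the guidance step, the sketch below shows the standard classifier-free guidance combination of conditional and unconditional noise predictions, with the noise level passed as an extra condition. The `unet` call signature and the `noise_level` argument are placeholders rather than the paper's actual API:

```python
import torch

@torch.no_grad()
def guided_eps(unet, z_t, t, text_emb, null_emb, noise_level, scale=7.5):
    """One classifier-free-guided noise prediction.

    z_t:         noisy video latent at timestep t
    text_emb:    embedding of the user prompt (e.g. "clean oil painting")
    null_emb:    embedding of the empty prompt (unconditional branch)
    noise_level: amount of noise injected into the low-res condition;
                 higher values favor generation, lower values favor restoration
    scale:       guidance scale; values > 1 strengthen the prompt's effect
    """
    eps_uncond = unet(z_t, t, encoder_hidden_states=null_emb, noise_level=noise_level)
    eps_cond = unet(z_t, t, encoder_hidden_states=text_emb, noise_level=noise_level)
    # Extrapolate from the unconditional prediction toward the conditional one.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

The same formula underlies guidance in most text-to-image diffusion pipelines; here it strengthens both the prompt and the noise-level condition at once.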

Experimental Success

Extensive experiments show that 'Upscale-A-Video' outperforms current methods on synthetic, real-world, and AI-generated video benchmarks, displaying exceptional visual realism and temporal consistency. Quantitative metrics such as PSNR, SSIM, and LPIPS confirm the framework's superior restoration ability. Qualitative analyses further underscore its impressive detail recovery and realistic texture generation, with text prompts and optional noise adjustment used to balance fidelity and quality.
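For readers who want to reproduce such numbers, PSNR, SSIM, and LPIPS are standard full-reference metrics; a per-frame computation with the commonly used scikit-image and lpips packages might look like the sketch below (generic evaluation code, not the authors' pipeline):

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance

def frame_metrics(pred, gt):
    """pred, gt: HxWx3 uint8 numpy arrays for one video frame."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    return psnr, ssim, lpips_fn(to_t(pred), to_t(gt)).item()
```

Video-level scores are then obtained by averaging over all frames of all test clips.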

Conclusion

'Upscale-A-Video' marks a significant advance in real-world VSR, successfully employing a text-guided latent diffusion model to improve temporal coherence and detail generation. Its methodology offers a robust foundation for future VSR work, particularly in real-world scenarios where temporal consistency and visual realism are crucial.
