
Abstract

Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness of diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latents across entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods on both synthetic and real-world benchmarks, as well as on AI-generated videos, showcasing impressive visual realism and temporal consistency.

Overview

  • The paper introduces 'Upscale-A-Video,' a text-guided latent diffusion framework for video super-resolution that ensures high fidelity and temporal consistency.

  • The framework features a local-global temporal strategy, fine-tuning a pretrained image upscaling model with added 3D convolutions and temporal attention layers.

  • A flow-guided recurrent latent propagation module operates bidirectionally to achieve stability over longer video sequences without additional training.

  • Text prompts and noise levels during inference provide user control over texture generation and detail refinement.

  • The method surpasses existing VSR methods in experimental benchmarks and offers practical applications for enhancing real-world video quality.

Real-World Video Super-Resolution Enhanced by Text-Guided Diffusion

Introduction

In video super-resolution (VSR), the goal is to produce temporally consistent, high-quality videos from low-quality inputs. Traditional methods struggle to generate realistic textures and details because they are typically trained on synthetic degradations or camera-specific artifacts, which generalize poorly to diverse real-world footage. Recent diffusion models hold great promise thanks to their generative capabilities, but their stochastic nature makes them hard to apply to VSR, often leading to temporal inconsistencies.

Overcoming Temporal Inconsistency

The paper presents 'Upscale-A-Video,' a novel text-guided latent diffusion framework designed to upscale videos with high fidelity and temporal consistency. The solution integrates a local-global temporal strategy tailored to video data. Locally, it finetunes a pretrained ×4 image upscaling model with additional temporal layers comprising 3D convolutions and temporal attention. Globally, it introduces a flow-guided recurrent latent propagation module to maintain consistency across longer sequences. This training-free module operates bidirectionally, propagating latents forward and backward through the sequence under optical-flow guidance and fusing them to stabilize the whole video.
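The summary is prose-only, so the following is a minimal PyTorch-style sketch of what one forward pass of such flow-guided latent propagation could look like. The function names, the `beta` fusion weight, and the precomputed flow and validity-mask inputs are illustrative assumptions, not the authors' actual interface:

```python
import torch
import torch.nn.functional as F

def warp(latent: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a latent (B, C, H, W) with an optical flow (B, 2, H, W)."""
    b, _, h, w = latent.shape
    # Build a base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=latent.device),
        torch.arange(w, device=latent.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float()            # (2, H, W)
    coords = grid.unsqueeze(0) + flow                      # displaced coordinates
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(latent, grid_norm, align_corners=True)

def propagate_forward(latents, flows, valid_masks, beta=0.5):
    """One forward pass of recurrent latent propagation.

    latents:     list of per-frame diffusion latents, each (B, C, H, W)
    flows:       flows[t] maps frame t back to frame t-1, (B, 2, H, W)
    valid_masks: valid_masks[t] in {0, 1}, 1 where the flow is trustworthy,
                 (B, 1, H, W)
    beta:        fusion weight between the current and propagated latent
    """
    out = [latents[0]]
    for t in range(1, len(latents)):
        warped = warp(out[t - 1], flows[t])  # bring the t-1 latent to frame t
        m = valid_masks[t]
        # Fuse only where the flow is valid; keep the original latent elsewhere.
        fused = m * ((1 - beta) * latents[t] + beta * warped) + (1 - m) * latents[t]
        out.append(fused)
    return out
```

A matching backward pass runs the same recurrence from the last frame to the first, and the validity mask (typically obtained from a forward-backward flow consistency check) leaves occluded or unreliably tracked regions untouched.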

Versatility and User Control

The authors further extend their approach by using text prompts and noise levels as additional conditions during inference. Text prompts steer the model toward specific textures, such as animal fur or the look of an oil painting. Adjusting the injected noise level trades restoration strength against the generation of refined details. Classifier-free guidance is also adopted, substantially amplifying the effect of both conditions and thus refining video quality.
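As a rough illustration of the guidance step, the sketch below shows the standard classifier-free guidance combination of conditional and unconditional noise predictions, with the noise level passed as an extra condition. The `unet` call signature and the `noise_level` argument are placeholders rather than the paper's actual API:

```python
import torch

@torch.no_grad()
def guided_eps(unet, z_t, t, text_emb, null_emb, noise_level, scale=7.5):
    """One classifier-free-guided noise prediction.

    z_t:         noisy video latent at timestep t
    text_emb:    embedding of the user prompt (e.g. "clean oil painting")
    null_emb:    embedding of the empty prompt (unconditional branch)
    noise_level: amount of noise injected into the low-res condition;
                 higher values favor generation, lower values favor restoration
    scale:       guidance scale; values > 1 strengthen the prompt's effect
    """
    eps_uncond = unet(z_t, t, encoder_hidden_states=null_emb, noise_level=noise_level)
    eps_cond = unet(z_t, t, encoder_hidden_states=text_emb, noise_level=noise_level)
    # Extrapolate from the unconditional prediction toward the conditional one.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

The same formula underlies guidance in most text-to-image diffusion pipelines; here it strengthens both the prompt and the noise-level condition at once.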

Experimental Success

Extensive experiments show that 'Upscale-A-Video' outperforms current methods on synthetic, real-world, and AI-generated video benchmarks, displaying exceptional visual realism and temporal consistency. Quantitative metrics such as PSNR, SSIM, and LPIPS confirm the framework's superior restoration ability. Qualitative analyses further underscore its impressive detail recovery and realistic texture generation, with text prompts and optional noise adjustment used to balance fidelity and quality.
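For readers who want to reproduce such numbers, PSNR, SSIM, and LPIPS are standard full-reference metrics; a per-frame computation with the commonly used scikit-image and lpips packages might look like the sketch below (generic evaluation code, not the authors' pipeline):

```python
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance

def frame_metrics(pred, gt):
    """pred, gt: HxWx3 uint8 numpy arrays for one video frame."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    return psnr, ssim, lpips_fn(to_t(pred), to_t(gt)).item()
```

Video-level scores are then obtained by averaging over all frames of all test clips.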

Conclusion

'Upscale-A-Video' marks a significant advance in real-world VSR, successfully employing a text-guided latent diffusion model to improve temporal coherence and detail generation. Its methodology offers a robust foundation for future VSR work, particularly in real-world scenarios where temporal consistency and visual realism are crucial.
