
Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression

(2207.05894)
Published Jul 13, 2022 in eess.IV, cs.CV, and cs.MM

Abstract

For a neural video codec, it is critical, yet challenging, to design an efficient entropy model that can accurately predict the probability distribution of the quantized latent representation. However, most existing video codecs directly reuse a ready-made entropy model from image codecs to encode the residual or motion, and do not fully exploit the spatial-temporal characteristics of video. To this end, this paper proposes a powerful entropy model which efficiently captures both spatial and temporal dependencies. In particular, we introduce a latent prior which exploits the correlation among latent representations to squeeze out temporal redundancy. Meanwhile, a dual spatial prior is proposed to reduce spatial redundancy in a parallel-friendly manner. In addition, our entropy model is versatile: besides estimating the probability distribution, it also generates the quantization step in a spatial-channel-wise manner. This content-adaptive quantization mechanism not only enables smooth rate adjustment within a single model but also improves the final rate-distortion performance through dynamic bit allocation. Experimental results show that, powered by the proposed entropy model, our neural codec achieves an 18.2% bitrate saving on the UVG dataset compared with H.266 (VTM) under its highest-compression-ratio configuration, a new milestone in the development of neural video codecs. The code is at https://github.com/microsoft/DCVC.
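The abstract describes the core mechanism at a high level: priors are fused to predict the distribution of the quantized latent, and the same prediction network also emits a spatial-channel-wise quantization step used for content-adaptive quantization. Below is a minimal PyTorch-style sketch of that idea; all names (HybridEntropyParams, quantize), channel counts, and the network depth are hypothetical illustrations, not the actual DCVC implementation.

```python
# A minimal sketch of hybrid prior fusion and content-adaptive quantization,
# as outlined in the abstract. Module/parameter names are hypothetical.
import torch
import torch.nn as nn


class HybridEntropyParams(nn.Module):
    """Fuse the temporal (latent) prior and the spatial prior into Gaussian
    parameters plus a spatial-channel-wise quantization step."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 3, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels * 3, channels * 3, kernel_size=3, padding=1),
        )

    def forward(self, temporal_prior, spatial_prior):
        fused = self.fusion(torch.cat([temporal_prior, spatial_prior], dim=1))
        mean, scale, q_step = fused.chunk(3, dim=1)
        scale = torch.exp(scale)    # positive std-dev for the Gaussian model
        q_step = torch.exp(q_step)  # positive step, varying per position and channel
        return mean, scale, q_step


def quantize(latent, mean, q_step):
    # Content-adaptive quantization: a larger step spends fewer bits on that
    # element, which is what allows dynamic bit allocation and smooth rate
    # adjustment from a single model.
    return torch.round((latent - mean) / q_step) * q_step + mean
```

In the actual codec the predicted mean and scale would drive an arithmetic coder, and the "dual" spatial prior would supply `spatial_prior` in two parallel passes over complementary halves of the latent; this sketch only shows the parameter-fusion and quantization step described in the abstract.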
