Deep Contextual Video Compression (2109.15047v2)

Published 30 Sep 2021 in eess.IV, cs.CV, and cs.MM

Abstract: Most of the existing neural video compression methods adopt the predictive coding framework, which first generates the predicted frame and then encodes its residue with the current frame. However, as for compression ratio, predictive coding is only a sub-optimal solution as it uses simple subtraction operation to remove the redundancy across frames. In this paper, we propose a deep contextual video compression framework to enable a paradigm shift from predictive coding to conditional coding. In particular, we try to answer the following questions: how to define, use, and learn condition under a deep video compression framework. To tap the potential of conditional coding, we propose using feature domain context as condition. This enables us to leverage the high dimension context to carry rich information to both the encoder and the decoder, which helps reconstruct the high-frequency contents for higher video quality. Our framework is also extensible, in which the condition can be flexibly designed. Experiments show that our method can significantly outperform the previous state-of-the-art (SOTA) deep video compression methods. When compared with x265 using veryslow preset, we can achieve 26.0% bitrate saving for 1080P standard test videos.

Authors (3)
  1. Jiahao Li (80 papers)
  2. Bin Li (514 papers)
  3. Yan Lu (179 papers)
Citations (225)

Summary

  • The paper introduces conditional coding that replaces traditional residue coding with learned contextual features for more adaptive video encoding.
  • It proposes a flexible framework that unifies encoding, decoding, and entropy modeling by leveraging high-dimensional spatial-temporal correlations.
  • Experimental results show a 26.0% bitrate reduction on 1080P videos compared to x265, underscoring its practical compression benefits.

A Review of "Deep Contextual Video Compression"

Video compression technology has evolved steadily, most recently through a shift from traditional predictive coding paradigms toward learning-based methods. The paper "Deep Contextual Video Compression," authored by Jiahao Li, Bin Li, and Yan Lu, makes a significant stride in this direction by presenting a deep video compression framework that challenges the conventionally adopted predictive coding approach. By introducing conditional coding, the authors propose a method intended to move beyond the limitations of simple residue coding by exploiting high-dimensional feature-domain conditions.

Core Insights and Contributions

The paper opens with a critique of traditional video coding methods, which are dominated by predictive coding frameworks that rely on residue coding via subtraction. The underlying premise is that residue coding, although effective at exploiting temporal correlations across frames, is not optimal. The proposed Deep Contextual Video Compression (DCVC) framework instead performs conditional coding, using a feature-domain context as the condition; this carries richer information to both the encoder and decoder, particularly about high-frequency content, and improves reconstructed video quality.
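To make the contrast concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the class names, layer shapes, and the assumption of a 64-channel context at frame resolution are all illustrative. Residue coding transmits a subtraction result, while conditional coding lets learned layers decide what to transmit given the context.

```python
import torch
import torch.nn as nn


class ResidueCodec(nn.Module):
    """Predictive coding: transmit the pixel-domain residue x_t - x_pred."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2)
        self.decoder = nn.ConvTranspose2d(64, 3, kernel_size=5, stride=2,
                                          padding=2, output_padding=1)

    def forward(self, x_t, x_pred):
        residue = x_t - x_pred               # simple subtraction
        latent = self.encoder(residue)       # quantization/entropy coding omitted
        return x_pred + self.decoder(latent)


class ConditionalCodec(nn.Module):
    """Conditional coding: encode x_t conditioned on a learned feature context."""

    def __init__(self, ctx_channels=64):
        super().__init__()
        self.encoder = nn.Conv2d(3 + ctx_channels, 64, kernel_size=5,
                                 stride=2, padding=2)
        self.ctx_down = nn.Conv2d(ctx_channels, ctx_channels, kernel_size=5,
                                  stride=2, padding=2)
        self.decoder = nn.ConvTranspose2d(64 + ctx_channels, 3, kernel_size=5,
                                          stride=2, padding=2, output_padding=1)

    def forward(self, x_t, context):
        # The encoder sees the frame together with the high-dimensional context,
        # so the network learns what to transmit rather than subtracting a
        # single predicted frame.
        latent = self.encoder(torch.cat([x_t, context], dim=1))
        # The decoder is conditioned on the same context (downsampled to the
        # latent resolution), helping it restore high-frequency detail.
        ctx = self.ctx_down(context)
        return self.decoder(torch.cat([latent, ctx], dim=1))
```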

Key contributions of this research include:

  1. Introduction of Conditional Coding: Using learned contextual features as the condition, instead of a predefined predicted frame, marks a significant change in encoding strategy. Unlike predictive coding, which is tied to subtraction, DCVC lets the network adaptively decide whether spatial or temporal correlation is the more economical to exploit for a given region.
  2. Flexible Conditional Framework: The paper articulates that, within the DCVC framework, contextual information supports encoding, decoding, and entropy modeling in a unified manner. Through a flexible design, this allows the framework to adaptively define conditions based on learned features rather than static pixel correlations.
  3. Entropy Modeling with Temporal Priors: By leveraging spatial-temporal correlations alongside hyperprior models in the entropy modeling process, the framework demonstrates an enhanced capability for compression, further distinguishing itself from traditional methods; a sketch of such a fused entropy model appears after this list.
  4. Experimental Superiority: In experiments on standard video test sequences, DCVC reportedly achieves a 26.0% bitrate saving over the x265 encoder with the veryslow preset on 1080P videos, indicating a clear advantage in both compression efficiency and reconstruction quality.
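The sketch below illustrates the idea behind item 3 under stated assumptions: a hyperprior feature map is fused with a temporal prior derived from the feature-domain context to predict Gaussian parameters for the quantized latent. The class name ContextualEntropyModel, the layer sizes, and the half-resolution latent are hypothetical choices for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class ContextualEntropyModel(nn.Module):
    """Predicts Gaussian parameters for the latent from a hyperprior fused
    with a temporal prior derived from the feature-domain context (sketch)."""

    def __init__(self, latent_channels=64, ctx_channels=64):
        super().__init__()
        # Temporal prior: bring the context to the latent's spatial resolution
        # (assumed here to be half the frame resolution).
        self.temporal_prior = nn.Conv2d(ctx_channels, latent_channels,
                                        kernel_size=5, stride=2, padding=2)
        # Fuse hyperprior features and temporal prior into per-element mean/scale.
        self.param_net = nn.Conv2d(2 * latent_channels, 2 * latent_channels,
                                   kernel_size=3, padding=1)

    def forward(self, quantized_latent, hyper_feat, context):
        prior = self.temporal_prior(context)
        params = self.param_net(torch.cat([hyper_feat, prior], dim=1))
        mean, scale = params.chunk(2, dim=1)
        scale = scale.abs().clamp(min=1e-6)
        # Estimated bits under a Gaussian model: probability mass of the unit
        # interval around each quantized symbol.
        gaussian = torch.distributions.Normal(mean, scale)
        prob = (gaussian.cdf(quantized_latent + 0.5)
                - gaussian.cdf(quantized_latent - 0.5))
        return -torch.log2(prob.clamp(min=1e-9)).sum()
```

A richer prior tightens the predicted distribution around the actual latent values, which directly lowers the estimated bit count; this is the mechanism by which temporal context aids the entropy model.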

Theoretical and Practical Implications

From a theoretical perspective, the DCVC framework shows that conditions can be learned dynamically, mitigating the shortcomings of earlier methods that rest on the fixed, subtraction-based correlation assumption of residue coding. This pivot suggests an advance in video coding theory that may accommodate more diverse and complex video content.

Practically, the experimental results suggest that such models could lead to implementations that reduce storage requirements or improve streaming capability without sacrificing quality. Particularly for high-resolution videos, which contain substantial high-frequency detail, the ability to decode richer contextual information could mean sharper, clearer images for end users.

Future Directions

The paper acknowledges that how the condition is defined, used, and learned remains open within the framework, suggesting that further exploration could refine or extend DCVC's capabilities. Specifically, optimizing the dimensionality of the context and addressing temporal stability across video sequences are areas ripe for development. Integrating newer methods, such as transformer-based models that capture global correlations, might also offer novel avenues for improving compression efficacy.

In conclusion, this paper provides a noteworthy advancement in the video compression field, offering a robust framework for overcoming existing limitations of predictive coding. The reported experiments underscore its practical viability, setting a precedent for future enhancements and applications in real-world video compression tasks.