
Learning Blind Video Temporal Consistency (1808.00449v1)

Published 1 Aug 2018 in cs.CV

Abstract: Applying image processing algorithms independently to each frame of a video often leads to undesired inconsistent results over time. Developing temporally consistent video-based extensions, however, requires domain knowledge for individual tasks and is unable to generalize to other applications. In this paper, we present an efficient end-to-end approach based on a deep recurrent network for enforcing temporal consistency in a video. Our method takes the original unprocessed and per-frame processed videos as inputs to produce a temporally consistent video. Consequently, our approach is agnostic to specific image processing algorithms applied on the original video. We train the proposed network by minimizing both short-term and long-term temporal losses as well as the perceptual loss to strike a balance between temporal stability and perceptual similarity with the processed frames. At test time, our model does not require computing optical flow and thus achieves real-time speed even for high-resolution videos. We show that our single model can handle multiple and unseen tasks, including but not limited to artistic style transfer, enhancement, colorization, image-to-image translation and intrinsic image decomposition. Extensive objective evaluation and subjective study demonstrate that the proposed approach performs favorably against the state-of-the-art methods on various types of videos.

Citations (338)

Summary

  • The paper introduces a novel deep recurrent neural network to minimize both short- and long-term temporal losses while preserving perceptual similarity.
  • It bypasses optical flow computation at test time, enabling real-time processing of high-resolution video at over 400 fps.
  • Extensive evaluations show the method generalizes well across tasks like colorization and style transfer, though it may struggle with drastic frame-by-frame content changes.

Learning Blind Video Temporal Consistency

This paper addresses the issue of temporal inconsistency in video processing, which arises when image processing algorithms are applied independently to each frame of a video. Such approaches often lead to temporal flickering, a visually unpleasing artifact that can disrupt the viewer’s experience. The authors propose a novel method using a deep recurrent neural network to maintain temporal consistency across video frames without specific knowledge of the image processing algorithms applied to the input videos.

The core of the proposed approach is a deep network that enforces temporal consistency by minimizing both short-term and long-term temporal losses while maintaining perceptual similarity with the processed frames. This is achieved through a combination of loss functions. The short-term temporal loss penalizes inconsistency between consecutive output frames, while the long-term loss enforces coherence between each output frame and a temporally distant reference frame (e.g., the first frame of the sequence). Additionally, a perceptual loss computed with a pre-trained VGG network ensures that the output video remains visually close to the processed frames.
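To make this combination of losses concrete, the sketch below assembles a VGG-based perceptual term and two masked temporal terms in PyTorch. It is a minimal illustration rather than the authors' implementation: the VGG layer cutoff, the loss weights, and the assumption that flow-warped previous/first output frames and visibility masks are precomputed (optical flow is needed only at training time) are assumptions made for the sake of the example.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Perceptual loss: L1 distance between VGG features of the output frame and the
# per-frame processed target. The choice of layers is illustrative, not the paper's.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:26].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(output, processed):
    return F.l1_loss(vgg(output), vgg(processed))

def temporal_loss(output_t, warped_prev_output, visibility_mask):
    # L1 distance between the current output and a flow-warped earlier output,
    # masked so that occluded / unreliable pixels do not contribute.
    return (visibility_mask * (output_t - warped_prev_output).abs()).mean()

def total_loss(output_t, processed_t,
               warped_short, mask_short,       # flow-warped O_{t-1} and its mask
               warped_long, mask_long,         # flow-warped O_1 and its mask
               w_p=10.0, w_st=100.0, w_lt=100.0):  # illustrative weights
    return (w_p  * perceptual_loss(output_t, processed_t)
          + w_st * temporal_loss(output_t, warped_short, mask_short)
          + w_lt * temporal_loss(output_t, warped_long, mask_long))
```

The warped frames and visibility masks would be produced by an off-the-shelf optical flow estimator during training; they are treated here as given inputs.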

Unlike previous methods that rely on optical flow to maintain temporal consistency, this approach requires no flow computation at test time; optical flow is used only during training to construct the temporal losses. This leads to significant efficiency improvements: the network can process high-resolution videos in real time, achieving more than 400 frames per second on 1280x720 videos. The method demonstrates applicability across various video processing tasks, including colorization, enhancement, artistic style transfer, and intrinsic image decomposition. This generality is accomplished by training the network on multiple tasks, allowing it to generalize well even to unseen applications.
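The test-time procedure can be pictured as a simple recurrent loop: each output frame is produced from the current input and processed frames together with the previous input and output frames, with no flow estimation in the loop. The sketch below fixes only that interface; the `ConsistencyNet` placeholder, the residual-correction formulation, and the layer sizes are illustrative stand-ins for the paper's actual recurrent architecture.

```python
import torch
import torch.nn as nn

class ConsistencyNet(nn.Module):
    """Toy stand-in for the recurrent image-transformation network.

    The real model is a deeper convolutional network with a recurrent unit;
    this placeholder only fixes the interface: it maps the concatenated
    inputs (4 RGB frames = 12 channels) to a 3-channel correction image.
    """
    def __init__(self, in_channels=12):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

@torch.no_grad()
def stabilize_video(model, inputs, processed):
    """inputs / processed: lists of (1, 3, H, W) tensors, one per frame.

    No optical flow is computed here; temporal information is carried only
    by feeding the previous input frame and previous output frame back in.
    """
    outputs = [processed[0]]          # the first frame is passed through as-is
    for t in range(1, len(inputs)):
        x = torch.cat([inputs[t], processed[t], inputs[t - 1], outputs[t - 1]], dim=1)
        outputs.append(processed[t] + model(x))   # predict a correction to P_t
    return outputs

# Example usage on dummy frames
model = ConsistencyNet()
frames = [torch.rand(1, 3, 128, 128) for _ in range(5)]
processed = [torch.rand(1, 3, 128, 128) for _ in range(5)]
stable = stabilize_video(model, frames, processed)
```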

The authors conduct extensive evaluations, comparing their method against state-of-the-art techniques using both objective metrics, such as temporal warping error and perceptual distance, and subjective user studies. The results indicate that their approach achieves a favorable balance between temporal stability and perceptual quality, outperforming existing methods, particularly in generalizing to tasks not seen during training.
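As a rough illustration of the objective side of this evaluation, the snippet below computes a temporal warping error between consecutive output frames: the previous output is backward-warped to the current frame with dense optical flow, and the occlusion-masked difference is averaged. The flow and occlusion mask are treated as inputs from an external estimator, and the squared-difference normalization follows a common convention rather than the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (1, C, H, W) with dense optical flow (1, 2, H, W)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).float()   # (1, 2, H, W), (x, y)
    coords = grid + flow                                       # sampling positions
    # Normalize to [-1, 1] for grid_sample, which expects (N, H, W, 2) in (x, y) order.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(frame, grid_norm, align_corners=True)

def warping_error(output_t, output_prev, flow_t_to_prev, occlusion_mask):
    """Mean masked squared difference between O_t and the warped O_{t-1}.

    `flow_t_to_prev` and `occlusion_mask` would come from an external optical
    flow / occlusion estimator; they are treated as given here.
    """
    warped_prev = warp(output_prev, flow_t_to_prev)
    valid = 1.0 - occlusion_mask
    return (valid * (output_t - warped_prev) ** 2).sum() / valid.sum().clamp(min=1.0)
```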

Despite its strong performance, the proposed method may struggle with tasks that drastically alter image content frame by frame, such as image completion or synthesis. Future work could address these limitations, possibly by incorporating stronger temporal constraints or video-specific priors into the network.

In summary, this paper makes a significant contribution to the field of video processing by presenting a robust method for achieving blind temporal stability without algorithm-specific knowledge. The efficiency and generalizability of the approach pave the way for further developments in video processing tasks, potentially influencing both theoretical advancements and practical applications in areas such as video editing and augmented reality.