Abstract

This work addresses the challenge of video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of directly developing a depth estimator from scratch, we reformulate the prediction task into a conditional generation problem. This allows us to leverage the prior knowledge embedded in existing video generation models, thereby reducing learning difficulty and enhancing generalizability. Concretely, we study how to tame the public Stable Video Diffusion (SVD) to predict reliable depth from input videos using a mixture of image depth and video depth datasets. We empirically confirm that a procedural training strategy -- first optimizing the spatial layers of SVD and then optimizing the temporal layers while keeping the spatial layers frozen -- yields the best results in terms of both spatial accuracy and temporal consistency. We further examine the sliding window strategy for inference on arbitrarily long videos. Our observations indicate a trade-off between efficiency and performance, with a one-frame overlap already producing favorable results. Extensive experimental results demonstrate the superiority of our approach, termed ChronoDepth, over existing alternatives, particularly in terms of the temporal consistency of the estimated depth. Additionally, we highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis. Our project page is available at https://jhaoshao.github.io/ChronoDepth/.

Figure: Qualitative comparison of results on the KITTI-360, ScanNet++, and MatrixCity datasets.

Overview

  • The paper by Shao et al. introduces a novel approach for monocular video depth estimation by leveraging pre-trained video generative models to achieve both high spatial accuracy and temporal consistency.

  • Key innovations include a conditional generation reformulation using a video diffusion model, optimal fine-tuning protocols, and a temporal inpainting method to maintain temporal consistency during inference.

  • Extensive experiments show their method outperforms state-of-the-art baselines in spatial accuracy and temporal consistency, demonstrating practical benefits in depth-conditioned video generation and novel view synthesis.

Learning Temporally Consistent Video Depth from Video Diffusion Priors

In the paper "Learning Temporally Consistent Video Depth from Video Diffusion Priors," Shao et al. present a novel approach to the challenge of monocular video depth estimation, focusing on achieving not only spatial accuracy but also temporal consistency. The paper reformulates the depth estimation task as a conditional generation problem, leveraging the priors embedded in pre-trained video generative models, specifically the Stable Video Diffusion (SVD) model.

Methodological Innovations

Reformulation as Conditional Generation

One of the central contributions is reformulating video depth estimation as a continuous-time denoising diffusion generation task. This framing lets the method inherit the priors of a video foundation model that has been pre-trained to generate temporally consistent video content. To keep computation manageable, the diffusion model operates in a latent space: each depth map is replicated into three channels to mimic an RGB image and then encoded by a variational autoencoder (VAE).
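
To make this concrete, the sketch below shows how a single-channel depth map can be replicated into three channels and mapped into the latent space with a frozen VAE. It assumes a diffusers-style `AutoencoderKL` interface; the normalization scheme and the `vae` object are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def encode_depth_to_latent(depth, vae):
    """Encode a single-channel depth map into the VAE latent space.

    depth: (B, 1, H, W) tensor of per-pixel depth values.
    vae:   a diffusers-style AutoencoderKL (assumed interface; the paper
           reuses the frozen SVD VAE, but this wrapper is illustrative).
    """
    # Normalize each depth map to [-1, 1] so it matches the value range
    # the VAE expects for images (the exact normalization is an assumption).
    d_min = depth.amin(dim=(2, 3), keepdim=True)
    d_max = depth.amax(dim=(2, 3), keepdim=True)
    depth = (depth - d_min) / (d_max - d_min + 1e-8) * 2.0 - 1.0

    # Replicate the single channel three times so the depth map mimics
    # an RGB image, letting the frozen image VAE encode it unchanged.
    depth_rgb = depth.repeat(1, 3, 1, 1)

    # Encode into the latent space consumed by the diffusion U-Net
    # (a model-specific latent scaling factor is usually applied afterwards).
    latent = vae.encode(depth_rgb).latent_dist.sample()
    return latent
```

At inference time, the predicted latent is decoded with the same VAE decoder and the three output channels are averaged back into a single-channel depth map, a convention common to Marigold-style pipelines and assumed here.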

Fine-Tuning Protocols

The researchers identified optimal fine-tuning protocols for adapting the video generative model into an effective depth estimator. The key insights include:

  1. Sequential Spatial-Temporal Fine-Tuning: Instead of jointly training the spatial and temporal layers, the authors found that training the spatial layers first, then freezing them while training the temporal layers, yielded better performance (a sketch of this schedule follows the list).
  2. Incorporation of Single-Frame Depth Data: They demonstrated that including single-frame depth datasets in addition to video depth datasets significantly improved both spatial accuracy and temporal consistency.
  3. Randomly Sampled Clip Length: Sampling the training clip length at random, rather than fixing it, further improved performance (also illustrated in the sketch below).
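
A minimal sketch of the two-stage schedule (point 1) and the random clip-length sampling (point 3) is shown below. The name-based spatial/temporal split and the `max_len=14` default are assumptions about an SVD-style U-Net, not the authors' exact code.

```python
import random

def set_trainable(unet, train_spatial: bool, train_temporal: bool):
    """Toggle gradients for spatial vs. temporal blocks of an SVD-style U-Net.

    The name-based split below is a placeholder heuristic: real SVD
    implementations expose temporal layers under names such as
    'temporal_transformer_blocks' (assumed; check the actual model).
    """
    for name, param in unet.named_parameters():
        is_temporal = "temporal" in name or "time_" in name
        param.requires_grad_(train_temporal if is_temporal else train_spatial)

# Stage 1: fine-tune only the spatial layers (temporal layers frozen).
#   set_trainable(unet, train_spatial=True, train_temporal=False)
# Stage 2: freeze the spatial layers, fine-tune only the temporal layers.
#   set_trainable(unet, train_spatial=False, train_temporal=True)

def sample_clip(frames, min_len=1, max_len=14):
    """Randomly sample the clip length each iteration instead of fixing it.

    frames: (B, T, C, H, W); max_len=14 mirrors SVD's default frame count
    but is an assumption here, not necessarily the paper's setting.
    """
    T = frames.shape[1]
    clip_len = random.randint(min_len, min(max_len, T))
    start = random.randint(0, T - clip_len)
    return frames[:, start:start + clip_len]
```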

Temporal Inpainting Inference

For inference on long videos, the authors introduced a temporal inpainting strategy to incorporate temporal information from overlapping frames between adjacent clips. This method showed a marked improvement in temporal consistency with minimal computational overhead, striking a balance between efficiency and performance.
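
The sketch below illustrates the general sliding-window idea with an overlap between adjacent clips. The `predict_clip` callable stands in for the model's temporal-inpainting denoising pass and is an assumed interface, not the paper's actual API.

```python
import torch

@torch.no_grad()
def predict_long_video_depth(frames, predict_clip, clip_len=14, overlap=1):
    """Sliding-window depth inference with overlapping frames between clips.

    frames:       (T, C, H, W) input video.
    predict_clip: callable mapping (clip, known_depth) -> (t, H, W) depth,
                  where known_depth holds the previous window's depth for the
                  first `overlap` frames; it stands in for the paper's
                  temporal-inpainting denoising pass (assumed interface).
    """
    T = frames.shape[0]
    stride = clip_len - overlap
    depths, known = [], None

    start = 0
    while start < T:
        clip = frames[start:start + clip_len]
        depth = predict_clip(clip, known)          # depth for this window
        # Keep only newly estimated frames; the overlapping ones were
        # already produced (and fixed) by the previous window.
        depths.append(depth if start == 0 else depth[overlap:])
        known = depth[-overlap:]                   # anchors the next window
        start += stride

    return torch.cat(depths, dim=0)[:T]
```

With `overlap=1`, this mirrors the paper's observation that a single overlapping frame already yields a favorable trade-off between efficiency and temporal consistency.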

Experimental Results

The paper presents extensive experimental evaluations that underscore the effectiveness of their approach:

  • Spatial Accuracy and Temporal Consistency: The proposed method outperforms several state-of-the-art baselines, including discriminative models (DPT, Depth Anything) and generative models (Marigold). In particular, it achieves superior temporal consistency metrics while maintaining comparable spatial accuracy.
  • Downstream Applications: The researchers highlighted the practical benefits of their approach in two applications: depth-conditioned video generation and novel view synthesis. Their method significantly improved both the quality and consistency of generated videos and the fidelity of 3D reconstruction in novel view synthesis.

Quantitative Analysis

The experiments demonstrated concrete numerical improvements:

  • On the KITTI-360 dataset, the proposed method achieved a Sim. (multi-frame similarity) score of 0.91, outperforming all baseline methods, including Marigold, which scored 1.12 (lower is better on this metric).
  • For depth-conditioned video generation, using the depth maps from their method yielded lower FVD scores (292.4) than baseline methods, indicating superior temporal consistency in the generated videos.

Implications and Future Directions

The implications of this research are multifaceted. The presented approach opens new avenues for integrating video generative models into other tasks requiring temporal consistency, beyond just depth estimation. Given the robust empirical findings, future research could focus on extending this methodology to other domains within computer vision, such as motion estimation and video frame interpolation.

Further developments could involve exploring more sophisticated training protocols to enhance the integration of spatial and temporal features and investigating the impact of different types of foundational video models. Potential advancements in hardware acceleration and optimization techniques for diffusion models could also drive the practical applicability of methods like those presented in this paper.

In summary, the work by Shao et al. represents a significant step in the evolution of video depth estimation, providing a method that successfully integrates the strengths of video generative models to achieve temporally consistent depth predictions. Their insights and methodologies lay a strong foundation for future advancements in the field.
