Abstract

This work addresses the challenge of video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of directly developing a depth estimator from scratch, we reformulate the prediction task into a conditional generation problem. This allows us to leverage the prior knowledge embedded in existing video generation models, thereby reducing learning difficulty and enhancing generalizability. Concretely, we study how to tame the public Stable Video Diffusion (SVD) to predict reliable depth from input videos using a mixture of image depth and video depth datasets. We empirically confirm that a procedural training strategy -- first optimizing the spatial layers of SVD and then optimizing the temporal layers while keeping the spatial layers frozen -- yields the best results in terms of both spatial accuracy and temporal consistency. We further examine the sliding window strategy for inference on arbitrarily long videos. Our observations indicate a trade-off between efficiency and performance, with a one-frame overlap already producing favorable results. Extensive experimental results demonstrate the superiority of our approach, termed ChronoDepth, over existing alternatives, particularly in terms of the temporal consistency of the estimated depth. Additionally, we highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis. Our project page is available at https://jhaoshao.github.io/ChronoDepth/.

Figure: Qualitative comparison of results on the KITTI-360, ScanNet++, and MatrixCity datasets.

Overview

  • The paper by Shao et al. introduces a novel approach for monocular video depth estimation by leveraging pre-trained video generative models to achieve both high spatial accuracy and temporal consistency.

  • Key innovations include a conditional generation reformulation using a video diffusion model, optimal fine-tuning protocols, and a temporal inpainting method to maintain temporal consistency during inference.

  • Extensive experiments show their method outperforms state-of-the-art baselines in spatial accuracy and temporal consistency, demonstrating practical benefits in depth-conditioned video generation and novel view synthesis.

Learning Temporally Consistent Video Depth from Video Diffusion Priors

In the paper "Learning Temporally Consistent Video Depth from Video Diffusion Priors," Shao et al. present a novel approach to the challenge of monocular video depth estimation, focusing on achieving not only spatial accuracy but also temporal consistency. The paper reformulates the depth estimation task as a conditional generation problem, leveraging the priors embedded in pre-trained video generative models, specifically the Stable Video Diffusion (SVD) model.

Methodological Innovations

Reformulation as Conditional Generation

One of the central contributions is reformulating video depth estimation as a continuous-time denoising diffusion generation task. This framing lets the method inherit the priors of a video foundation model that has been pre-trained to generate temporally consistent video content. To keep computation manageable, the diffusion model operates in a latent space: each depth map is replicated into three channels to mimic an RGB image and then encoded by a variational autoencoder (VAE).
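
To make this concrete, the sketch below shows how a single-channel depth map can be replicated into three channels and mapped into the latent space with a frozen VAE. It assumes a diffusers-style `AutoencoderKL` interface; the normalization scheme and the `vae` object are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def encode_depth_to_latent(depth, vae):
    """Encode a single-channel depth map into the VAE latent space.

    depth: (B, 1, H, W) tensor of per-pixel depth values.
    vae:   a diffusers-style AutoencoderKL (assumed interface; the paper
           reuses the frozen SVD VAE, but this wrapper is illustrative).
    """
    # Normalize each depth map to [-1, 1] so it matches the value range
    # the VAE expects for images (the exact normalization is an assumption).
    d_min = depth.amin(dim=(2, 3), keepdim=True)
    d_max = depth.amax(dim=(2, 3), keepdim=True)
    depth = (depth - d_min) / (d_max - d_min + 1e-8) * 2.0 - 1.0

    # Replicate the single channel three times so the depth map mimics
    # an RGB image, letting the frozen image VAE encode it unchanged.
    depth_rgb = depth.repeat(1, 3, 1, 1)

    # Encode into the latent space consumed by the diffusion U-Net
    # (a model-specific latent scaling factor is usually applied afterwards).
    latent = vae.encode(depth_rgb).latent_dist.sample()
    return latent
```

At inference time, the predicted latent is decoded with the same VAE decoder and the three output channels are averaged back into a single-channel depth map, a convention common to Marigold-style pipelines and assumed here.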

Fine-Tuning Protocols

The researchers identified optimal fine-tuning protocols for adapting the video generative model into an effective depth estimator. The key insights include:

  1. Sequential Spatial-Temporal Fine-Tuning: Instead of jointly training the spatial and temporal layers, the authors found that training the spatial layers first, then freezing them while training the temporal layers, yielded better performance (a sketch of this schedule follows the list).
  2. Incorporation of Single-Frame Depth Data: They demonstrated that including single-frame depth datasets in addition to video depth datasets significantly improved both spatial accuracy and temporal consistency.
  3. Randomly Sampled Clip Length: Sampling the training clip length at random, rather than fixing it, further improved performance (also illustrated in the sketch below).
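
A minimal sketch of the two-stage schedule (point 1) and the random clip-length sampling (point 3) is shown below. The name-based spatial/temporal split and the `max_len=14` default are assumptions about an SVD-style U-Net, not the authors' exact code.

```python
import random

def set_trainable(unet, train_spatial: bool, train_temporal: bool):
    """Toggle gradients for spatial vs. temporal blocks of an SVD-style U-Net.

    The name-based split below is a placeholder heuristic: real SVD
    implementations expose temporal layers under names such as
    'temporal_transformer_blocks' (assumed; check the actual model).
    """
    for name, param in unet.named_parameters():
        is_temporal = "temporal" in name or "time_" in name
        param.requires_grad_(train_temporal if is_temporal else train_spatial)

# Stage 1: fine-tune only the spatial layers (temporal layers frozen).
#   set_trainable(unet, train_spatial=True, train_temporal=False)
# Stage 2: freeze the spatial layers, fine-tune only the temporal layers.
#   set_trainable(unet, train_spatial=False, train_temporal=True)

def sample_clip(frames, min_len=1, max_len=14):
    """Randomly sample the clip length each iteration instead of fixing it.

    frames: (B, T, C, H, W); max_len=14 mirrors SVD's default frame count
    but is an assumption here, not necessarily the paper's setting.
    """
    T = frames.shape[1]
    clip_len = random.randint(min_len, min(max_len, T))
    start = random.randint(0, T - clip_len)
    return frames[:, start:start + clip_len]
```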

Temporal Inpainting Inference

For inference on long videos, the authors introduced a temporal inpainting strategy to incorporate temporal information from overlapping frames between adjacent clips. This method showed a marked improvement in temporal consistency with minimal computational overhead, striking a balance between efficiency and performance.
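
The sketch below illustrates the general sliding-window idea with an overlap between adjacent clips. The `predict_clip` callable stands in for the model's temporal-inpainting denoising pass and is an assumed interface, not the paper's actual API.

```python
import torch

@torch.no_grad()
def predict_long_video_depth(frames, predict_clip, clip_len=14, overlap=1):
    """Sliding-window depth inference with overlapping frames between clips.

    frames:       (T, C, H, W) input video.
    predict_clip: callable mapping (clip, known_depth) -> (t, H, W) depth,
                  where known_depth holds the previous window's depth for the
                  first `overlap` frames; it stands in for the paper's
                  temporal-inpainting denoising pass (assumed interface).
    """
    T = frames.shape[0]
    stride = clip_len - overlap
    depths, known = [], None

    start = 0
    while start < T:
        clip = frames[start:start + clip_len]
        depth = predict_clip(clip, known)          # depth for this window
        # Keep only newly estimated frames; the overlapping ones were
        # already produced (and fixed) by the previous window.
        depths.append(depth if start == 0 else depth[overlap:])
        known = depth[-overlap:]                   # anchors the next window
        start += stride

    return torch.cat(depths, dim=0)[:T]
```

With `overlap=1`, this mirrors the paper's observation that a single overlapping frame already yields a favorable trade-off between efficiency and temporal consistency.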

Experimental Results

The paper presents extensive experimental evaluations that underscore the effectiveness of their approach:

  • Spatial Accuracy and Temporal Consistency: The proposed method outperforms several state-of-the-art baselines, including discriminative models (DPT, Depth Anything) and generative models (Marigold). In particular, it achieves superior temporal consistency metrics while maintaining comparable spatial accuracy.
  • Downstream Applications: The researchers highlighted the practical benefits of their approach in two applications: depth-conditioned video generation and novel view synthesis. Their method significantly improved both the quality and consistency of generated videos and the fidelity of 3D reconstruction in novel view synthesis.

Quantitative Analysis

The experiments demonstrated concrete numerical improvements:

  • On the KITTI-360 dataset, the proposed method achieved a Sim. (multi-frame similarity) score of 0.91, outperforming all baseline methods, including Marigold, which scored 1.12 (lower is better on this metric).
  • For depth-conditioned video generation, using the depth maps from their method yielded lower FVD scores (292.4) than baseline methods, indicating superior temporal consistency in the generated videos.

Implications and Future Directions

The implications of this research are multifaceted. The presented approach opens new avenues for integrating video generative models into other tasks requiring temporal consistency, beyond just depth estimation. Given the robust empirical findings, future research could focus on extending this methodology to other domains within computer vision, such as motion estimation and video frame interpolation.

Further developments could involve exploring more sophisticated training protocols to enhance the integration of spatial and temporal features and investigating the impact of different types of foundational video models. Potential advancements in hardware acceleration and optimization techniques for diffusion models could also drive the practical applicability of methods like those presented in this paper.

In summary, the work by Shao et al. represents a significant step in the evolution of video depth estimation, providing a method that successfully integrates the strengths of video generative models to achieve temporally consistent depth predictions. Their insights and methodologies lay a strong foundation for future advancements in the field.
