SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes (2211.03660v2)

Published 7 Nov 2022 in cs.CV

Abstract: Self-supervised monocular depth estimation has shown impressive results in static scenes. It relies on the multi-view consistency assumption for training networks, however, that is violated in dynamic object regions and occlusions. Consequently, existing methods show poor accuracy in dynamic scenes, and the estimated depth map is blurred at object boundaries because they are usually occluded in other training views. In this paper, we propose SC-DepthV3 for addressing the challenges. Specifically, we introduce an external pretrained monocular depth estimation model for generating single-image depth prior, namely pseudo-depth, based on which we propose novel losses to boost self-supervised training. As a result, our model can predict sharp and accurate depth maps, even when training from monocular videos of highly-dynamic scenes. We demonstrate the significantly superior performance of our method over previous methods on six challenging datasets, and we provide detailed ablation studies for the proposed terms. Source code and data will be released at https://github.com/JiawangBian/sc_depth_pl

Summary

The paper introduces SC-DepthV3, which integrates pretrained pseudo-depth cues to overcome limitations in dynamic scenes.
It employs Dynamic Region Refinement and Local Structure Refinement to enhance edge fidelity and depth accuracy near moving objects.
Experimental results demonstrate significant performance gains on diverse datasets, promising improvements for autonomous driving and robotics.

Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes

The paper "SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes" targets self-supervised monocular depth estimation in dynamic scenes, where moving objects and occlusions break the assumptions that existing methods rely on during training.

Summary of Contributions

Existing self-supervised methods train on the multi-view consistency assumption, which dynamic objects and occlusions violate. As a result, depth accuracy degrades in dynamic scenes and depth maps are blurred at object boundaries. To address this, the authors propose SC-DepthV3, which uses an external pretrained monocular depth model to generate a single-image depth prior (pseudo-depth) and builds novel loss functions on it to strengthen self-supervised training. The resulting model predicts sharp and accurate depth maps even when trained on monocular videos of highly dynamic scenes.
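To see why a moving object violates multi-view consistency, consider a deliberately simplified toy setup that is not taken from the paper: a pinhole camera translates sideways between two frames while the observed point may also translate sideways. The function name and parameters below are hypothetical. Training on view consistency implicitly attributes all image displacement to camera motion, so the depth that best explains a moving point differs from its true depth.

```python
def apparent_depth(true_depth, focal, camera_shift, object_shift=0.0):
    """Depth recovered from two views under a static-scene assumption.

    Toy setup (an assumption, not the paper's formulation): the camera
    translates sideways by `camera_shift` and the observed point translates
    sideways by `object_shift` between the two frames.
    """
    # Image displacement actually observed between the two frames.
    disparity = focal * (camera_shift - object_shift) / true_depth
    if disparity == 0:
        # The point moves exactly with the camera and looks infinitely far.
        return float("inf")
    # Triangulation that wrongly attributes all displacement to the camera.
    return focal * camera_shift / disparity


static = apparent_depth(10.0, focal=500.0, camera_shift=1.0)
moving = apparent_depth(10.0, focal=500.0, camera_shift=1.0, object_shift=0.5)
print(static, moving)  # 10.0 20.0
```

In this toy case the static point is recovered at its true depth of 10, while the point moving at half the camera's speed appears twice as far away. A point moving exactly with the camera appears infinitely distant.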

Methodology

The proposed approach involves the following key components:

Dynamic Region Refinement (DRR): This module reduces depth errors in dynamic regions. It derives depth ranking (which of two points is farther away) from the pseudo-depth and uses that ranking to regularize self-supervised training. This avoids both explicit object-motion modeling and the exclusion of dynamic regions from training.
Local Structure Refinement (LSR): Targeting object boundaries and local structure, this module applies normal matching and relative normal angle constraints between the predicted depth and the pseudo-depth. These constraints refine edge details and keep depth predictions coherent across object boundaries, where self-supervised frameworks are often weak.
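The two components above can be illustrated with a minimal sketch. It uses a common pairwise ranking formulation and a simple L1 comparison of surface normals; the paper's exact point sampling, weighting, and relative normal angle term are not reproduced here. All function names, the ratio threshold `tau`, and the camera parameters `focal`, `cx`, and `cy` are hypothetical choices for illustration, and depths are assumed to be positive.

```python
import math


def ordinal_label(pseudo_a, pseudo_b, tau=0.15):
    """+1 if a is clearly farther than b in pseudo-depth, -1 if closer, else 0."""
    ratio = pseudo_a / pseudo_b
    if ratio > 1.0 + tau:
        return 1
    if ratio < 1.0 / (1.0 + tau):
        return -1
    return 0


def ranking_loss(pred, pseudo, pairs, tau=0.15):
    """Penalize predicted depth orderings that disagree with pseudo-depth."""
    total = 0.0
    for a, b in pairs:
        diff = pred[a] - pred[b]
        label = ordinal_label(pseudo[a], pseudo[b], tau)
        if label == 0:
            total += diff * diff  # similar pseudo-depth: pull predictions together
        else:
            total += math.log1p(math.exp(-label * diff))
    return total / len(pairs)


def backproject(depth, u, v, focal, cx, cy):
    """Lift pixel (u, v) to a 3D point with a pinhole camera model."""
    d = depth[v][u]
    return (d * (u - cx) / focal, d * (v - cy) / focal, d)


def normal_at(depth, u, v, focal, cx, cy):
    """Unit surface normal from the right and lower neighbours of a pixel."""
    p = backproject(depth, u, v, focal, cx, cy)
    px = backproject(depth, u + 1, v, focal, cx, cy)
    py = backproject(depth, u, v + 1, focal, cx, cy)
    ax, ay, az = (px[i] - p[i] for i in range(3))
    bx, by, bz = (py[i] - p[i] for i in range(3))
    n = (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)
    norm = math.sqrt(sum(c * c for c in n))
    return tuple(c / norm for c in n)


def normal_matching_loss(pred, pseudo, focal, cx, cy):
    """Mean L1 difference between normals of predicted depth and pseudo-depth."""
    h, w = len(pred), len(pred[0])
    total, count = 0.0, 0
    for v in range(h - 1):
        for u in range(w - 1):
            n_pred = normal_at(pred, u, v, focal, cx, cy)
            n_pseudo = normal_at(pseudo, u, v, focal, cx, cy)
            total += sum(abs(a - b) for a, b in zip(n_pred, n_pseudo))
            count += 1
    return total / count


# Ranking: only the ordering in pseudo-depth matters, not its scale.
pseudo_pts = [2.0, 5.0, 5.1]
pairs = [(0, 1), (0, 2), (1, 2)]
loss_good = ranking_loss([1.0, 3.0, 3.0], pseudo_pts, pairs)
loss_bad = ranking_loss([3.0, 1.0, 1.0], pseudo_pts, pairs)

# Normals: a slanted surface matches a rescaled copy of itself, not a flat plane.
slanted = [[2.0 + 0.5 * u for u in range(3)] for _ in range(3)]
rescaled = [[3.0 * d for d in row] for row in slanted]
flat = [[2.0] * 3 for _ in range(3)]
loss_same_shape = normal_matching_loss(slanted, rescaled, 100.0, 1.0, 1.0)
loss_diff_shape = normal_matching_loss(slanted, flat, 100.0, 1.0, 1.0)
```

Both terms depend only on relative structure (orderings and surface orientation), which is why a pseudo-depth with unknown scale can still supervise them in this sketch.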

Experimental Results

The effectiveness of SC-DepthV3 is demonstrated on six datasets covering dynamic and static scenes: DDAD, BONN, TUM, KITTI, NYUv2, and IBims-1. Notably, the proposed method delivers superior performance on highly dynamic datasets like DDAD and BONN, where fast-moving objects present significant challenges. The improvements are evident in metrics such as Absolute Relative Error (AbsRel) and the accuracy under various thresholds, $\delta_i$. The method also produces sharper depth predictions at object boundaries.
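For reference, the metrics mentioned above are conventionally computed as in the following sketch, using the standard definitions of AbsRel and threshold accuracy. The median scale alignment is an assumption here: it is common practice when evaluating scale-ambiguous monocular predictions, but the exact evaluation protocol should be checked against the paper. The function name and the `align_scale` parameter are hypothetical.

```python
from statistics import median


def depth_metrics(pred, gt, align_scale=True):
    """Return AbsRel and the accuracies delta_1..delta_3 over valid pixels."""
    # Keep only pixels with valid (positive) ground truth and prediction.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    if align_scale:
        # Assumed protocol: match the prediction's median to the ground truth's.
        scale = median(g for _, g in pairs) / median(p for p, _ in pairs)
        pairs = [(p * scale, g) for p, g in pairs]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # delta_i is the fraction of pixels with max(pred/gt, gt/pred) < 1.25 ** i.
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = [sum(r < 1.25 ** i for r in ratios) / n for i in (1, 2, 3)]
    return abs_rel, deltas


gt = [1.0, 2.0, 4.0, 8.0]
pred = [0.5, 1.0, 2.0, 6.0]
abs_rel, deltas = depth_metrics(pred, gt)
print(abs_rel, deltas)  # 0.125 [0.75, 1.0, 1.0]
```

Lower AbsRel is better, while higher threshold accuracy is better.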

Implications and Future Directions

SC-DepthV3 makes self-supervised monocular depth estimation more robust in scenes dominated by dynamic elements. Its use of pseudo-depth as a supplemental cue in self-supervised training suggests a new direction for leveraging pretrained models when ground-truth labels are unavailable.

From a practical perspective, SC-DepthV3 has the potential to benefit applications such as autonomous driving and robotics, where reliable depth estimation in dynamic environments is crucial. From a theoretical perspective, it invites closer study of how supervised and self-supervised learning interact, potentially leading to hybrid approaches that combine the advantages of both.

Future research might focus on refining the incorporation of pseudo-depth for various environments and exploring additional cues from other pretrained models to supplement depth estimation in complex scenarios. The successful application of SC-DepthV3 on static datasets like KITTI and NYUv2 also suggests its adaptability and the potential for further optimization across different domains.

In conclusion, SC-DepthV3 represents a significant advancement in self-supervised monocular depth estimation, addressing key challenges associated with dynamic scenes and demonstrating substantial improvements in depth map quality and accuracy.