Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance

Published 14 Jul 2020 in cs.CV | (2007.06936v2)

Abstract: Self-supervised monocular depth estimation presents a powerful method to obtain 3D scene information from single camera images, which is trainable on arbitrary image sequences without requiring depth labels, e.g., from a LiDAR sensor. In this work we present a new self-supervised semantically-guided depth estimation (SGDepth) method to deal with moving dynamic-class (DC) objects, such as moving cars and pedestrians, which violate the static-world assumptions typically made during training of such models. Specifically, we propose (i) mutually beneficial cross-domain training of (supervised) semantic segmentation and self-supervised depth estimation with task-specific network heads, (ii) a semantic masking scheme providing guidance to prevent moving DC objects from contaminating the photometric loss, and (iii) a detection method for frames with non-moving DC objects, from which the depth of DC objects can be learned. We demonstrate the performance of our method on several benchmarks, in particular on the Eigen split, where we exceed all baselines without test-time refinement.

Authors (4)

Citations (298)

Summary

The paper proposes cross-domain training in which supervised semantic segmentation guides self-supervised monocular depth estimation, mitigating errors caused by moving dynamic-class (DC) objects.
It introduces a semantic masking scheme that keeps moving DC objects from contaminating the photometric loss, together with a method for detecting frames whose DC objects are not moving, so that the depth of such objects can still be learned.
Experimental results on the KITTI Eigen split show improved Abs Rel and RMSE over the baselines, without test-time refinement.

Analyzing Self-Supervised Monocular Depth Estimation with Semantic Guidance

The paper presents a method for improving self-supervised monocular depth estimation by integrating semantic guidance. It addresses a significant challenge in this setting, the dynamic object problem: moving objects, such as cars and pedestrians, violate the static-world assumption that is typically made during model training.

Methodological Contributions

Cross-Domain Training: The study proposes a mutually beneficial training regime that combines supervised semantic segmentation with self-supervised depth estimation. The two tasks use task-specific network heads and can be trained on different datasets, so the features learned for one task also benefit the other.
Semantic Masking: To address the errors caused by dynamic objects, the paper introduces a semantic masking scheme. Guided by the predicted segmentation, it prevents moving dynamic-class (DC) objects from contaminating the photometric loss by excluding them from the loss computation.
Frame Detection for Non-Moving DC Objects: The authors develop a technique for detecting frames in which the DC objects are not moving. Because such frames satisfy the static-world assumption, they are not masked, and the model can learn the depth of DC objects from them.
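
The masking and frame-detection ideas above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the plain L1 error standing in for the photometric loss, the use of an intersection-over-union (IoU) test between the target DC mask and the warped source DC mask to decide whether DC objects moved, and the threshold value of 0.5 are all assumptions made for illustration. Images and masks are flattened to per-pixel lists to keep the example dependency-free.

```python
from typing import Sequence


def masked_photometric_loss(
    target: Sequence[float],
    warped: Sequence[float],
    exclude: Sequence[bool],
) -> float:
    """Mean absolute error over the pixels that are NOT excluded.

    A plain L1 error is used here as a stand-in for the photometric
    loss (assumption). exclude[i] is True for pixels to leave out.
    """
    if not (len(target) == len(warped) == len(exclude)):
        raise ValueError("inputs must have the same number of pixels")
    errors = [abs(t - w) for t, w, m in zip(target, warped, exclude) if not m]
    # If every pixel is excluded there is nothing to supervise.
    return sum(errors) / len(errors) if errors else 0.0


def dc_mask_iou(mask_a: Sequence[bool], mask_b: Sequence[bool]) -> float:
    """Intersection over union of two dynamic-class (DC) masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # No DC pixels in either mask: treat the frame as static.
    return inter / union if union else 1.0


def frame_loss(
    target: Sequence[float],
    warped: Sequence[float],
    dc_target: Sequence[bool],
    dc_warped: Sequence[bool],
    iou_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> float:
    """Mask DC pixels only when the DC objects appear to have moved."""
    if dc_mask_iou(dc_target, dc_warped) >= iou_threshold:
        # Masks overlap well: DC objects look static, keep every pixel
        # so that their depth can be learned from this frame.
        exclude = [False] * len(target)
    else:
        # Masks disagree: DC objects likely moved, exclude their pixels.
        exclude = [a or b for a, b in zip(dc_target, dc_warped)]
    return masked_photometric_loss(target, warped, exclude)
```

In this sketch a frame whose DC masks overlap well contributes all of its pixels to the loss, while a frame whose masks disagree contributes only its non-DC pixels.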

Experimental Results and Analysis

The authors evaluated the method on several benchmarks, notably the KITTI Eigen split, where it outperformed the baselines on key metrics such as Absolute Relative Error (Abs Rel) and root mean squared error (RMSE) without test-time refinement. These improvements are attributed to the synergy between the segmentation and depth estimation tasks, which promotes sharper object boundaries in the predicted depth.
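
For reference, the two metrics named above follow their standard definitions: Abs Rel is the mean of |d_pred - d_gt| / d_gt, and RMSE is the square root of the mean squared difference, both taken over pixels with valid ground truth. The sketch below is a minimal illustration of those formulas; the function names and the convention that a ground-truth value of 0 marks an invalid pixel are assumptions, and it omits any benchmark-specific steps such as depth capping or scale alignment.

```python
import math
from typing import List, Sequence, Tuple


def _valid_pairs(pred: Sequence[float], gt: Sequence[float]) -> List[Tuple[float, float]]:
    """Keep only pixels with positive ground-truth depth (assumed validity rule)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no pixels with valid ground-truth depth")
    return pairs


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Absolute relative error: mean(|pred - gt| / gt)."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def rmse(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Root mean squared error: sqrt(mean((pred - gt)^2))."""
    pairs = _valid_pairs(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))
```

Lower is better for both metrics. Abs Rel weights errors relative to the true distance, whereas RMSE is dominated by large absolute errors, which tend to occur at far range.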

The paper further extends its analysis to the KITTI depth prediction benchmark, performing competitively against both self-supervised and supervised models. Here, the method narrows the performance gap with supervised methods, indicating strong generalization capabilities despite the absence of explicit depth supervision during training.

Implications and Future Directions

The integration of semantic guidance into self-supervised depth estimation frameworks holds considerable promise for applications in autonomous driving and augmented reality. Rather than extending the geometric projection model to account for moving objects, which would add complexity, the paper opts for a simpler yet effective masking strategy.

Future research directions could explore the refinement of pose estimation, a component that showed potential inefficiencies when jointly optimized with semantic segmentation. Furthermore, expanding the model's adaptability across various environments and lighting conditions could improve robustness and real-world applicability.

Conclusion

This work offers a practical solution to the dynamic object problem in self-supervised depth estimation and demonstrates the utility of semantic guidance for improving model performance. By combining semantic segmentation with depth estimation, the proposed approach yields a model that performs well on standard benchmarks and also illustrates the broader applicability of cross-task learning in perception systems.
