Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry (1807.02570v2)

Published 6 Jul 2018 in cs.CV

Abstract: Monocular visual odometry approaches that purely rely on geometric cues are prone to scale drift and require sufficient motion parallax in successive frames for motion estimation and 3D reconstruction. In this paper, we propose to leverage deep monocular depth prediction to overcome limitations of geometry-based monocular visual odometry. To this end, we incorporate deep depth predictions into Direct Sparse Odometry (DSO) as direct virtual stereo measurements. For depth prediction, we design a novel deep network that refines predicted depth from a single image in a two-stage process. We train our network in a semi-supervised way on photoconsistency in stereo images and on consistency with accurate sparse depth reconstructions from Stereo DSO. Our deep predictions excel state-of-the-art approaches for monocular depth on the KITTI benchmark. Moreover, our Deep Virtual Stereo Odometry clearly exceeds previous monocular and deep learning based methods in accuracy. It even achieves comparable performance to the state-of-the-art stereo methods, while only relying on a single camera.

Summary

The paper introduces DVSO, a method that integrates deep depth prediction into the DSO framework to mitigate scale drift in monocular visual odometry.
It employs the StackNet architecture, combining SimpleNet and ResidualNet, to initialize depths and incorporate virtual stereo constraints into optimization.
Experiments on the KITTI dataset demonstrate that DVSO significantly outperforms traditional monocular methods while achieving competitive accuracy with stereo systems.

An Expert Overview of Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry

The paper "Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry" presents an innovative approach to enhance monocular visual odometry (VO) using deep learning techniques. The authors introduce Deep Virtual Stereo Odometry (DVSO) as a method that integrates deep monocular depth predictions into the Direct Sparse Odometry (DSO) framework. This integration aims to enhance the accuracy and robustness of monocular VO systems by mitigating common issues such as scale drift.

Background and Motivation

Monocular VO that relies purely on geometric cues has no direct depth measurement, so absolute scale is unobservable and the estimated scale drifts over time. It also needs sufficient motion parallax between successive frames to estimate motion and reconstruct 3D structure. Stereo or multi-sensor setups address these problems, but at the price of added cost and complexity. The authors instead propose a monocular system that reaches accuracy comparable to stereo methods by using a deep network to predict metric depth from a single image, drawing on priors learned from training data. This makes the approach attractive for applications such as autonomous driving and robotics, where cost and sensor simplicity matter.

Methodology

The core contribution of this research is the integration of deep learning-based depth prediction into the DSO pipeline. The authors design a network, StackNet, that predicts disparity from a single image in two stages: a first sub-network (SimpleNet) produces an initial estimate, and a second (ResidualNet) refines it. The network is trained in a semi-supervised way. The self-supervised part enforces photometric consistency between the images of a stereo pair, and the supervised part enforces consistency with accurate sparse depth reconstructions obtained from Stereo DSO, which stand in for ground-truth depth.
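The two training signals described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' loss: the function names, the plain L1 penalties, and the weight `w_sup` are assumptions made here, and any other terms or weights the authors use are omitted because the summary does not describe them.

```python
import numpy as np

def warp_right_to_left(right_img, left_disp):
    """Reconstruct the left image by sampling the right image at x - disparity.

    Assumes a rectified stereo pair and a disparity map given in pixels.
    Uses linear interpolation along each row; samples are clamped at the border.
    """
    h, w = right_img.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - left_disp
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right_img[rows, x0] + frac * right_img[rows, x1]

def semi_supervised_loss(left_img, right_img, left_disp,
                         sparse_disp, sparse_mask, w_sup=1.0):
    """Photometric (self-supervised) term plus sparse (supervised) term.

    sparse_disp / sparse_mask are hypothetical inputs: disparities derived from
    a sparse stereo reconstruction, and a boolean mask marking where they exist.
    """
    photo = np.mean(np.abs(warp_right_to_left(right_img, left_disp) - left_img))
    n_valid = max(int(sparse_mask.sum()), 1)
    sup = np.sum(np.abs(left_disp - sparse_disp) * sparse_mask) / n_valid
    return photo + w_sup * sup, photo, sup

# Synthetic check: a horizontal intensity ramp shifted by a constant disparity of 2 px.
h, w = 4, 16
right_img = np.tile(np.arange(w, dtype=np.float64), (h, 1))
left_img = np.maximum(right_img - 2.0, 0.0)
true_disp = np.full((h, w), 2.0)
sparse_mask = np.zeros((h, w), dtype=bool)
sparse_mask[::2, ::4] = True  # only a few pixels carry supervision

loss_good, photo_good, sup_good = semi_supervised_loss(
    left_img, right_img, true_disp, true_disp, sparse_mask)
loss_bad, photo_bad, sup_bad = semi_supervised_loss(
    left_img, right_img, true_disp + 1.0, true_disp, sparse_mask)
```

With the correct disparity both terms vanish on this synthetic pair, while a disparity that is off by one pixel is penalized by both the photometric and the sparse term.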

In DVSO, the disparity predictions inform the DSO in two main ways:

Depth Initialization: When new keyframes are added, the depth information is initialized using predictions from StackNet, providing a metric scale that reduces the drift inherent to monocular systems.
Virtual Stereo Constraints: The paper introduces novel virtual stereo terms in the DSO's optimization process. These terms enforce consistency between estimated depths and deep-learning predictions, effectively utilizing depth predictions as additional constraints.
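Both uses of the predicted disparity can be illustrated with the standard rectified-stereo relation depth = focal length × baseline / disparity. The sketch below is a simplified stand-in, not the paper's formulation: it measures the residual in disparity space, and the function names, focal length, and baseline are illustrative assumptions.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, eps=1e-6):
    """Metric depth from disparity for a rectified (virtual) stereo pair."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    return focal_px * baseline_m / np.maximum(disparity_px, eps)

def depth_to_disparity(depth_m, focal_px, baseline_m, eps=1e-6):
    """Disparity that a point at the given depth would have in the virtual pair."""
    depth_m = np.asarray(depth_m, dtype=np.float64)
    return focal_px * baseline_m / np.maximum(depth_m, eps)

def virtual_stereo_residual(estimated_depth_m, predicted_disparity_px,
                            focal_px, baseline_m):
    """Simplified consistency residual (in pixels) between the depth being
    optimized and the network's disparity prediction. Zero means they agree."""
    implied = depth_to_disparity(estimated_depth_m, focal_px, baseline_m)
    return implied - np.asarray(predicted_disparity_px, dtype=np.float64)

# Illustrative values only (assumed, roughly the magnitude of a car-mounted rig).
focal_px = 720.0
baseline_m = 0.54
predicted_disparity = np.array([38.88, 19.44, 9.72])

# 1) Depth initialization: new points start at the metric depth implied by the network.
initial_depth = disparity_to_depth(predicted_disparity, focal_px, baseline_m)

# 2) Virtual stereo constraint: a depth estimate that has drifted in scale by 10%
#    no longer agrees with the prediction and yields a non-zero residual.
residual_at_init = virtual_stereo_residual(
    initial_depth, predicted_disparity, focal_px, baseline_m)
residual_drifted = virtual_stereo_residual(
    1.1 * initial_depth, predicted_disparity, focal_px, baseline_m)
```

A residual of this kind pulls the optimized depths back toward the metric scale carried by the predictions, which is the intuition behind using them to counter scale drift.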

Results and Evaluation

The authors evaluate both components on KITTI. The depth network on its own outperforms state-of-the-art monocular depth prediction approaches on the KITTI benchmark. The full DVSO system clearly exceeds previous monocular and deep-learning-based odometry methods in accuracy, and achieves performance comparable to state-of-the-art stereo methods while relying on only a single camera.
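For readers unfamiliar with how monocular depth accuracy is usually quantified, the sketch below computes three error measures that are widely used in the monocular depth literature. The summary does not state which metrics the paper reports, so the choice of metrics, the function name, and the sample values are assumptions for illustration only.

```python
import numpy as np

def depth_metrics(pred_m, gt_m):
    """Common depth error measures over pixels with valid ground truth.

    Returns (absolute relative error, RMSE in meters, fraction of pixels whose
    ratio max(pred/gt, gt/pred) is below 1.25).
    """
    pred_m = np.asarray(pred_m, dtype=np.float64)
    gt_m = np.asarray(gt_m, dtype=np.float64)
    valid = gt_m > 0  # ground-truth depth maps are typically sparse
    pred_v, gt_v = pred_m[valid], gt_m[valid]
    abs_rel = np.mean(np.abs(pred_v - gt_v) / gt_v)
    rmse = np.sqrt(np.mean((pred_v - gt_v) ** 2))
    ratio = np.maximum(pred_v / gt_v, gt_v / pred_v)
    delta_125 = np.mean(ratio < 1.25)
    return abs_rel, rmse, delta_125

# Made-up sample values; the zero marks a pixel without ground truth.
gt = np.array([10.0, 20.0, 40.0, 0.0])
pred = np.array([11.0, 18.0, 40.0, 5.0])
abs_rel, rmse, delta_125 = depth_metrics(pred, gt)
```

Lower values are better for the first two measures and higher is better for the third.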

Implications and Future Directions

Integrating deep depth prediction into monocular VO is a substantial step towards high-accuracy odometry without the complexity of stereo or multi-sensor setups. Because DVSO needs only a single camera at run time, it could lower the barrier to deploying accurate VO in cost-sensitive domains such as consumer electronics and low-cost robotics.

Moving forward, the adaptability and generalization of DVSO could be enhanced by exploring end-to-end fine-tuning techniques. Such advancements might allow the system to continually adapt to new environments and camera configurations, thereby improving its robustness and applicability across more diverse scenarios. Additionally, the insights gained from this research could inform developments in other domains where depth perception from monocular cues is critical, such as augmented reality and mobile robotics.

In conclusion, the paper shows that combining geometric and data-driven approaches pays off in visual odometry. By feeding deep depth predictions into DSO as virtual stereo measurements, DVSO is more accurate than previous monocular and deep-learning-based methods and comes close to stereo systems while using a single camera.
