Semi-Supervised Monocular Depth Estimation with Left-Right Consistency Using Deep Neural Network

Published 18 May 2019 in cs.CV, cs.AI, and cs.RO | (1905.07542v1)

Abstract: There has been tremendous research progress in estimating the depth of a scene from a monocular camera image. Existing methods for single-image depth prediction are exclusively based on deep neural networks, and their training can be unsupervised using stereo image pairs, supervised using LiDAR point clouds, or semi-supervised using both stereo and LiDAR. In general, semi-supervised training is preferred as it does not suffer from the weaknesses of either supervised training, resulting from the difference in the camera's and the LiDAR's field of view, or unsupervised training, resulting from the poor depth accuracy that can be recovered from a stereo pair. In this paper, we present our research in single-image depth prediction using semi-supervised training that outperforms the state-of-the-art. We achieve this through a loss function that explicitly exploits left-right consistency in a stereo reconstruction, which has not been adopted in previous semi-supervised training. In addition, we describe the correct use of ground truth depth derived from LiDAR that can significantly reduce prediction error. The performance of our depth prediction model is evaluated on popular datasets, and the importance of each aspect of our semi-supervised training approach is demonstrated through experimental results. Our deep neural network model has been made publicly available.


Authors (3)

Citations (54)


Summary

The paper presents a novel semi-supervised method for monocular depth estimation that improves accuracy using left-right consistency in the loss function and preprocessed annotated depth maps.
Evaluation on datasets like KITTI shows the method achieves state-of-the-art results, improving metrics such as absolute relative difference and RMSE compared to prior techniques.
This improved depth estimation has significant implications for applications in robotics, autonomous navigation, and augmented/virtual reality systems.

The paper addresses monocular depth estimation, a core problem in computer vision and robotics, using a deep neural network trained in a semi-supervised manner. It targets known limitations of purely supervised and purely unsupervised depth prediction and reports improved accuracy from adding a left-right consistency term to the loss function.

The authors describe the weaknesses of existing approaches. Supervised methods rely on ground truth derived from LiDAR, and because the camera and the LiDAR have different fields of view, parts of the image receive no depth supervision. Unsupervised methods train on stereo image pairs and can cover the whole image, but they are limited by the poor depth accuracy that can be recovered from a stereo pair. Semi-supervised methods combine both sources of supervision, yet according to the authors, previous semi-supervised training had not made use of left-right consistency.

The central contribution is a loss function that explicitly exploits left-right consistency in the stereo reconstruction, which the authors state had not been adopted in previous semi-supervised training for single-image depth prediction. In general, a left-right consistency term penalizes disagreement between the disparity map predicted for the left view and the one predicted for the right view; it does not minimize disparity itself. Enforcing this agreement is reported to make the predicted depth more accurate and robust.
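
To make the idea of a left-right consistency term concrete, here is a minimal, dependency-free sketch. It is not the authors' loss: the function names, the L1 penalty, and the sign convention (left pixel `x` maps to right pixel `x - d_left(x)`, as in rectified stereo) are assumptions chosen for illustration, and a real training setup would compute this on tensors inside a deep learning framework.

```python
def sample_row(row, x):
    """Linearly interpolate `row` at fractional position x, clamped to the row."""
    x = min(max(float(x), 0.0), len(row) - 1.0)
    x0 = int(x)                      # floor, since x >= 0 after clamping
    x1 = min(x0 + 1, len(row) - 1)
    frac = x - x0
    return (1.0 - frac) * row[x0] + frac * row[x1]


def left_right_consistency_loss(disp_left, disp_right):
    """Mean absolute disagreement between two disparity maps.

    Both maps are lists of rows holding positive disparities in pixels.
    Assumed convention: left pixel x corresponds to right pixel x - d_left(x),
    and right pixel x corresponds to left pixel x + d_right(x).
    """
    total, count = 0.0, 0
    for row_l, row_r in zip(disp_left, disp_right):
        for x in range(len(row_l)):
            # Right disparity looked up where the left pixel lands in the right view.
            total += abs(row_l[x] - sample_row(row_r, x - row_l[x]))
            # Left disparity looked up where the right pixel lands in the left view.
            total += abs(row_r[x] - sample_row(row_l, x + row_r[x]))
            count += 2
    return total / count
```

When both maps describe the same scene geometry the loss is zero; any mismatch between the two predictions raises it, which is the signal such a term contributes during training.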

The second contribution concerns how LiDAR-derived ground truth is used. Raw LiDAR data contains noisy artifacts, so the authors train on a preprocessed annotated depth map instead and report that this reduces prediction error. This addresses a limitation of prior works that relied on raw LiDAR data.
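
Because LiDAR-derived depth maps are sparse, a supervised term is normally evaluated only at pixels that carry a measurement. The sketch below illustrates that masking step; it is an assumption-laden illustration rather than the paper's loss. The function name, the L1 penalty, and the convention that a value of `0` marks a pixel without ground truth are all assumptions.

```python
def masked_l1_depth_loss(pred, ground_truth, invalid=0.0):
    """Mean absolute depth error over pixels that have ground truth.

    `pred` and `ground_truth` are flat sequences of depths in metres.
    Pixels whose ground-truth value is <= `invalid` are assumed to have
    no measurement and are skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, ground_truth) if g > invalid]
    if not pairs:
        return 0.0  # no supervised signal for this image
    return sum(abs(p - g) for p, g in pairs) / len(pairs)
```

For example, `masked_l1_depth_loss([10.0, 20.0, 30.0], [0.0, 22.0, 27.0])` ignores the first pixel and averages the errors of the other two. Swapping raw LiDAR projections for a cleaner annotated depth map changes only the `ground_truth` input, not the masking logic.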

The method is evaluated on popular datasets, including KITTI, where it is reported to outperform state-of-the-art techniques. Training uses both annotated depth maps and stereo images, while inference requires only a single monocular image.

The reported quantitative gains cover absolute relative difference, RMSE, and threshold accuracy ($\delta < 1.25$), with better results than many of the supervised, unsupervised, and semi-supervised techniques used for comparison.
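
For reference, the three metrics named above have standard definitions in the depth estimation literature, sketched below in plain Python. The function name and the flat-sequence input format are assumptions for illustration; the inputs are expected to contain only pixels with valid, positive ground truth and positive predictions.

```python
import math


def depth_metrics(pred, ground_truth, threshold=1.25):
    """Return (abs_rel, rmse, delta) for paired positive depth values.

    abs_rel: mean of |pred - gt| / gt
    rmse:    square root of the mean squared error
    delta:   fraction of pixels with max(pred / gt, gt / pred) < threshold
    """
    n = len(ground_truth)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, ground_truth)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, ground_truth)) / n)
    delta = sum(max(p / g, g / p) < threshold
                for p, g in zip(pred, ground_truth)) / n
    return abs_rel, rmse, delta
```

Lower is better for absolute relative difference and RMSE, while higher is better for the threshold accuracy.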

More accurate single-image depth estimation is relevant to several robotics tasks, such as autonomous navigation, robotic grasping, and 3D reconstruction. It also supports later work on models for environment perception and interaction. Augmented reality, virtual reality, and autonomous vehicles are further application areas that stand to benefit.

In conclusion, the paper contributes a semi-supervised training framework that yields more accurate depth prediction from a single image. The authors have made their model publicly available, which supports further research and reuse in monocular depth estimation.
