- The paper presents an unsupervised CNN framework that learns depth from single images using stereo pairs and inverse warping for reconstruction.
- It employs a modified AlexNet-style architecture with a coarse-to-fine refinement scheme, achieving competitive RMS and log RMS errors on the KITTI dataset.
- This method reduces dependence on annotated data, enabling scalable and adaptable depth estimation for real-world applications.
Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue
The paper "Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue" by Ravi Garg, Vijay Kumar B G, Gustavo Carneiro, and Ian Reid, introduces a novel unsupervised framework for learning a Convolutional Neural Network (CNN) to predict depth from a single image. The proposed methodology circumvents the need for pre-training or ground-truth depth annotations, which are essential in supervised learning approaches, leveraging stereo image pairs to generate training data through a novel autoencoder-inspired setup.
Approach and Methodology
The primary challenge addressed in this work is the dependence of CNN training on large-scale annotated datasets, which are not only resource-intensive to collect but often domain-specific, so the resulting models generalize poorly outside their immediate training context. This paper instead introduces an unsupervised model in which a convolutional encoder predicts depth and is trained using stereo pairs. Unlike a conventional autoencoder, the decoder here is replaced with a known geometric transformation: the inverse warp computed from the predicted depth and the known stereo displacement between the image pair.
The network operates as follows:
- Input Data: Pairs of images (source and target) with a known relative camera pose, such as rectified stereo pairs, are used.
- Depth Prediction: The convolutional encoder predicts the depth map from the source image.
- Reconstruction Loss: Using the predicted depth, the target image is inversely warped to reconstruct the source image. The reconstruction loss, measured as the photometric error between the reconstructed and the original source image, drives the training process.
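To make the reconstruction step concrete, here is a minimal PyTorch sketch of a warping-based photometric loss. It assumes a rectified horizontal stereo pair with known focal length and baseline; all names are illustrative, and differentiable bilinear sampling is used where the paper instead derives gradients via a first-order Taylor expansion of the warp:

```python
import torch
import torch.nn.functional as F

def photometric_loss(left, right, depth, focal, baseline):
    """Reconstruct the source (left) image by inverse-warping the
    target (right) image with the predicted depth, and penalise the
    photometric error of the reconstruction.

    left, right : (B, 3, H, W) rectified stereo pair
    depth       : (B, 1, H, W) depth predicted for the left image
    focal       : focal length in pixels (assumed known)
    baseline    : stereo baseline in metres (assumed known)
    """
    b, _, h, w = left.shape
    # Rectified-stereo geometry: disparity (pixels) = f * B / depth.
    disp = focal * baseline / depth.clamp(min=1e-3)

    # Base sampling grid in normalised [-1, 1] coordinates.
    ys = torch.linspace(-1.0, 1.0, h, device=left.device)
    xs = torch.linspace(-1.0, 1.0, w, device=left.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    grid = torch.stack((gx, gy), dim=-1).expand(b, h, w, 2).clone()

    # A left-image pixel at x corresponds to x - d in the right image.
    grid[..., 0] = grid[..., 0] - 2.0 * disp.squeeze(1) / (w - 1)

    # Differentiable bilinear sampling of the right image.
    warped = F.grid_sample(right, grid, align_corners=True)
    return F.mse_loss(warped, left)
```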
Additionally, a simple smoothness prior is introduced to resolve depth ambiguities in homogeneous, textureless regions of the scene, implemented as an L2 penalty on the spatial gradients of the predicted disparity.
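Under the same assumptions as the sketch above, such a smoothness term could look as follows (`disp` being the `(B, 1, H, W)` predicted disparity):

```python
def smoothness_loss(disp):
    # L2 penalty on horizontal and vertical disparity gradients;
    # discourages spurious depth edges in textureless regions.
    dx = disp[:, :, :, 1:] - disp[:, :, :, :-1]
    dy = disp[:, :, 1:, :] - disp[:, :, :-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()
```

The full objective would then combine the two terms, e.g. `photometric_loss + gamma * smoothness_loss` for some weight `gamma`; the paper's exact weighting may differ.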
Network Architecture
The network is based on a modified AlexNet-style architecture that moves from convolutional and pooling layers into a series of fully convolutional layers, using bilinear interpolation to upsample the predicted depth maps at several stages. Training is coarse-to-fine: successive stages build on the initial coarse prediction and refine it, allowing the network to capture finer details in the input imagery.
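A single refinement stage of such a coarse-to-fine scheme might be sketched as below; the residual formulation and channel counts are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineStage(nn.Module):
    """One coarse-to-fine stage: bilinearly upsample the coarser
    disparity and refine it using features from an earlier,
    higher-resolution layer of the encoder."""

    def __init__(self, skip_channels):
        super().__init__()
        self.refine = nn.Conv2d(skip_channels + 1, 1,
                                kernel_size=3, padding=1)

    def forward(self, coarse_disp, skip_feat):
        up = F.interpolate(coarse_disp, size=skip_feat.shape[-2:],
                           mode="bilinear", align_corners=True)
        # Predict a residual correction from the upsampled disparity
        # concatenated with the higher-resolution features.
        return up + self.refine(torch.cat([up, skip_feat], dim=1))
```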
Experimental Results
The experimental validation of this framework is performed on the KITTI dataset. Trained on less than half of the dataset without any ground-truth depth information, the network performs comparably to state-of-the-art supervised methods. Quantitative comparisons highlight the model's strengths:
- Root Mean Square (RMS) error: 5.104 m with data augmentation, competitive with supervised models.
- Log RMS error: 0.273, underscoring the model's ability to estimate depth accurately across the depth range.
- Relative errors: an absolute relative error of 0.169 and a squared relative error of 1.08.
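For reference, these error measures are computed as follows; a straightforward NumPy sketch (function name ours), applied to the valid ground-truth pixels:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard single-view depth error measures, as reported in the
    paper's KITTI comparison (pred, gt: valid depths in metres)."""
    rms     = np.sqrt(np.mean((pred - gt) ** 2))
    log_rms = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel  = np.mean((pred - gt) ** 2 / gt)
    return {"RMS": rms, "log RMS": log_rms,
            "abs rel": abs_rel, "sq rel": sq_rel}
```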
Furthermore, qualitative analysis of the predicted depth maps reveals that the model effectively preserves scene details and object boundaries, performing well even around dynamic elements such as moving objects and pedestrians.
Implications and Future Directions
The implications of this research are significant. The presented framework paves the way for more accessible, scalable, and adaptable visual learning systems by removing the dependency on annotated training data:
- Practical Utility: The ability to train depth prediction models from unlabelled stereo images substantially reduces the cost and effort associated with data collection.
- Adaptability: The model’s potential for in-situ and lifelong learning opens opportunities for continuous improvement in dynamic environments, thereby increasing robustness and performance over time.
Future work could integrate motion estimates from monocular SLAM systems to generalize this framework beyond stereo pairs, further advancing its applicability in diverse real-world scenarios. Additionally, exploring richer loss functions and refining the skip-connection architecture could yield further gains in depth accuracy and detail preservation.
In conclusion, Garg et al.'s contribution to unsupervised depth estimation offers a promising alternative to supervised methods, setting a new direction for research in unsupervised learning and scene understanding. The methodologies and outcomes of this paper have the potential to influence a wide range of applications in computer vision and autonomous navigation.