Towards real-time unsupervised monocular depth estimation on CPU (1806.11430v3)

Published 29 Jun 2018 in cs.CV and cs.RO

Abstract: Unsupervised depth estimation from a single image is a very attractive technique with several implications in robotics, autonomous navigation, augmented reality and so on. This topic represents a very challenging task, and the advent of deep learning has enabled tackling this problem with excellent results. However, these architectures are extremely deep and complex. Thus, real-time performance can be achieved only by leveraging power-hungry GPUs, which precludes inferring depth maps in application fields characterized by low-power constraints. To tackle this issue, in this paper we propose a novel architecture capable of quickly inferring an accurate depth map on a CPU, even of an embedded system, using a pyramid of features extracted from a single input image. Similarly to the state of the art, we train our network in an unsupervised manner, casting depth estimation as an image reconstruction problem. Extensive experimental results on the KITTI dataset show that, compared to the top-performing approach, our network has similar accuracy but much lower complexity (about 6% of the parameters), enabling it to infer a depth map for a KITTI image in about 1.7 s on the Raspberry Pi 3 and at more than 8 Hz on a standard CPU. Moreover, by trading accuracy for efficiency, our network can infer maps at about 2 Hz and 40 Hz respectively, while still being more accurate than most slower state-of-the-art methods. To the best of our knowledge, this is the first method enabling such performance on CPUs, paving the way for effective deployment of unsupervised monocular depth estimation even on embedded systems.

Citations (154)

Summary

  • The paper introduces PyD-Net, a novel pyramidal architecture that achieves accurate depth estimation on CPUs using approximately 6% of the parameters of standard CNNs.
  • The paper demonstrates PyD-Net’s practicality with runtimes of about 1.7 seconds per image on ARM CPUs and over 8 Hz on x86 systems, validated on the KITTI dataset.
  • The paper highlights the potential for deploying low-power, real-time depth estimation in embedded systems, expanding applications in robotics, autonomous navigation, and augmented reality.

Real-Time Unsupervised Monocular Depth Estimation on CPU

In recent years, unsupervised monocular depth estimation, particularly through deep learning, has become a prominent area of research due to its diverse potential applications in robotics, autonomous navigation, and augmented reality. This paper addresses a significant gap in this domain: the challenge of real-time processing in resource-constrained environments, such as CPUs, particularly those in embedded systems. The proposed solution is PyD-Net, a novel network architecture designed to infer accurate depth maps efficiently without depending on power-intensive GPUs.

The traditional obstacle for depth estimation models is the substantial complexity and computational cost of state-of-the-art Convolutional Neural Networks (CNNs), which largely restricts real-time performance to high-power GPUs. However, many applications, especially those with stringent power constraints (e.g., UAVs, wearable devices), require efficient CPU-based processing. PyD-Net emerges as a noteworthy solution with a significantly reduced computational footprint, using approximately 6% of the parameters of leading approaches while maintaining comparable accuracy. This efficiency is achieved through a pyramidal architecture that processes image features at multiple resolutions, refining the depth map progressively from coarse to fine levels.
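To make the coarse-to-fine design concrete, the following PyTorch sketch outlines a PyD-Net-style pyramid: a small encoder extracts features at progressively lower resolutions, and a lightweight estimator at each level refines the upsampled depth from the level below. The layer widths, the number of levels, and the module names here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
    )


class PyramidalDepthNet(nn.Module):
    """Illustrative PyD-Net-like pyramid; sizes are assumptions, not the paper's."""

    def __init__(self, levels=4, base_ch=16):
        super().__init__()
        # Feature pyramid: each encoder level halves the spatial resolution.
        chs = [3] + [base_ch * (2 ** i) for i in range(levels)]
        self.encoders = nn.ModuleList(
            conv_block(chs[i], chs[i + 1], stride=2) for i in range(levels)
        )
        # One small depth estimator per level; all but the coarsest also take
        # the upsampled depth from the level below (+1 input channel).
        self.decoders = nn.ModuleList(
            nn.Sequential(
                conv_block(chs[i + 1] + (0 if i == levels - 1 else 1), 32),
                nn.Conv2d(32, 1, kernel_size=3, padding=1),
            )
            for i in range(levels)
        )

    def forward(self, img, stop_level=0):
        feats = []
        x = img
        for enc in self.encoders:
            x = enc(x)
            feats.append(x)
        depth = None
        # Coarse-to-fine: estimate at the coarsest level, then upsample and
        # refine with each finer feature map.
        for i in reversed(range(stop_level, len(feats))):
            inp = feats[i]
            if depth is not None:
                up = F.interpolate(depth, scale_factor=2, mode="bilinear",
                                   align_corners=False)
                inp = torch.cat([inp, up], dim=1)
            depth = torch.sigmoid(self.decoders[i](inp))
        # The finest prediction is at half the input resolution; upsample as needed.
        return depth
```

Stopping the refinement at a coarser level (stop_level > 0) returns a lower-resolution map in less time, mirroring the accuracy-for-efficiency trade-off discussed below.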

The paper provides an extensive evaluation of the PyD-Net architecture on the KITTI dataset under unsupervised training conditions. It demonstrates that PyD-Net can generate depth maps with accuracy comparable to state-of-the-art models while using only a fraction of their execution time and memory. The network achieves a runtime of about 1.7 seconds per image on an ARM CPU (Raspberry Pi 3), compared to the more than 10 seconds required by traditional models, and runs at over 8 Hz on a standard x86 CPU. Furthermore, it can trade minor accuracy reductions for substantial efficiency gains, reaching approximately 2 Hz on the Raspberry Pi 3 and 40 Hz on a standard CPU.
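Concretely, casting depth estimation as an image reconstruction problem means predicting a disparity map, warping one view of a stereo pair into the other, and penalizing the photometric difference, so no ground-truth depth is needed. The sketch below shows a common form of this loss, in the spirit of Godard et al., whose training scheme the paper follows; the helper names, the 3x3 SSIM variant, and the 0.85 weighting are assumptions for illustration rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Lightweight SSIM over 3x3 local windows, returned as a distance map."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return ((1 - num / den) / 2).clamp(0, 1)


def warp_right_to_left(right, disp):
    """Reconstruct the left view by sampling the right image at x - disparity."""
    b, _, h, w = right.shape
    xs = torch.linspace(-1, 1, w, device=right.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=right.device).view(1, h, 1).expand(b, h, w)
    # disp is expressed in normalized image widths; the grid spans 2 units.
    grid = torch.stack([xs - 2 * disp.squeeze(1), ys], dim=-1)
    return F.grid_sample(right, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


def photometric_loss(left, right, disp, alpha=0.85):
    """Weighted SSIM + L1 discrepancy between the left image and its reconstruction."""
    recon = warp_right_to_left(right, disp)
    l1 = (left - recon).abs().mean()
    return alpha * ssim(left, recon).mean() + (1 - alpha) * l1
```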

The implications of the proposed framework are significant for both theoretical and practical domains. Practically, it enables the deployment of monocular depth estimation in low-power contexts that were previously impractical due to hardware constraints. Theoretically, it suggests the potential for further architectural innovations in deep learning models that embrace a pyramidal processing paradigm, reducing computational and memory burdens. The pyramidal feature extraction and multi-scale depth estimation adopt a strategy akin to pyramidal optical flow estimation in computer vision, reinforcing the versatility of pyramid-based architectures.

For future development, the paper points to deploying PyD-Net on specialized low-power vision processing units such as the Intel Movidius NCS, which could further widen its application scope in constrained environments and pave the way for sophisticated autonomy in embedded systems.
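As a rough guide to how such per-image CPU figures are obtained, the snippet below times the illustrative PyramidalDepthNet sketched earlier; the thread budget, the KITTI-like input size, and the warm-up policy are demonstration assumptions, not the authors' benchmarking protocol.

```python
import time
import torch

torch.set_num_threads(4)          # pin inference to a fixed CPU thread budget
model = PyramidalDepthNet().eval()
x = torch.randn(1, 3, 256, 512)   # KITTI-like input resolution

with torch.no_grad():
    for _ in range(5):            # warm-up iterations before timing
        model(x)
    t0 = time.perf_counter()
    runs = 20
    for _ in range(runs):
        model(x)
    dt = (time.perf_counter() - t0) / runs

print(f"{dt * 1000:.1f} ms/image ({1.0 / dt:.1f} Hz)")
```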

Overall, the research presented in this paper makes a compelling argument for designing specialized yet efficient deep learning architectures that adapt to constrained computational settings, thereby extending the reach and applicability of AI-driven vision systems beyond the confines of traditional high-power, GPU-reliant approaches.
