
Abstract

We propose HYBRIDDEPTH, a robust depth estimation pipeline that addresses the unique challenges of depth estimation for mobile AR, such as scale ambiguity, hardware heterogeneity, and generalizability. HYBRIDDEPTH leverages the camera features available on mobile devices. It effectively combines the scale accuracy inherent in Depth from Focus (DFF) methods with the generalization capabilities enabled by strong single-image depth priors. By utilizing the focal planes of a mobile camera, our approach accurately captures depth values from focused pixels and uses these values to compute the scale and shift parameters that transform relative depths into metric depths. We test our pipeline as an end-to-end system, with a newly developed mobile client that captures focal stacks, which are then sent to a GPU-powered server for depth estimation. Through comprehensive quantitative and qualitative analyses, we demonstrate that HYBRIDDEPTH not only outperforms state-of-the-art (SOTA) models on common datasets (DDFF12, NYU Depth v2) and a real-world AR dataset, ARKitScenes, but also exhibits strong zero-shot generalization. For example, HYBRIDDEPTH trained on NYU Depth v2 achieves performance on DDFF12 comparable to existing models trained directly on DDFF12, and it outperforms all SOTA models in zero-shot performance on the ARKitScenes dataset. Additionally, we conduct a qualitative comparison between our model and the ARCore framework, showing that our model's depth maps are significantly more accurate in terms of structural detail and metric accuracy. The source code of this project is available on GitHub.

Figure: HybridDepth's three-stage process: focal stack capture, scale and shift calculation, and depth map refinement.

Overview

  • HybridDepth presents a robust pipeline for mobile augmented reality (AR) depth estimation, leveraging depth from focus (DFF) and single-image depth priors to overcome scale ambiguity, hardware heterogeneity, and generalizability challenges.

  • The end-to-end system integrates focal stack information with relative depth estimation, outperforming state-of-the-art methods on benchmarks like NYU Depth v2, DDFF12, and ARKitScenes while demonstrating superior zero-shot performance.

  • HybridDepth's efficient and accurate depth maps make it suitable for real-time mobile applications, achieving shorter inference times and a smaller model size than existing models, and eliminating the need for specialized hardware like LiDAR or ToF sensors.

HybridDepth: Robust Depth Fusion for Mobile AR by Leveraging Depth from Focus and Single-Image Priors

Introduction

HybridDepth presents a robust pipeline for depth estimation explicitly designed for mobile augmented reality (AR) applications. The system addresses typical challenges inherent in depth estimation for mobile AR, including scale ambiguity, hardware heterogeneity, and generalizability. By leveraging both depth from focus (DFF) and single-image depth priors, HybridDepth achieves accurate and generalizable depth maps, outperforming state-of-the-art (SOTA) models across multiple benchmarks.

Key Contributions

HybridDepth's primary contributions are threefold:

  1. End-to-End Pipeline: HybridDepth integrates focal stack information with relative depth estimation to achieve robust metric depth maps, ensuring it can operate with the limited hardware capabilities of mobile devices.
  2. State-of-the-Art Performance: HybridDepth outperforms existing SOTA methods, including recent advancements such as DepthAnything, on datasets like NYU Depth v2, DDFF12, and ARKitScenes.
  3. Generalization: The model demonstrates superior zero-shot performance, particularly on AR-specific datasets, highlighting its robustness across diverse and unforeseen environments.

Methodology

HybridDepth employs a three-phase approach:

  1. Capture Relative and Metric Depth: Two modules are utilized: a single-image relative depth estimator and a DFF metric depth estimator. The single-image model lays the structural foundation while the DFF estimator provides the necessary metric depth.
  2. Least-Squares Fitting: This phase aligns the relative depth estimates with the metric depths derived from the DFF model by fitting global scale and shift parameters via least squares (see the sketch after this list).
  3. Refinement Layer: A final deep learning-based refinement layer corrects and fine-tunes the intermediate depth map using a locally-adaptive scale map, derived from the DFF branch and the globally scaled depth map.
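The global alignment in stage 2 reduces to a closed-form least-squares fit of a scale s and shift t that map the relative depth map onto the DFF metric depths. Below is a minimal NumPy sketch of that fit; the function name, the confidence mask over reliable focus pixels, and the placeholder inputs are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def align_scale_shift(rel_depth, dff_depth, conf_mask):
    """Globally align a relative depth map to metric scale.

    Solves min over (s, t) of || s * rel_depth + t - dff_depth ||^2,
    restricted to pixels where the DFF estimate is considered reliable
    (conf_mask == True).
    """
    d_rel = rel_depth[conf_mask].ravel()
    d_met = dff_depth[conf_mask].ravel()

    # Closed-form least squares: stack [d_rel, 1] and solve for [s, t].
    A = np.stack([d_rel, np.ones_like(d_rel)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, d_met, rcond=None)

    # Apply the global scale and shift to the full relative depth map.
    return s * rel_depth + t

# Hypothetical usage with placeholder inputs:
rel = np.random.rand(480, 640)                              # relative depth from the single-image prior
dff = 2.0 * rel + 0.5 + 0.05 * np.random.randn(480, 640)    # stand-in DFF metric depth
mask = np.random.rand(480, 640) > 0.3                       # pixels with a reliable focus cue
metric = align_scale_shift(rel, dff, mask)
```

In the full pipeline this globally scaled map is only an intermediate result: the stage-3 refinement layer then applies a learned, locally-adaptive correction, since a single global scale and shift cannot remove spatially varying scale errors.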

Results and Evaluation

NYU Depth v2 and DDFF12: HybridDepth achieves a 13% improvement in RMSE and AbsRel metrics on the NYU Depth v2 dataset compared to recent approaches. On DDFF12, it demonstrates superior performance, showcasing its utility even on data with large texture-less areas.
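For context, RMSE and AbsRel are the standard metric-depth error measures used by these benchmarks. A minimal sketch of their usual definitions follows; masking out pixels without valid ground truth is an assumption about the evaluation protocol, not something stated in the summary.

```python
import numpy as np

def rmse(pred, gt):
    """Root mean squared error over valid ground-truth pixels."""
    valid = gt > 0
    return np.sqrt(np.mean((pred[valid] - gt[valid]) ** 2))

def abs_rel(pred, gt):
    """Mean absolute relative error: |pred - gt| / gt over valid pixels."""
    valid = gt > 0
    return np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid])
```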

ARKitScenes: In both zero-shot and trained evaluations, HybridDepth achieves unprecedented performance, with an RMSE of 0.367 and 0.254 in zero-shot and trained settings, respectively. This highlights the model's ability to handle complex, AR-specific scenarios effectively.

Model Efficiency: When compared to SOTA models like ZoeDepth and DepthAnything, HybridDepth offers significantly shorter inference times and a smaller model size, making it highly suitable for real-time mobile applications.

Implications and Future Work

Practical Implications: The primary implication of HybridDepth lies in its enhanced deployment feasibility on mobile devices due to its ability to operate with only standard cameras. This characteristic obviates the need for specialized hardware such as LiDAR or ToF sensors, broadening its applicability across varied device ecosystems.

Theoretical Implications: By combining DFF and single-image priors, HybridDepth presents a novel methodological framework that can be extended to other domains requiring robust depth estimation. It sets a benchmark for integrating multi-source depth cues while maintaining computational efficiency and generalizability.

Future Work: Although HybridDepth shows great promise, further improvements can target the DFF branch to mitigate scaling errors, particularly for pixels lacking optimal focus data. More selective extraction of depth values in the DFF process could enhance overall accuracy and reliability.
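To make the idea of selective depth-value extraction concrete, a classical depth-from-focus formulation keeps only pixels whose focus response across the stack is strong. HybridDepth's DFF branch is learned, so the sketch below, with its Laplacian focus measure, confidence threshold, and function names, is purely an illustrative assumption about how such selection could work, not the paper's method.

```python
import numpy as np
from scipy.ndimage import laplace

def dff_depth_with_confidence(focal_stack, focus_dists, conf_thresh=0.02):
    """Classical depth-from-focus sketch: pick, per pixel, the focal slice
    with the strongest focus response, and keep only confident pixels.

    focal_stack: (N, H, W) grayscale images captured at N focal planes.
    focus_dists: (N,) metric focus distance of each slice.
    """
    focus_dists = np.asarray(focus_dists)

    # Focus measure: magnitude of the Laplacian response per slice.
    focus_measure = np.abs(np.stack([laplace(img) for img in focal_stack]))

    best_slice = focus_measure.argmax(axis=0)      # (H, W) index of sharpest slice
    best_response = focus_measure.max(axis=0)      # (H, W) strength of that response

    depth = focus_dists[best_slice]                # map slice index -> metric distance
    conf_mask = best_response > conf_thresh        # drop pixels with weak focus cues
    return depth, conf_mask
```

Pixels that fail such a confidence test could simply be excluded from the least-squares fit in stage 2, limiting the scaling errors noted above.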

Conclusion

HybridDepth marks a significant advancement in depth estimation for mobile AR, leveraging the strengths of DFF and relative depth estimation to produce robust, accurate, and efficient depth maps. Its strong numerical results across multiple datasets and its superior performance in real-world applications underscore its potential as a practical and scalable solution for the future of mobile AR experiences.
