
Monocular Depth Estimation

Updated 4 March 2026
  • Monocular Depth Estimation is the process of deriving dense per-pixel depth maps from a single RGB image, relying on deep learning to overcome inherent scale ambiguity.
  • Modern methods leverage encoder-decoder architectures with CNNs and transformers, utilizing scale-invariant losses and multi-scale features for refined predictions.
  • Applications include robotics, AR/VR, autonomous driving, and 3D reconstruction, with ongoing research in domain adaptation, interpretability, and robustness.

Monocular depth estimation (MDE) is the task of inferring a dense per-pixel depth map, typically in metric units, from a single monocular RGB image. Unlike stereo or multi-view approaches, MDE seeks to recover spatial scene structure from a single viewpoint, making the problem fundamentally ill-posed due to inherent scale ambiguity and lack of explicit geometric cues. Deep learning methods, especially convolutional neural networks (CNNs) and vision transformers, now dominate the field, achieving high accuracy on benchmarks and enabling MDE's use for robotics, AR/VR, autonomous driving, and large-scale 3D reconstruction. This article surveys key methodological advances, interpretability, domain adaptation, efficiency, robustness, and open challenges in MDE.

1. Fundamental Challenges and Methodological Axes

The central challenge of MDE is scale ambiguity: a monocular image admits infinitely many 3D interpretations related by scale transforms, as the projective geometry discards global depth. Methods are generally categorized into supervised learning (requiring absolute scale ground truth), self-supervised learning (using view synthesis losses), and metric depth estimation frameworks that seek to recover absolute scale even in the absence of direct metric supervision (Zhang, 21 Jan 2025). Classic stereo and multi-frame geometry exploit known baselines or motion to resolve scale, but monocular settings must rely on learned priors, known camera parameters, or geometric constraints (e.g., planar-parallax, road-plane geometry) (Elazab et al., 2024).
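
The scale ambiguity can be made concrete with a toy pinhole-projection check: scaling every scene point (and hence every depth) by a constant factor leaves the projected pixels unchanged, so no image evidence can pin down absolute scale. The focal length and principal point below are arbitrary illustrative values.

```python
# Pinhole projection: a scene scaled by any factor s projects to exactly
# the same pixels, which is why a single image cannot fix absolute scale.
# Camera intrinsics (f, cx, cy) are arbitrary illustrative values.

def project(point, f=500.0, cx=320.0, cy=240.0):
    """Project a 3D camera-frame point (X, Y, Z) to pixel (u, v)."""
    X, Y, Z = point
    return (f * X / Z + cx, f * Y / Z + cy)

scene = [(1.0, 0.5, 4.0), (-2.0, 1.0, 10.0)]
for s in (0.5, 2.0, 7.3):
    scaled = [(s * X, s * Y, s * Z) for X, Y, Z in scene]
    for p, q in zip(scene, scaled):
        u0, v0 = project(p)
        u1, v1 = project(q)
        # Identical pixels regardless of the global scale s.
        assert abs(u0 - u1) < 1e-9 and abs(v0 - v1) < 1e-9
```

Stereo rigs break this degeneracy because a known baseline fixes s; a monocular network must instead learn s from priors such as typical object sizes.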

Recent approaches integrate additional cues such as semantic segmentation, language-derived priors, or biological heuristics (object size, relative position) to inject external structure and bias the learning process (Auty et al., 2024, Auty et al., 2022). Metric depth models—such as Metric3D and ZoeDepth—learn explicit scale either by calibrating to canonical space or disentangling camera and scene representations, often using mixed or synthetic datasets for broad generalization (Zhang, 21 Jan 2025).

2. Deep Architectures and Loss Functions

State-of-the-art MDE models employ a variety of encoder-decoder architectures, often including skip connections and multi-scale features. CNNs with attention modules (e.g., CBAM, Squeeze-and-Excitation) or transformer backbones are standard for context aggregation (Litvynchuk et al., 26 Sep 2025, Papa et al., 2024, Agarwal et al., 2022). Patch-based strategies refine global predictions with high-resolution local corrections for enhanced detail (Zhang, 21 Jan 2025).

Loss functions are tailored to the task's ill-posed nature:

  • Scale-invariant (log) loss:

$$L_{\mathrm{si}} = \frac{1}{n}\sum_{i=1}^n (\log d_i - \log d^*_i)^2 - \frac{1}{n^2}\left(\sum_{i=1}^n (\log d_i - \log d^*_i)\right)^2$$

is fundamental for relative depth (Zhang, 21 Jan 2025, Gurram et al., 2023).
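
A minimal implementation of this loss makes its key property easy to verify: because the second term subtracts the squared mean of the log-errors, multiplying every prediction by a constant leaves the loss unchanged. The depth values below are illustrative.

```python
import math

def scale_invariant_loss(pred, gt):
    """Scale-invariant log loss: mean(g^2) - mean(g)^2,
    where g_i = log d_i - log d*_i. Adding a constant to all g_i
    (i.e., rescaling all predictions) does not change the value."""
    g = [math.log(d) - math.log(ds) for d, ds in zip(pred, gt)]
    n = len(g)
    return sum(x * x for x in g) / n - (sum(g) / n) ** 2

gt = [1.0, 2.0, 4.0, 8.0]
pred = [1.1, 1.9, 4.2, 7.5]
base = scale_invariant_loss(pred, gt)
# Rescaling every prediction by 3x leaves the loss unchanged:
scaled = scale_invariant_loss([3.0 * d for d in pred], gt)
assert abs(base - scaled) < 1e-9
```

This is why models trained with it recover relative structure well but need an extra mechanism (calibration, metric heads, known intrinsics) to produce metric depth.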

Advanced decoders (e.g., bimodal density heads (Litvynchuk et al., 26 Sep 2025)) allow estimation of multi-modal depth distributions per pixel, supporting sharp boundary delineation.

3. Interpretability, Semantic and Language Priors

MDE networks historically act as black boxes, with few insights into how internal representations encode geometric reasoning. Recent work introduces explicit interpretability metrics, e.g., depth selectivity: given a hidden unit/channel, its average activation is measured over discrete depth bins, and a selectivity score is computed to quantify how narrowly a unit responds to specific depth ranges (You et al., 2021). Incorporating a depth selectivity regularizer during training (via fixed bin assignments) yields models in which hidden units systematically specialize, interpreted as “depth band detectors.” Notably, this interpretability boost is achieved without sacrificing, and sometimes slightly improving, accuracy.
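
The selectivity computation can be sketched as follows. Pixels are grouped into discrete depth bins, the unit's mean activation is computed per bin, and the preferred bin is contrasted against the rest; a score near 1 means the unit responds almost exclusively to one depth range. The exact index in You et al. (2021) may differ in detail; this follows a common (peak − rest) / (peak + rest) contrast form, and all numbers are illustrative.

```python
# Hedged sketch of a depth-selectivity score for a single hidden unit.

def depth_selectivity(activations, depths, bin_edges, eps=1e-12):
    n_bins = len(bin_edges) - 1
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for a, d in zip(activations, depths):
        for b in range(n_bins):
            if bin_edges[b] <= d < bin_edges[b + 1]:
                sums[b] += a
                counts[b] += 1
                break
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    peak_idx = means.index(max(means))
    rest = [m for i, m in enumerate(means) if i != peak_idx]
    rest_mean = sum(rest) / len(rest)
    return (means[peak_idx] - rest_mean) / (means[peak_idx] + rest_mean + eps)

bins = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
depths = [0.5, 1.5, 3.0, 5.0, 7.0, 9.0]
# A unit that fires almost only for near depths (< 2 m): high selectivity.
selective = depth_selectivity([1.0, 0.9, 0.1, 0.0, 0.05, 0.0], depths, bins)
# A unit that fires uniformly at every depth: near-zero selectivity.
flat = depth_selectivity([1.0] * 6, depths, bins)
assert selective > 0.9 and abs(flat) < 1e-6
```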

Semantic and language-derived priors are also injected into MDE pipelines to mitigate ill-posedness. For example, pretrained language models such as BERT encode distributions of likely object depths and can be used to supply per-pixel “depth hints” via shallow neural networks, provided along with RGB inputs to the depth predictor (Auty et al., 2024). Similar strategies leverage GloVe embeddings and manually estimated physical sizes to supply class-level size priors (Auty et al., 2022). These cues exploit biological vision heuristics such as relative size, familiar size, and object semantics, and improve boundary sharpness and robustness, particularly in data-limited or cross-domain settings.

4. Domain Adaptation and Test-Time Robustness

MDE models frequently fail under domain shift, as depth statistics vary markedly across environments, weather, sensor calibration, or lighting. Recent strategies mitigate this via:

  • Test-time adaptation (TTA): PITTA updates BatchNorm parameters in a pre-trained depth network online, guided by instance-aware masking and edge consistency losses computed from panoptic segmentations and image/depth edges—without requiring camera pose or extrinsics (Sung et al., 7 Nov 2025). Adaptation steps are lightweight and avoid catastrophic forgetting, yielding 10–15% reductions in AbsRel on automotive datasets.
  • On-device learning (ODL): In MCU-constrained IoT settings, a secondary sensor provides sparse or noisy depth pseudo-labels for local fine-tuning of compact models, with innovations such as memory-driven sparse updates to facilitate gradient-based training under tight RAM limits (Nadalini et al., 26 Nov 2025).
  • Domain-invariant feature learning: Medical applications (e.g., endoscopy) utilize adversarial and directional feature alignment in shared latent spaces to ensure consistency between synthetic-style and real features, achieving substantial gain in absolute and relative metrics (Li et al., 4 Nov 2025).
  • Self-supervised and virtual-domain mixing: Dual-branch learning with virtual-world (synthetic) and real-world (SfM-derived) supervision, coupled with feature-level adversarial adaptation, closes the domain gap and recovers metric scale without LiDAR or stereo (Gurram et al., 2021, Elazab et al., 2024).
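
The core intuition behind BatchNorm-based test-time adaptation can be sketched in a few lines: instead of normalizing test features with training-time running statistics, recompute mean and variance from the incoming test batch so the normalized distribution again matches what downstream layers saw during training. This is only the statistics-update half of the idea; PITTA additionally learns the affine parameters online with masking and edge losses, which is omitted here. All values are illustrative.

```python
import math

def batchnorm(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a 1-D feature list with given statistics."""
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]

def batch_stats(x):
    m = sum(x) / len(x)
    return m, sum((v - m) ** 2 for v in x) / len(x)

# Training-domain statistics (hypothetical):
train_mean, train_var = 0.0, 1.0
# Shifted test-domain features, e.g. after a lighting change:
test_feats = [4.0 + 0.5 * v for v in (-1.0, -0.5, 0.0, 0.5, 1.0)]

stale = batchnorm(test_feats, train_mean, train_var)  # badly off-center
m, v = batch_stats(test_feats)
adapted = batchnorm(test_feats, m, v)                 # re-centered at zero
assert abs(sum(adapted) / len(adapted)) < 1e-9
```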

Test-time attacks on MDE, notably physical lens-based adversarial attacks, exploit the camera imaging model to systematically distort estimated depth, raising new security concerns in autonomous navigation (Zhou et al., 2024). Robustness to such attacks is currently addressed via sensor fusion and blur-detection heuristics.

5. Evaluation Protocols, Metrics, and Benchmarks

Multiple error metrics are standard in MDE evaluation:

  • Absolute Relative Error (AbsRel): $\frac{1}{n}\sum_{i=1}^n \frac{|d_i^* - d_i|}{d_i^*}$, the most reliable predictor of downstream task (e.g., 3D object detection) accuracy (Gurram et al., 2023).
  • RMSE, SqRel, RMSE(log): measure absolute and log-space discrepancies.
  • Threshold accuracy ($\delta$): percent of predictions within a given multiplicative factor of ground truth (typically $1.25$, $1.25^2$, $1.25^3$).
  • Scale Drift: for video or multi-frame settings, measures frame-to-frame metric consistency.
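
As a concrete reference, the per-image metrics above can be computed directly from paired prediction/ground-truth depths (the numbers below are illustrative):

```python
import math

def mde_metrics(pred, gt):
    """Standard MDE error metrics over paired depth values."""
    n = len(pred)
    abs_rel = sum(abs(g - p) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((g - p) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((g - p) ** 2 for p, g in zip(pred, gt)) / n)
    rmse_log = math.sqrt(sum((math.log(g) - math.log(p)) ** 2
                             for p, g in zip(pred, gt)) / n)

    # Threshold accuracy: fraction with max(p/g, g/p) below the threshold.
    def delta(thr):
        return sum(max(p / g, g / p) < thr for p, g in zip(pred, gt)) / n

    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse,
            "RMSElog": rmse_log,
            "d1": delta(1.25), "d2": delta(1.25 ** 2), "d3": delta(1.25 ** 3)}

gt = [2.0, 4.0, 8.0, 16.0]
pred = [2.2, 3.6, 8.8, 12.0]
m = mde_metrics(pred, gt)
# Three of four predictions are within 10% of ground truth; the last
# (12 vs 16 m) misses the 1.25 threshold but passes 1.25^2.
assert m["d1"] == 0.75 and m["d2"] == 1.0
```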

Leading benchmarks include NYU Depth v2 (indoor), KITTI and Cityscapes (driving, LiDAR ground truth), MegaDepth (Internet images, relative depth), and custom datasets for wildlife or endoscopy (Zhang, 21 Jan 2025, Niccoli et al., 6 Oct 2025, Li et al., 4 Nov 2025). Wildlife-specific benchmarks highlight that median-based estimation of object depth from bounding boxes is more robust to outliers (e.g., foliage), and that scale-aware foundation models (e.g., DepthAnything v2) achieve MAE < 0.5 m in outdoor conditions with minimal tuning (Niccoli et al., 6 Oct 2025).
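
The robustness of the median estimator is easy to see on a toy depth map: a few foliage pixels in front of the object barely move the median, while they drag the mean well below the true object depth. The depth map and box below are illustrative.

```python
from statistics import median

def object_depth(depth_map, box):
    """Median predicted depth inside a bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    vals = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return median(vals)

# 4x4 toy depth map: object at ~5 m, two foliage pixels at 0.4 m,
# background at 9 m.
depth_map = [
    [5.1, 5.0, 4.9, 9.0],
    [0.4, 5.0, 5.1, 9.0],
    [5.0, 0.4, 5.0, 9.0],
    [9.0, 9.0, 9.0, 9.0],
]
box = (0, 0, 3, 3)  # covers the 3x3 object region

d_med = object_depth(depth_map, box)
vals = [depth_map[y][x] for y in range(0, 3) for x in range(0, 3)]
d_mean = sum(vals) / len(vals)

assert d_med == 5.0    # median resists the two 0.4 m outliers
assert d_mean < 4.1    # mean is pulled far from the true object depth
```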

6. Computational Efficiency, Scalability, and Embedded Deployment

Modern MDE networks balance computational cost with accuracy and detail:

  • Mobile ViT hybrids (e.g., METER): Alternating MobileNetV2 convolutions and lightweight transformer blocks enable low-power inference (< 4 GB memory, ~16–25 fps) on embedded hardware without major loss in accuracy (Papa et al., 2024).
  • Bimodal density heads: Mixture models per pixel allow crisp boundaries while maintaining low parameter count, as in EfficientDepth, which combines a MiT-B5 transformer encoder with a lightweight UNet decoder for 0.055 s per $736\times736$ image (18 FPS) and state-of-the-art accuracy (Litvynchuk et al., 26 Sep 2025).
  • On-device learning: μPyD-Net demonstrates a full sub-0.1M-parameter pipeline, including local fine-tuning, trained in under 20 minutes within a sub-300 mW power envelope on microcontrollers (Nadalini et al., 26 Nov 2025).
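
Why a bimodal head keeps boundaries sharp can be shown with a single boundary pixel: depth there is genuinely two-valued (foreground vs. background), so a unimodal regressor that outputs the mixture mean produces a smeared in-between depth, while a two-mode head can snap to the dominant mode. The parameterization below is a toy illustration, not EfficientDepth's actual head.

```python
def unimodal_depth(w1, mu1, w2, mu2):
    """Mixture mean: averages across an occlusion edge (blurs boundaries)."""
    return w1 * mu1 + w2 * mu2

def bimodal_depth(w1, mu1, w2, mu2):
    """Dominant mode: commits to one surface (keeps boundaries crisp)."""
    return mu1 if w1 >= w2 else mu2

# Boundary pixel: 60% foreground at 2 m, 40% background at 10 m.
blurred = unimodal_depth(0.6, 2.0, 0.4, 10.0)  # 5.2 m: belongs to neither
sharp = bimodal_depth(0.6, 2.0, 0.4, 10.0)     # 2.0 m: foreground surface
assert abs(blurred - 5.2) < 1e-9
assert sharp == 2.0
```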

Patch-based refiners and pseudo-labeling via “SimpleBoost” or similar techniques enable high-resolution predictions from low-capacity models, crucial for edge deployment (Zhang, 21 Jan 2025).

7. Open Challenges and Future Directions

Despite progress, significant challenges persist:

  • Generalization and robustness: Domain adaptation across disparate scenes remains imperfect; further work on dataset diversity, generative modeling (e.g., diffusion priors), and continual adaptation is ongoing (Zhang, 21 Jan 2025, Litvynchuk et al., 26 Sep 2025).
  • Boundary blurring: Strategies such as edge- and normal-based losses, patch-based structure refinement, and generative diffusion methods improve fine geometry but require further integration for consistent metric predictions.
  • Metric scale recovery: Camera-agnostic learning, planar-parallax geometry, and unsupervised scale calibration are increasingly deployed for direct metric depth estimation, but their effectiveness outside structured domains (roads, flat planes) remains limited (Elazab et al., 2024).
  • Interpretability and trustworthiness: Enforced depth selectivity and explicit cue injection open avenues for transparent, trustable systems, especially in high-stakes use cases (You et al., 2021).
  • Multi-modal and multi-task integration: Combining monocular cues with sparse LiDAR or semantic priors, and joint learning across depth, semantics, normals, and language, promises more robust and semantically enriched depth representations (Quercia et al., 22 Jan 2025, Auty et al., 2024).
  • Real-world attack resilience: Physical and digital attacks on camera systems necessitate research into adversarial defense, sensor fusion, and self-diagnosing models (Zhou et al., 2024).

Continued advances in architectural innovation, curriculum learning, auxiliary-task training, and large-scale pseudo-labeling are driving MDE towards robust, scalable, and interpretable real-world deployment across domains (Zhang, 21 Jan 2025, Sung et al., 7 Nov 2025, Quercia et al., 22 Jan 2025).
