MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
Abstract: Monocular 3D object detection has long been a challenging task in autonomous driving. Most existing methods follow conventional 2D detectors to first localize object centers and then predict 3D attributes from neighboring features. However, local visual features alone are insufficient for understanding scene-level 3D spatial structures, and they ignore long-range inter-object depth relations. In this paper, we introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR. We modify the vanilla transformer to be depth-aware and guide the whole detection process by contextual depth cues. Specifically, alongside the visual encoder that captures object appearances, we predict a foreground depth map and specialize a depth encoder to extract non-local depth embeddings. We then formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions. In this way, each object query estimates its 3D attributes adaptively from depth-guided regions of the image and is no longer constrained to local visual features. On the KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations. Moreover, our depth-guided modules can be plugged into multi-view 3D object detectors on the nuScenes dataset to enhance them, demonstrating strong generalization capacity. Code is available at https://github.com/ZrrSkywalker/MonoDETR.
Explain it Like I'm 14
Overview
This paper introduces MonoDETR, a new AI method that finds 3D objects (like cars) using just one camera image. Think of trying to tell how far away things are with one eye closed—that’s hard. MonoDETR helps the computer “feel” depth (distance) from a single picture by guiding its attention with a learned map of what’s near and far. This makes 3D detection more accurate without needing extra sensors like LiDAR.
What Questions Does the Paper Try to Answer?
Here are the main questions the authors set out to answer:
- How can we make a computer understand 3D space (distances and sizes) from a single photo?
- Can we use “depth clues” from the image to guide a modern detection model so it looks in the right places?
- Can this approach work well without extra depth sensors or heavy extra data?
- Will this idea also help when there are multiple cameras around a car?
How Did They Do It? (Methods Explained Simply)
MonoDETR uses a special neural network called a Transformer—a model that’s great at paying attention to the most important parts of data. It’s based on DETR, a popular detection framework that treats finding objects like asking questions about where they might be.
To make that work for 3D from one image, MonoDETR adds “depth guidance.” Here’s the big idea:
- Two “teachers” look at the same image:
- A visual teacher that focuses on how things look (colors, shapes, textures).
- A depth teacher that tries to estimate how far things are (near vs. far).
- A simple depth predictor creates a “foreground depth map.” Picture this like a heatmap that tells the network which object areas are closer or farther. It doesn’t need perfect depth for every pixel; it uses bucketed distance labels (like putting distances into “bins” or “buckets”) for the objects. This avoids needing full, expensive depth data.
- The Transformer then uses "object queries," which are like little detectives that each search for a possible object. Each query:
  1. Looks at the depth map first to figure out where depth hints suggest an object might be (depth cross-attention).
  2. Talks to the other queries to avoid duplicates and share clues (self-attention).
  3. Looks at the image's visual features to recognize what the object is (visual cross-attention).
Analogy: Imagine you’re looking at a busy street photo. First, you scan for areas that “feel” close or far (depth hints), then you discuss with friends to avoid all pointing at the same car, and finally you zoom in to confirm “yep, that’s a car and it’s about this big and this far.”
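The three attention steps above can be sketched with plain attention math. This is a toy NumPy illustration, not the authors' implementation (MonoDETR uses full multi-head transformer layers); all names and sizes here are made up for the example:

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention: each query focuses on the
    most relevant keys and gathers a weighted mix of their values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> the attention "spotlight"
    return weights @ values

rng = np.random.default_rng(0)
num_queries, num_pixels, dim = 4, 16, 8             # 4 object queries, 16 image locations

queries = rng.normal(size=(num_queries, dim))       # learnable object queries ("detectives")
depth_feats = rng.normal(size=(num_pixels, dim))    # depth encoder output, one vector per location
visual_feats = rng.normal(size=(num_pixels, dim))   # visual encoder output, one vector per location

# 1) Depth cross-attention: queries gather depth cues from the scene.
queries = queries + attention(queries, depth_feats, depth_feats)
# 2) Self-attention: queries talk to each other to share clues and avoid duplicates.
queries = queries + attention(queries, queries, queries)
# 3) Visual cross-attention: queries look at appearance features to recognize objects.
queries = queries + attention(queries, visual_feats, visual_feats)

print(queries.shape)  # each query is now a depth- and appearance-aware object candidate
```

Each query ends up as one vector that a small prediction head could turn into a 3D box (class, size, depth, orientation); the residual additions mirror how transformer layers accumulate information step by step.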
Key terms in everyday language:
- Depth map: a guide that tells the model which parts of the image are likely closer or farther.
- Attention: a spotlight the model uses to focus on the most important image parts.
- Queries: small search agents that try to find and describe one object each.
- Bins: buckets for distance (e.g., 0–5m, 5–10m, etc.) so the model doesn’t need exact depth for every pixel.
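To make the "bins" idea concrete, here is a minimal sketch of uniform depth binning. It is purely illustrative (monocular detectors often use non-uniform bin widths; the exact scheme, ranges, and function names below are assumptions, not the paper's code):

```python
def depth_to_bin(depth_m, max_depth=60.0, num_bins=12):
    """Map a continuous depth in meters to a bucket index.

    With max_depth=60 and num_bins=12, bin 0 covers [0, 5) m,
    bin 1 covers [5, 10) m, and so on. Out-of-range depths are
    clamped to the first or last bin.
    """
    width = max_depth / num_bins        # 5 m per bucket in this setup
    idx = int(depth_m // width)
    return min(max(idx, 0), num_bins - 1)

def bin_to_depth(idx, max_depth=60.0, num_bins=12):
    """Recover a representative depth for a bucket: its center."""
    width = max_depth / num_bins
    return (idx + 0.5) * width

print(depth_to_bin(3.2))    # 0  -> the "0-5 m" bucket
print(depth_to_bin(23.0))   # 4  -> the "20-25 m" bucket
print(bin_to_depth(4))      # 22.5, the center of the "20-25 m" bucket
```

Predicting a bucket index turns depth estimation into a classification problem, which is why the model can learn useful depth cues from sparse object-level labels instead of a dense per-pixel depth map.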
MonoDETR trains end-to-end, meaning it learns everything together. It doesn’t need hand-tuned rules like NMS (a cleanup step many detectors use) or extra sensors like LiDAR.
What Did They Find? (Main Results)
The authors tested MonoDETR on standard self-driving datasets:
- KITTI (single front camera):
- It reached state-of-the-art performance for detecting cars in 3D using only one image.
- It beat the second-best method by about:
- +2.53% (Easy), +1.08% (Moderate), +0.85% (Hard) in 3D average precision.
- And it did this without extra dense depth labels or LiDAR.
- nuScenes (multiple cameras around the car):
- Their depth-guided module can be plugged into other detectors to improve them.
- Adding it to PETRv2 improved overall score (NDS) by +1.2%.
- Adding it to BEVFormer improved NDS by +0.9%.
- This shows the idea works beyond just single images.
Why this matters:
- Getting accurate 3D information from one camera is tough, but it’s cheaper and more common than LiDAR.
- Better depth-guided attention helps the model understand the whole scene, not just tiny areas around object centers.
Why Is This Important? (Implications and Impact)
- Safer and more affordable self-driving: If one camera can reliably understand 3D, cars can be cheaper and still safe.
- Less dependence on expensive sensors: No need for LiDAR or full-depth maps during training.
- General and flexible: The depth-guided parts can be added to other systems to improve them.
- A new baseline: MonoDETR shows that guiding attention with depth cues is a strong direction for future research in 3D detection.
Quick Recap
- Problem: 3D detection from a single image is hard because depth is missing.
- Idea: Teach the model to use a learned depth map to guide where it looks.
- How: Two encoders (visual + depth) feed a depth-guided Transformer with object queries and attention.
- Results: Best-in-class on KITTI (single camera) and boosts other multi-camera methods on nuScenes.
- Impact: More accurate, flexible, and cost-friendly 3D perception.