Instance-aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning (2312.08004v1)

Published 13 Dec 2023 in cs.CV

Abstract: The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field. Under this paradigm, accurate BEV representation construction relies on reliable depth estimation for multi-camera images. However, existing approaches exhaustively predict depths for every pixel without prioritizing objects, which are precisely the entities requiring detection in 3D space. To this end, we propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector. First, a category-specific structural priors mining approach is proposed to enhance the efficacy of monocular depth generation. In addition, a self-boosting learning strategy is proposed to encourage the model to place more emphasis on challenging objects during computationally expensive temporal stereo matching. Together they provide advanced depth estimation results for high-quality BEV feature construction, benefiting the ultimate 3D detection. The proposed method achieves state-of-the-art performance on the challenging nuScenes benchmark, and extensive experimental results demonstrate the effectiveness of our designs.
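
As a rough illustration of the instance-aware idea described in the abstract, the sketch below upweights depth supervision on pixels that fall inside object instances, so the depth head spends its capacity on the entities that actually need to be detected. This is a minimal example of ours, assuming a BEVDepth-style categorical depth head and sparse LiDAR-projected depth labels; the function and argument names (`instance_weighted_depth_loss`, `fg_weight`, etc.) are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def instance_weighted_depth_loss(depth_logits: torch.Tensor,
                                 gt_depth_bins: torch.Tensor,
                                 instance_mask: torch.Tensor,
                                 fg_weight: float = 5.0) -> torch.Tensor:
    """Cross-entropy over discretized depth bins, upweighted on object pixels.

    depth_logits:  (B, D, H, W) unnormalized scores over D depth bins.
    gt_depth_bins: (B, H, W) long tensor of ground-truth bin indices; -1 marks
                   pixels with no projected LiDAR depth (unsupervised).
    instance_mask: (B, H, W) bool/0-1 tensor, 1 where the pixel lies inside a
                   2D object instance (e.g. from an off-the-shelf 2D detector).
    fg_weight:     hypothetical scalar emphasis placed on object pixels.
    """
    valid = gt_depth_bins >= 0
    # Per-pixel cross-entropy; -1 targets are clamped here and masked out below.
    ce = F.cross_entropy(depth_logits, gt_depth_bins.clamp(min=0),
                         reduction="none")                       # (B, H, W)
    # Unit weight on background pixels, fg_weight on object pixels.
    weights = torch.where(instance_mask.bool(),
                          torch.full_like(ce, fg_weight),
                          torch.ones_like(ce))
    ce = ce * weights * valid
    # Weighted mean over the supervised pixels only.
    return ce.sum() / weights[valid].sum().clamp(min=1.0)


if __name__ == "__main__":
    B, D, H, W = 2, 64, 32, 88                     # toy sizes
    logits = torch.randn(B, D, H, W)
    gt = torch.randint(-1, D, (B, H, W))           # -1 = no LiDAR point
    mask = torch.rand(B, H, W) > 0.8               # stand-in instance mask
    print(instance_weighted_depth_loss(logits, gt, mask))
```

The paper's actual mechanisms are richer than this (category-specific structural priors for monocular depth and a self-boosting weighting for temporal stereo matching), but the weighting above conveys the core intuition of prioritizing object pixels over background during depth learning.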
