COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction (2312.01919v2)

Published 4 Dec 2023 in cs.CV

Abstract: The autonomous driving community has shown significant interest in 3D occupancy prediction, driven by its exceptional geometric perception and general object recognition capabilities. To achieve this, current works try to construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation that extends Bird's-Eye-View (BEV) perception. However, compressed views such as the TPV representation lose 3D geometric information, while the raw, sparse OCC representation incurs heavy and redundant computational cost. To address these limitations, we propose the Compact Occupancy TRansformer (COTR), which uses a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. The occupancy encoder first generates a compact geometric OCC feature through efficient explicit-implicit view transformation. The occupancy decoder then enhances the semantic discriminability of the compact OCC representation with a coarse-to-fine semantic grouping strategy. Experiments show evident performance gains across multiple baselines: COTR outperforms them with a relative improvement of 8%-15%, demonstrating the superiority of our method.
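
The "relative improvement" in the abstract is presumably the gain over a baseline divided by the baseline's score. A minimal sketch of that arithmetic follows, assuming a single scalar metric where higher is better; the function name `relative_improvement` and the example scores are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores, chosen only to illustrate the arithmetic (not from the paper).
baseline_score = 40.0
improved_score = 44.0

gain = relative_improvement(baseline_score, improved_score)
print(f"{gain:.1f}% relative improvement")
```

Under this reading, an 8%-15% relative improvement means the improved score is 1.08 to 1.15 times the baseline score, which differs from an absolute gain of 8 to 15 points.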
