
Vision-based 3D occupancy prediction in autonomous driving: a review and outlook (2405.02595v2)

Published 4 May 2024 in cs.CV

Abstract: In recent years, autonomous driving has garnered escalating attention for its potential to relieve drivers' burdens and improve driving safety. Vision-based 3D occupancy prediction, which predicts the spatial occupancy status and semantics of 3D voxel grids around the autonomous vehicle from image inputs, is an emerging perception task well suited to cost-effective perception systems for autonomous driving. Although numerous studies have demonstrated the advantages of 3D occupancy prediction over object-centric perception tasks, a dedicated review of this rapidly developing field is still lacking. In this paper, we first introduce the background of vision-based 3D occupancy prediction and discuss the challenges of the task. Second, we conduct a comprehensive survey of progress in vision-based 3D occupancy prediction from three aspects: feature enhancement, deployment friendliness, and label efficiency, and provide an in-depth analysis of the potential and challenges of each category of methods. Finally, we summarize prevailing research trends and propose some inspiring future outlooks. To provide a valuable reference for researchers, a regularly updated collection of related papers, datasets, and code is organized at https://github.com/zya3d/Awesome-3D-Occupancy-Prediction.

References (92)
  1. Monocular 3d object detection for autonomous driving. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, 2147–2156
  2. Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, 2040–2049
  3. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, 8445–8453
  4. Stereo r-cnn based 3d object detection for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 7644–7652
  5. Dsgn: Deep stereo geometry network for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, 12536–12545
  6. 3d object detection from images for autonomous driving: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023
  7. Second: Sparsely embedded convolutional detection. Sensors, 2018, 18(10): 3337
  8. Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021, 11784–11793
  9. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence. 2021, 1201–1209
  10. Pointrcnn: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, 770–779
  11. Pc-rgnn: Point cloud completion and graph neural network for 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence. 2021, 3430–3437
  12. Octr: Octree-based transformer for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023, 5166–5175
  13. Pointpainting: Sequential fusion for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, 4604–4612
  14. Clocs: Camera-lidar object candidates fusion for 3d object detection. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2020, 10386–10393
  15. Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017, 1907–1915
  16. Epnet: Enhancing point features with image semantics for 3d object detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. 2020, 35–52
  17. Cat-det: Contrastively augmented transformer for multi-modal 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 908–917
  18. Multi-modal 3d object detection in autonomous driving: A survey and taxonomy. IEEE Transactions on Intelligent Vehicles, 2023
  19. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021
  20. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2023, 1477–1485
  21. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: European conference on computer vision. 2022, 1–18
  22. Sa-bev: Generating semantic-aware bird’s-eye-view feature for multi-view 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 3348–3357
  23. Vision-centric bev perception: A survey. arXiv preprint arXiv:2208.02797, 2022
  24. Occupancy networks: Learning 3d reconstruction in function space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, 4460–4470
  25. Convolutional occupancy networks. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. 2020, 523–540
  26. nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, 11621–11631
  27. Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, 2446–2454
  28. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 21729–21740
  29. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. arXiv preprint arXiv:2303.03991, 2023
  30. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. arXiv preprint arXiv:2304.14365, 2023
  31. Scene as occupancy. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 8406–8415
  32. Vdbfusion: Flexible and efficient tsdf integration of range sensor data. Sensors, 2022, 22(3): 1296
  33. Indoor segmentation and support inference from rgbd images. In: Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. 2012, 746–760
  34. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In: Proceedings of the IEEE/CVF international conference on computer vision. 2019, 9297–9307
  35. Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition. 2012, 3354–3361
  36. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 9087–9098
  37. Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 3101–3109
  38. Monoscene: Monocular 3d semantic scene completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 3991–4001
  39. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2023, 1486–1494
  40. Tri-perspective view for vision-based 3d semantic occupancy prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 9223–9232
  41. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. arXiv preprint arXiv:2304.05316, 2023
  42. Symphonize 3d semantic scene completion with contextual instance queries. arXiv preprint arXiv:2306.15670, 2023
  43. Camera-based 3d semantic scene completion with sparse guidance network. arXiv preprint arXiv:2312.05752, 2023
  44. Milo: Multi-task learning with localization ambiguity suppression for occupancy prediction, cvpr 2023 occupancy challenge report. arXiv preprint arXiv:2306.11414, 2023
  45. Fb-bev: Bev representation from forward-backward view transformations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 6919–6928
  46. Stereoscene: Bev-assisted stereo matching empowers 3d semantic scene completion. arXiv preprint arXiv:2303.13959, 2023
  47. Occtransformer: Improving bevformer for 3d camera-only occupancy prediction. arXiv preprint arXiv:2402.18140, 2024
  48. 3d sketch-aware semantic scene completion via semi-supervised structure prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 4193–4202
  49. Anisotropic convolutional networks for 3d semantic scene completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 3351–3359
  50. Lmscnet: Lightweight multiscale 3d semantic completion. In: 2020 International Conference on 3D Vision (3DV). 2020, 111–119
  51. Fully sparse 3d panoptic occupancy prediction. arXiv preprint arXiv:2312.17118, 2023
  52. Octreeocc: Efficient and multi-granularity occupancy prediction using octree queries. arXiv preprint arXiv:2312.03774, 2023
  53. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023
  54. Depthssc: Depth-spatial alignment and dynamic voxel resolution for monocular 3d semantic scene completion. arXiv preprint arXiv:2311.17084, 2023
  55. Multi-scale occ: 4th place solution for cvpr 2023 3d occupancy prediction challenge. arXiv preprint arXiv:2306.11414, 2023
  56. Fastocc: Accelerating 3d occupancy prediction by fusing the 2d bird’s-eye view and perspective view. arXiv preprint arXiv:2403.02710, 2024
  57. Monoocc: Digging into monocular semantic occupancy prediction. arXiv preprint arXiv:2403.08766, 2024
  58. Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation. arXiv preprint arXiv:2306.10013, 2023
  59. Ovo: Open-vocabulary occupancy. arXiv preprint arXiv:2305.16133, 2023
  60. Selfocc: Self-supervised vision-based 3d occupancy prediction. arXiv preprint arXiv:2311.12754, 2023
  61. Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields. arXiv preprint arXiv:2312.09243, 2023
  62. Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision. arXiv preprint arXiv:2309.09502, 2023
  63. Uniocc: Unifying vision-centric 3d occupancy prediction with geometric and semantic rendering. arXiv preprint arXiv:2306.09117, 2023
  64. Radocc: Learning cross-modality occupancy knowledge through rendering assisted distillation. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 7060–7068
  65. Occflownet: Towards self-supervised occupancy estimation via differentiable rendering and occupancy flow. arXiv preprint arXiv:2402.12792, 2024
  66. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. 2020, 194–210
  67. Fb-occ: 3d occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492, 2023
  68. S2tpvformer: Spatio-temporal tri-perspective view for temporally coherent 3d semantic occupancy prediction. arXiv preprint arXiv:2401.13785, 2024
  69. Pointocc: Cylindrical tri-perspective view for point-based 3d semantic occupancy prediction. arXiv preprint arXiv:2308.16896, 2023
  70. U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. 2015, 234–241
  71. Inversematrixvt3d: An efficient projection matrix-based approach for 3d occupancy prediction. arXiv preprint arXiv:2401.12422, 2024
  72. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, 770–778
  73. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, 2117–2125
  74. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023
  75. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020
  76. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022, 16000–16009
  77. Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022, 1290–1299
  78. Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction. arXiv preprint arXiv:2312.01919, 2023
  79. Univision: A unified framework for vision-centric 3d perception. arXiv preprint arXiv:2401.06994, 2024
  80. Learning occupancy for monocular 3d object detection. arXiv preprint arXiv:2305.15694, 2023
  81. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021, 65(1): 99–106
  82. A simple attempt for 3d occupancy estimation in autonomous driving. arXiv preprint arXiv:2303.10076, 2023
  83. Regulating intermediate 3d features for vision-centric autonomous driving. arXiv preprint arXiv:2312.11837, 2023
  84. Behind the scenes: Density fields for single view reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 9076–9086
  85. Panacea: Panoramic and controllable video generation for autonomous driving. arXiv preprint arXiv:2311.16813, 2023
  86. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023
  87. Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation. arXiv preprint arXiv:2312.02934, 2023
  88. Occworld: Learning a 3d occupancy world model for autonomous driving. arXiv preprint arXiv:2311.16038, 2023
  89. Uniworld: Autonomous driving pre-training via world models. arXiv preprint arXiv:2308.07234, 2023
  90. Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles. arXiv preprint arXiv:2402.07635, 2024
  91. Pop-3d: Open-vocabulary 3d occupancy prediction from images. Advances in Neural Information Processing Systems, 2024, 36
  92. Cam4docc: Benchmark for camera-only 4d occupancy forecasting in autonomous driving applications. arXiv preprint arXiv:2311.17663, 2023
Authors (5)
  1. Yanan Zhang (39 papers)
  2. Jinqing Zhang (6 papers)
  3. Zengran Wang (4 papers)
  4. Junhao Xu (19 papers)
  5. Di Huang (203 papers)
Citations (5)

Summary

  • The paper provides an in-depth review of vision-based 3D occupancy prediction by categorizing methods into feature enhancement, deployment-friendly, and label-efficient approaches.
  • It reports that multi-view and voxel-based techniques achieve competitive mIoU scores at reduced computational cost, supporting real-time performance.
  • The paper outlines future directions that advocate for unified frameworks and self-supervised strategies to advance dynamic, 4D perception in autonomous driving.

Vision-based 3D Occupancy Prediction in Autonomous Driving: Insights and Developments

Vision-based 3D occupancy prediction has emerged as a pivotal perception task in autonomous driving, aimed at predicting the spatial occupancy status and semantics of 3D voxel grids from image inputs. This paper provides an in-depth review of the progress, challenges, and future directions of the field. As an increasingly promising approach, vision-based 3D occupancy prediction offers a fine-grained spatial representation that is critical for autonomous navigation and for detecting undefined, long-tail obstacles.
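
To make the prediction target concrete, here is a minimal sketch of a semantic voxel grid in Python/NumPy. The extents, 0.4 m resolution, and label count follow common benchmark configurations (e.g., Occ3D-nuScenes), but exact values vary per dataset, and the helper `world_to_voxel` is an illustrative name rather than an API from any surveyed codebase.

```python
import numpy as np

# Illustrative occupancy grid in the style of Occ3D-nuScenes: a 200 x 200 x 16
# voxel volume (0.4 m voxels) around the ego vehicle; each cell stores a
# semantic class index, with a dedicated "free" label for empty space.
# Exact extents, resolution, and label set vary across benchmarks.
GRID_SHAPE = (200, 200, 16)                 # (x, y, z) voxels
VOXEL_SIZE = 0.4                            # metres per voxel
RANGE_MIN = np.array([-40.0, -40.0, -1.0])  # grid origin in the ego frame (m)
FREE_LABEL = 17                             # classes 0..16 semantic, 17 = free

occupancy = np.full(GRID_SHAPE, FREE_LABEL, dtype=np.uint8)

def world_to_voxel(xyz):
    """Map ego-frame points (N, 3) in metres to in-bounds voxel indices."""
    idx = np.floor((xyz - RANGE_MIN) / VOXEL_SIZE).astype(int)
    in_bounds = np.all((idx >= 0) & (idx < np.array(GRID_SHAPE)), axis=1)
    return idx[in_bounds]

# Example: mark a point 5 m ahead of the ego vehicle as class 0 (e.g., car).
pts = np.array([[5.0, 0.0, 0.5]])
for i, j, k in world_to_voxel(pts):
    occupancy[i, j, k] = 0
```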

The paper analyzes existing methods along three axes: feature enhancement, deployment friendliness, and label efficiency. This taxonomy both lays out a comprehensive landscape of current methodologies and provides a consistent framework for comparing them. Each category addresses specific challenges; feature enhancement, for instance, focuses on improving the 3D features lifted from 2D images, thereby sharpening the semantic and spatial accuracy of predictions.

Feature Enhancement Methods

In the domain of feature enhancement, BEV, TPV, and voxel-based representations constitute the main design axis. Methods such as TPVFormer leverage tri-perspective-view representations to bridge the gap between 2D and 3D spaces, while methods such as VoxFormer use sparse voxel queries to concentrate computation on likely-occupied regions. These approaches deliver clear gains in semantic segmentation accuracy, with competitive mIoU scores on benchmarks such as Occ3D-nuScenes.
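
As a rough illustration of the 2D-to-3D lifting these methods build on, the sketch below backward-projects voxel centres into a single camera's feature map and samples features at the resulting pixels. Real systems fuse multiple cameras and use bilinear sampling, deformable attention, or learned depth weighting, so treat this single-camera, nearest-neighbour version as a simplification under stated assumptions; all names are hypothetical.

```python
import numpy as np

def lift_image_features(feat_2d, K, T_cam_from_ego, voxel_centers):
    """Backward projection: sample a 2D feature map at the pixel locations
    of 3D voxel centres. feat_2d: (H, W, C); K: (3, 3) intrinsics;
    T_cam_from_ego: (4, 4) extrinsics; voxel_centers: (N, 3), ego frame.
    Returns (N, C) features, zeros for voxels outside the camera frustum."""
    H, W, C = feat_2d.shape
    homo = np.concatenate([voxel_centers,
                           np.ones((len(voxel_centers), 1))], axis=1)
    cam = (T_cam_from_ego @ homo.T).T[:, :3]             # ego -> camera frame
    in_front = cam[:, 2] > 0.1                           # points ahead of camera
    pix = (K @ cam.T).T
    pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)  # perspective divide
    u = pix[:, 0].round().astype(int)
    v = pix[:, 1].round().astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((len(voxel_centers), C), dtype=feat_2d.dtype)
    out[valid] = feat_2d[v[valid], u[valid]]             # nearest-neighbour sample
    return out
```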

Deployment-friendly Methods

Given the high computational cost of dense 3D processing, deployment-friendly methods such as FlashOcc aim to minimize memory and latency while maintaining accuracy; FlashOcc, for example, keeps the network entirely in 2D and recovers the 3D volume with a channel-to-height reshape. Perspective decomposition techniques and coarse-to-fine learning paradigms offer further avenues for efficient 3D occupancy computation. Reducing computational cost without a significant loss of accuracy is crucial for real-time operation in autonomous vehicles.
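
Below is a minimal sketch of the channel-to-height idea, assuming illustrative shapes: the channel dimension of a 2D BEV prediction is reinterpreted as (classes x height) to yield a 3D voxel output without any 3D convolutions. The actual method's dimensions and head design may differ.

```python
import numpy as np

# Channel-to-height in the spirit of FlashOcc: run the whole network in 2D
# on BEV features, then reshape channels into (classes, height) to recover
# a 3D voxel prediction. Shapes are illustrative, not the paper's config.
B, H, W = 1, 200, 200          # batch, BEV spatial extent
Z, NUM_CLASSES = 16, 18        # height bins, semantic classes (incl. free)

bev_logits = np.random.randn(B, NUM_CLASSES * Z, H, W).astype(np.float32)

# (B, C*Z, H, W) -> (B, C, Z, H, W): a pure reshape, so it adds no FLOPs
# and keeps the whole model 2D, which simplifies deployment/export.
voxel_logits = bev_logits.reshape(B, NUM_CLASSES, Z, H, W)
pred = voxel_logits.argmax(axis=1)   # (B, Z, H, W) semantic occupancy
```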

Label-efficient Methods

The paper also explores label-efficient methods that eschew dense 3D annotations, which are costly and impractical at scale. Techniques such as UniOcc employ neural rendering to reduce annotation dependency, supervising 3D occupancy with 2D signals instead of voxel-level labels. This direction aligns with the longer-term goal of robust self-supervised frameworks for 3D occupancy prediction.
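
The core mechanism in these rendering-supervised methods is differentiable volume rendering along camera rays, so that a 3D occupancy field receives gradients from 2D depth or segmentation labels. The sketch below computes NeRF-style rendering weights for a single ray; it is a simplified stand-in for the actual formulations in UniOcc or RenderOcc, and the function name and shapes are assumptions.

```python
import numpy as np

def render_ray(densities, semantics, t_vals):
    """Volume-render per-sample densities (S,) and semantic logits (S, C)
    along one ray into a pixel-level depth and class distribution, so that
    3D occupancy can be supervised with 2D labels only.
    t_vals: (S,) sample depths along the ray, ascending."""
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)  # inter-sample spacing
    alphas = 1.0 - np.exp(-densities * deltas)          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
    weights = alphas * trans                            # NeRF-style ray weights
    depth = (weights * t_vals).sum()                    # expected ray depth
    sem = (weights[:, None] * semantics).sum(axis=0)    # rendered semantics
    return depth, sem  # compare against 2D depth / segmentation labels
```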

Implications and Future Outlook

The authors underscore the value of generating more realistic driving scenarios through synthetic data and of advances in world-model frameworks. The exploration of multi-agent collaborative perception paves the way for more holistic environmental awareness by fusing features from multiple vehicles.

Nevertheless, the paper identifies an overarching need for a unified framework that combines feature enhancement, deployment efficiency, and minimal labeling, suggesting that future work should focus on integrating these aspects. Furthermore, open-vocabulary and dynamic (4D) perception stand out as critical areas, promising to broaden and deepen an autonomous vehicle's understanding of real-world environments.

The proposed outlook offers a coherent path forward, urging researchers to explore innovative solutions that intertwine theoretical advancements with practical applications. As the field progresses, vision-based 3D occupancy prediction remains an essential component in realizing the full potential of autonomous driving technologies.
