Vision-based 3D occupancy prediction in autonomous driving: a review and outlook (2405.02595v2)

Published 4 May 2024 in cs.CV

Abstract: In recent years, autonomous driving has garnered escalating attention for its potential to relieve drivers' burdens and improve driving safety. Vision-based 3D occupancy prediction, which predicts the spatial occupancy status and semantics of 3D voxel grids around the autonomous vehicle from image inputs, is an emerging perception task suitable for a cost-effective perception system for autonomous driving. Although numerous studies have demonstrated the advantages of 3D occupancy prediction over object-centric perception tasks, there is still a lack of a dedicated review focusing on this rapidly developing field. In this paper, we first introduce the background of vision-based 3D occupancy prediction and discuss the challenges in this task. Secondly, we conduct a comprehensive survey of the progress in vision-based 3D occupancy prediction from three aspects: feature enhancement, deployment friendliness and label efficiency, and provide an in-depth analysis of the potentials and challenges of each category of methods. Finally, we present a summary of prevailing research trends and propose some inspiring future outlooks. To provide a valuable reference for researchers, a regularly updated collection of related papers, datasets, and codes is organized at https://github.com/zya3d/Awesome-3D-Occupancy-Prediction.

References (92)
  1. Monocular 3d object detection for autonomous driving. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, 2147–2156
  2. Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, 2040–2049
  3. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, 8445–8453
  4. Stereo r-cnn based 3d object detection for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 7644–7652
  5. Dsgn: Deep stereo geometry network for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, 12536–12545
  6. 3d object detection from images for autonomous driving: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023
  7. Second: Sparsely embedded convolutional detection. Sensors, 2018, 18(10): 3337
  8. Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021, 11784–11793
  9. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence. 2021, 1201–1209
  10. Pointrcnn: 3d object proposal generation and detection from point cloud. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, 770–779
  11. Pc-rgnn: Point cloud completion and graph neural network for 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence. 2021, 3430–3437
  12. Octr: Octree-based transformer for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023, 5166–5175
  13. Pointpainting: Sequential fusion for 3d object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, 4604–4612
  14. Clocs: Camera-lidar object candidates fusion for 3d object detection. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2020, 10386–10393
  15. Multi-view 3d object detection network for autonomous driving. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017, 1907–1915
  16. Epnet: Enhancing point features with image semantics for 3d object detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. 2020, 35–52
  17. Cat-det: Contrastively augmented transformer for multi-modal 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 908–917
  18. Multi-modal 3d object detection in autonomous driving: A survey and taxonomy. IEEE Transactions on Intelligent Vehicles, 2023
  19. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021
  20. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2023, 1477–1485
  21. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: European conference on computer vision. 2022, 1–18
  22. Sa-bev: Generating semantic-aware bird’s-eye-view feature for multi-view 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 3348–3357
  23. Vision-centric bev perception: A survey. arXiv preprint arXiv:2208.02797, 2022
  24. Occupancy networks: Learning 3d reconstruction in function space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, 4460–4470
  25. Convolutional occupancy networks. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. 2020, 523–540
  26. nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, 11621–11631
  27. Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, 2446–2454
  28. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 21729–21740
  29. Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. arXiv preprint arXiv:2303.03991, 2023
  30. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. arXiv preprint arXiv:2304.14365, 2023
  31. Scene as occupancy. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 8406–8415
  32. Vdbfusion: Flexible and efficient tsdf integration of range sensor data. Sensors, 2022, 22(3): 1296
  33. Indoor segmentation and support inference from rgbd images. In: Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. 2012, 746–760
  34. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In: Proceedings of the IEEE/CVF international conference on computer vision. 2019, 9297–9307
  35. Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition. 2012, 3354–3361
  36. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 9087–9098
  37. Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 3101–3109
  38. Cao A Q, de Charette R. Monoscene: Monocular 3d semantic scene completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 3991–4001
  39. Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2023, 1486–1494
  40. Tri-perspective view for vision-based 3d semantic occupancy prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 9223–9232
  41. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. arXiv preprint arXiv:2304.05316, 2023
  42. Symphonize 3d semantic scene completion with contextual instance queries. arXiv preprint arXiv:2306.15670, 2023
  43. Camera-based 3d semantic scene completion with sparse guidance network. arXiv preprint arXiv:2312.05752, 2023
  44. Milo: Multi-task learning with localization ambiguity suppression for occupancy prediction, cvpr 2023 occupancy challenge report. arXiv preprint arXiv:2306.11414, 2023
  45. Fb-bev: Bev representation from forward-backward view transformations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 6919–6928
  46. Stereoscene: Bev-assisted stereo matching empowers 3d semantic scene completion. arXiv preprint arXiv:2303.13959, 2023
  47. Occtransformer: Improving bevformer for 3d camera-only occupancy prediction. arXiv preprint arXiv:2402.18140, 2024
  48. 3d sketch-aware semantic scene completion via semi-supervised structure prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 4193–4202
  49. Anisotropic convolutional networks for 3d semantic scene completion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 3351–3359
  50. Lmscnet: Lightweight multiscale 3d semantic completion. In: 2020 International Conference on 3D Vision (3DV). 2020, 111–119
  51. Fully sparse 3d panoptic occupancy prediction. arXiv preprint arXiv:2312.17118, 2023
  52. Octreeocc: Efficient and multi-granularity occupancy prediction using octree queries. arXiv preprint arXiv:2312.03774, 2023
  53. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023
  54. Yao J, Zhang J. Depthssc: Depth-spatial alignment and dynamic voxel resolution for monocular 3d semantic scene completion. arXiv preprint arXiv:2311.17084, 2023
  55. Multi-scale occ: 4th place solution for cvpr 2023 3d occupancy prediction challenge. arXiv preprint arXiv:2306.11414, 2023
  56. Fastocc: Accelerating 3d occupancy prediction by fusing the 2d bird’s-eye view and perspective view. arXiv preprint arXiv:2403.02710, 2024
  57. Monoocc: Digging into monocular semantic occupancy prediction. arXiv preprint arXiv:2403.08766, 2024
  58. Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation. arXiv preprint arXiv:2306.10013, 2023
  59. Ovo: Open-vocabulary occupancy. arXiv preprint arXiv:2305.16133, 2023
  60. Selfocc: Self-supervised vision-based 3d occupancy prediction. arXiv preprint arXiv:2311.12754, 2023
  61. Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields. arXiv preprint arXiv:2312.09243, 2023
  62. Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision. arXiv preprint arXiv:2309.09502, 2023
  63. Uniocc: Unifying vision-centric 3d occupancy prediction with geometric and semantic rendering. arXiv preprint arXiv:2306.09117, 2023
  64. Radocc: Learning cross-modality occupancy knowledge through rendering assisted distillation. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 7060–7068
  65. Occflownet: Towards self-supervised occupancy estimation via differentiable rendering and occupancy flow. arXiv preprint arXiv:2402.12792, 2024
  66. Philion J, Fidler S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. 2020, 194–210
  67. Fb-occ: 3d occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492, 2023
  68. S2tpvformer: Spatio-temporal tri-perspective view for temporally coherent 3d semantic occupancy prediction. arXiv preprint arXiv:2401.13785, 2024
  69. Pointocc: Cylindrical tri-perspective view for point-based 3d semantic occupancy prediction. arXiv preprint arXiv:2308.16896, 2023
  70. U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. 2015, 234–241
  71. Inversematrixvt3d: An efficient projection matrix-based approach for 3d occupancy prediction. arXiv preprint arXiv:2401.12422, 2024
  72. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, 770–778
  73. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, 2117–2125
  74. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023
  75. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020
  76. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022, 16000–16009
  77. Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022, 1290–1299
  78. Cotr: Compact occupancy transformer for vision-based 3d occupancy prediction. arXiv preprint arXiv:2312.01919, 2023
  79. Univision: A unified framework for vision-centric 3d perception. arXiv preprint arXiv:2401.06994, 2024
  80. Learning occupancy for monocular 3d object detection. arXiv preprint arXiv:2305.15694, 2023
  81. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021, 65(1): 99–106
  82. A simple attempt for 3d occupancy estimation in autonomous driving. arXiv preprint arXiv:2303.10076, 2023
  83. Regulating intermediate 3d features for vision-centric autonomous driving. arXiv preprint arXiv:2312.11837, 2023
  84. Behind the scenes: Density fields for single view reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 9076–9086
  85. Panacea: Panoramic and controllable video generation for autonomous driving. arXiv preprint arXiv:2311.16813, 2023
  86. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023
  87. Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation. arXiv preprint arXiv:2312.02934, 2023
  88. Occworld: Learning a 3d occupancy world model for autonomous driving. arXiv preprint arXiv:2311.16038, 2023
  89. Uniworld: Autonomous driving pre-training via world models. arXiv preprint arXiv:2308.07234, 2023
  90. Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles. arXiv preprint arXiv:2402.07635, 2024
  91. Pop-3d: Open-vocabulary 3d occupancy prediction from images. Advances in Neural Information Processing Systems, 2024, 36
  92. Cam4docc: Benchmark for camera-only 4d occupancy forecasting in autonomous driving applications. arXiv preprint arXiv:2311.17663, 2023

Summary

  • The paper provides a comprehensive review of vision-based 3D occupancy prediction methods using BEV, TPV, and voxel representations.
  • The paper identifies key challenges in generating dense 3D occupancy annotations and reviews current datasets and evaluation metrics such as mIoU and IoU.
  • The implications include promising future directions such as synthetic data generation, multi-agent collaboration, and integration of temporal dynamics.

Vision-based 3D Occupancy Prediction in Autonomous Driving: A Review and Outlook

Vision-based 3D occupancy prediction is emerging as a promising perception task in autonomous driving, providing a cost-effective approach to understanding the spatial occupancy and semantics of environments surrounding a vehicle. This paper presents a detailed examination of current approaches, challenges, and future directions for 3D occupancy prediction derived from image inputs.

Challenges in Vision-based 3D Occupancy Prediction

Task Definition and Ground Truth Generation

The primary task of vision-based 3D occupancy prediction is to classify each voxel in a 3D space based on camera inputs as either occupied or unoccupied, with additional semantic classification if occupied. Ground truth for this task is typically derived from LiDAR point clouds, but these are sparse and introduce challenges in generating dense annotations necessary for effective model training (Figure 1).

Figure 1: Visual comparison of 3D occupancy annotations.
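To make the voxel-grid output format concrete, here is a minimal Python sketch. The grid dimensions, voxel size, class count, and the `voxel_class` helper are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical grid: 200 x 200 x 16 voxels at 0.5 m resolution around the ego vehicle.
X, Y, Z = 200, 200, 16
FREE = 0  # class 0 denotes an unoccupied voxel; other IDs are semantic classes

# A model's output: one class label per voxel (filled randomly here as a stand-in).
occupancy = np.random.randint(0, 17, size=(X, Y, Z), dtype=np.uint8)

# Occupied voxels are those not labeled FREE.
occupied_mask = occupancy != FREE
print(f"{occupied_mask.sum()} of {occupancy.size} voxels are occupied")

# Query the semantic class of the voxel containing a point (x, y, z) in meters.
def voxel_class(point, voxel_size=0.5, grid_origin=(-50.0, -50.0, -3.0)):
    idx = tuple(int((p - o) / voxel_size) for p, o in zip(point, grid_origin))
    return occupancy[idx]

print("class at (1.0, 2.0, 0.0):", voxel_class((1.0, 2.0, 0.0)))
```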

Generating dense occupancy annotations typically involves fusing multi-frame LiDAR data and addressing the challenges posed by static and dynamic components of the scene. This complex process introduces additional computational and annotation challenges.
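A rough sketch of the static-scene portion of such an annotation pipeline is shown below. The pose format and voxel size are assumptions, and real pipelines additionally handle dynamic objects, for example with per-frame box tracking:

```python
import numpy as np

def aggregate_static_points(frames, poses):
    """Transform per-frame LiDAR points (N_i, 3) into a common world frame
    using 4x4 ego poses, then stack them into one denser cloud."""
    world_points = []
    for pts, pose in zip(frames, poses):
        homo = np.hstack([pts, np.ones((len(pts), 1))])   # (N, 4) homogeneous
        world_points.append((homo @ pose.T)[:, :3])       # apply ego pose
    return np.vstack(world_points)

def voxelize(points, voxel_size=0.5):
    """Mark every voxel that contains at least one point as occupied."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    return np.unique(idx, axis=0)  # (M, 3) integer voxel coordinates

# Toy usage: two frames of random points with identity ego poses.
frames = [np.random.rand(1000, 3) * 10 for _ in range(2)]
poses = [np.eye(4), np.eye(4)]
dense = aggregate_static_points(frames, poses)
print(voxelize(dense).shape)
```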

Datasets and Evaluation Metrics

Common datasets like SemanticKITTI and nuScenes provide foundational data, but 3D occupancy prediction demands fine-grained semantic detail not fully captured by these datasets. Current evaluation metrics such as Mean Intersection over Union (mIoU) and Intersection over Union (IoU) fall short of reflecting the detailed occupancy dynamics required for robust deployment of autonomous systems.
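For reference, voxel-level IoU (geometric) and mIoU (semantic) can be computed roughly as follows. This is a minimal sketch; benchmarks differ in how they mask unobserved voxels:

```python
import numpy as np

def occupancy_metrics(pred, gt, num_classes, free_id=0):
    """Geometric IoU over occupied-vs-free, plus semantic mIoU over classes."""
    # Geometric IoU: treat any non-free label as "occupied".
    p_occ, g_occ = pred != free_id, gt != free_id
    iou = (p_occ & g_occ).sum() / max((p_occ | g_occ).sum(), 1)

    # Semantic mIoU: per-class IoU averaged over classes present in the GT.
    ious = []
    for c in range(num_classes):
        if c == free_id or not (gt == c).any():
            continue
        inter = ((pred == c) & (gt == c)).sum()
        union = ((pred == c) | (gt == c)).sum()
        ious.append(inter / union)
    return iou, float(np.mean(ious)) if ious else 0.0

pred = np.random.randint(0, 5, (50, 50, 8))
gt = np.random.randint(0, 5, (50, 50, 8))
print(occupancy_metrics(pred, gt, num_classes=5))
```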

Taxonomy of Methods

The paper groups methods into three categories, feature enhancement, deployment friendliness, and label efficiency, each addressing specific challenges in 3D occupancy prediction.

Feature Enhancement Methods

These methods aim to improve the model's ability to recover 3D structure from 2D inputs, using representations such as BEV, TPV, and voxel grids (Figure 2).

Figure 2: Hierarchically-structured taxonomy of vision-based 3D occupancy prediction for autonomous driving.

  • BEV-based Methods: Utilize bird's-eye view representations to extract spatial information, providing robustness against occlusion and depth ambiguities (Figure 3).

    Figure 3: Illustration of BEV-based methods.

  • TPV-based Methods: Introduce tri-perspective views to enhance spatial understanding, allowing for a more comprehensive capture of 3D scene geometry (Figure 4).

    Figure 4: Illustration of TPV-based methods.

  • Voxel-based Methods: Directly operate on 3D voxel grids for detailed feature extraction, capturing fine-grained spatial details (Figure 5); a minimal projection-based sketch follows this list.

    Figure 5: Illustration of Voxel-based methods.
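The sketch below illustrates the backward-projection idea common to these representations: sampling 2D image features at the projections of 3D voxel centers, here for a single pinhole camera. All shapes and the `backward_project` helper are illustrative assumptions, not a specific method's implementation:

```python
import torch
import torch.nn.functional as F

def backward_project(img_feats, intrinsics, extrinsics, grid_xyz):
    """Sample 2D image features at the projections of 3D voxel centers
    (one camera, pinhole model; a toy version of backward projection).

    img_feats:  (C, H, W) feature map
    intrinsics: (3, 3) camera matrix
    extrinsics: (4, 4) world-to-camera transform
    grid_xyz:   (X, Y, Z, 3) voxel center coordinates in the world frame
    """
    X, Y, Z, _ = grid_xyz.shape
    pts = grid_xyz.reshape(-1, 3)
    homo = torch.cat([pts, torch.ones(len(pts), 1)], dim=1)      # (N, 4)
    cam = (homo @ extrinsics.T)[:, :3]                           # camera frame
    uvw = cam @ intrinsics.T
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-5)                 # pixel coords
    # Normalize to [-1, 1] for grid_sample; out-of-view samples return zeros.
    H, W = img_feats.shape[1:]
    norm = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], -1) * 2 - 1
    sampled = F.grid_sample(img_feats[None], norm[None, :, None],
                            align_corners=True)[0, :, :, 0]      # (C, N)
    sampled[:, cam[:, 2] <= 0] = 0                               # behind camera
    return sampled.T.reshape(X, Y, Z, -1)                        # voxel features

# Toy usage with random inputs.
feats = backward_project(torch.randn(32, 64, 128), torch.eye(3),
                         torch.eye(4), torch.randn(10, 10, 4, 3))
print(feats.shape)  # torch.Size([10, 10, 4, 32])
```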

Deployment-friendly Methods

These approaches prioritize computational efficiency, employing strategies such as perspective decomposition and coarse-to-fine refinement to reduce resource consumption while maintaining model fidelity (Figure 6).

Figure 6: FB-OCC [67] applies forward and backward projection to generate dense BEV features, which are then lifted along the height dimension for 3D occupancy prediction.

By exploiting view transformations and focusing computational efforts on critical areas or using less resource-intensive computation, these methods strive for real-time applicability.
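As one concrete example of such efficiency tricks, the channel-to-height idea popularized by FlashOcc [53] replaces 3D convolutions with 2D convolutions followed by a plain reshape. The sketch below is a toy version with assumed dimensions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ChannelToHeight(nn.Module):
    """Toy channel-to-height head in the spirit of FlashOcc [53]: run cheap
    2D convolutions on BEV features, then reinterpret channels as the
    (class x height) dimensions of a 3D occupancy volume."""

    def __init__(self, bev_channels=256, num_classes=17, grid_z=16):
        super().__init__()
        self.num_classes, self.grid_z = num_classes, grid_z
        self.head = nn.Conv2d(bev_channels, num_classes * grid_z, kernel_size=1)

    def forward(self, bev):                     # bev: (B, C, H, W)
        logits = self.head(bev)                 # (B, classes*Z, H, W)
        B, _, H, W = logits.shape
        # Reshape channels into an explicit height axis: (B, classes, Z, H, W).
        return logits.view(B, self.num_classes, self.grid_z, H, W)

occ = ChannelToHeight()(torch.randn(2, 256, 200, 200))
print(occ.shape)  # torch.Size([2, 17, 16, 200, 200])
```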

Label-efficient Methods

These methods address the expense of annotating data for 3D occupancy tasks. Leveraging neural rendering techniques and unsupervised learning paradigms, they aim to reduce or eliminate dependence on labeled 3D datasets (Figure 7).

Figure 7: Illustration of label-efficient methods.

Semantic guidance from rendered 2D views allows models to learn effectively without dense 3D supervision.
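The core mechanism here is differentiable volume rendering: predicted per-voxel densities are integrated along camera rays into 2D depth (or semantic) maps that can be supervised with readily available 2D labels. Below is a minimal sketch in the spirit of methods like RenderOcc [62], with illustrative shapes and sampling:

```python
import torch

def render_depth(density, t_vals):
    """Differentiable volume rendering of per-ray densities into expected depth.

    density: (R, S) non-negative densities at S samples along R rays
    t_vals:  (S,) sample depths along each ray
    """
    delta = torch.diff(t_vals, append=t_vals[-1:] + 1e10)          # (S,)
    alpha = 1.0 - torch.exp(-density * delta)                      # per-sample opacity
    trans = torch.cumprod(torch.cat(
        [torch.ones(density.shape[0], 1), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]                                             # transmittance
    weights = alpha * trans                                        # (R, S)
    return (weights * t_vals).sum(dim=1)                           # expected depth

# Toy supervision: match rendered depth to a (pseudo-)ground-truth 2D depth map.
density = torch.rand(1024, 64, requires_grad=True)
t_vals = torch.linspace(0.5, 50.0, 64)
loss = torch.nn.functional.l1_loss(render_depth(density, t_vals),
                                   torch.rand(1024) * 50)
loss.backward()  # gradients flow back to the 3D occupancy densities
print(float(loss))
```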

Future Outlook

Data Generation and World Models

Generating synthetic data using 3D occupancy frameworks offers a promising direction for augmenting training datasets without incurring high costs. Utilizing 3D occupancy in world models can enhance long-term prediction capabilities and dynamic scene understanding.

Multi-agent Collaboration

Collaborative perception across multiple vehicles could overcome limitations of single-agent systems in occlusion and range. Effective multi-agent frameworks can enable comprehensive environmental understanding by sharing perceptions across connected systems.

Task Integration

Future research should focus on integrating open-set recognition and 4D temporal dynamics into 3D occupancy frameworks. Combining spatial and temporal aspects with open vocabulary recognition will address the challenges of dynamic and evolving driving environments.

Conclusion

Vision-based 3D occupancy prediction continues to evolve, with promising strides in feature extraction, computational efficiency, and label efficiency. Addressing these challenges collectively and exploring synergies between methods can significantly aid in the advancement of autonomous driving technologies.
