FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation (2307.01492v1)
Abstract: This technical report summarizes the winning solution for the 3D Occupancy Prediction Challenge, held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and the CVPR 2023 Workshop on Vision-Centric Autonomous Driving. Our proposed solution, FB-OCC, builds upon FB-BEV, a cutting-edge camera-based bird's-eye-view perception design using forward-backward projection. On top of FB-BEV, we further study novel designs and optimizations tailored to the 3D occupancy prediction task, including joint depth-semantic pre-training, joint voxel-BEV representation, model scaling up, and effective post-processing strategies. These designs and optimizations result in a state-of-the-art mIoU score of 54.19% on the nuScenes dataset, ranking first in the challenge track. Code and models will be released at: https://github.com/NVlabs/FB-BEV.
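For readers unfamiliar with the reported metric, the following is a minimal sketch of how a mean intersection-over-union (mIoU) score can be computed over per-voxel class labels. It is a generic illustration rather than the challenge's evaluation code: the function name `mean_iou`, the flattened-list input layout, and the optional `ignore_label` parameter are assumptions made here, and the abstract does not describe details of the official protocol (such as visibility masking or the class list).

```python
def mean_iou(pred, target, num_classes, ignore_label=None):
    """Mean IoU over classes for two equal-length sequences of class ids.

    pred and target are flattened voxel grids (hypothetical layout).
    Voxels whose target equals ignore_label are skipped.
    Classes absent from both pred and target are left out of the mean.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    for p, t in zip(pred, target):
        if ignore_label is not None and t == ignore_label:
            continue
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        # |A union B| = |A| + |B| - |A intersect B|
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six voxels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. A reported score such as 54.19% corresponds to this kind of class-averaged ratio expressed as a percentage.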