FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation (2307.01492v1)

Published 4 Jul 2023 in cs.CV and cs.RO

Abstract: This technical report summarizes the winning solution for the 3D Occupancy Prediction Challenge, which is held in conjunction with the CVPR 2023 Workshop on End-to-End Autonomous Driving and the CVPR 2023 Workshop on Vision-Centric Autonomous Driving. Our proposed solution FB-OCC builds upon FB-BEV, a cutting-edge camera-based bird's-eye view perception design using forward-backward projection. On top of FB-BEV, we further study novel designs and optimization tailored to the 3D occupancy prediction task, including joint depth-semantic pre-training, joint voxel-BEV representation, model scaling up, and effective post-processing strategies. These designs and optimization result in a state-of-the-art mIoU score of 54.19% on the nuScenes dataset, ranking 1st in the challenge track. Code and models will be released at: https://github.com/NVlabs/FB-BEV.

Summary

The paper combines forward and backward view transformation to predict the occupancy and semantic class of each voxel from camera images.
It employs joint depth-semantic pre-training and a joint voxel-BEV representation, achieving a leading mIoU of 54.19% on the nuScenes dataset.
The model scales up to an InternImage-H backbone while addressing overfitting, and uses post-processing (test-time augmentation and model ensembling) to raise accuracy further.

The paper "FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation" presents a camera-based approach to 3D occupancy prediction for autonomous driving. The task is to predict the occupancy status and semantic class of every voxel in a 3D space, which is central to perception and planning in autonomous vehicles (AVs).

Overview of FB-OCC Solution

The FB-OCC model builds on FB-BEV, a camera-based bird's-eye view (BEV) perception framework that uses forward-backward projection to build 3D representations from camera inputs. On top of FB-BEV, the paper studies four additions:

Joint Depth-Semantic Pre-training: Combining depth estimation with semantic segmentation to enrich geometrical and semantic understanding.
Joint Voxel-BEV Representation: Merging voxel-level data with BEV features for refined occupancy prediction.
Model Scaling and Optimization: Scaling up the model while addressing the overfitting that is typical of large 3D perception models.
Effective Post-Processing Strategies: Including test-time augmentation and ensemble techniques for performance enhancement.

Methodological Insights

Model Design

FB-OCC integrates forward and backward projection into a single framework so that each compensates for the other's weaknesses. The method first uses forward projection to derive an initial voxel representation, then uses backward projection to refine that representation with BEV features. The combined voxel and BEV features form the 3D representation used for occupancy prediction.
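To make the joint voxel-BEV idea concrete, the sketch below broadcasts a BEV feature map over the height axis and adds it to a voxel grid. It is only an illustration under stated assumptions: the function names, the tensor layout (C, Z, Y, X), the mean-pooling stand-in for the BEV branch, and fusion by addition are all hypothetical choices, not the modules used in the paper.

```python
import numpy as np

def collapse_height(voxel_feat: np.ndarray) -> np.ndarray:
    """Pool a (C, Z, Y, X) voxel grid over height to get a (C, Y, X) BEV map.

    Mean pooling is an assumption made for this sketch only.
    """
    return voxel_feat.mean(axis=1)

def fuse_voxel_and_bev(voxel_feat: np.ndarray, bev_feat: np.ndarray) -> np.ndarray:
    """Broadcast a BEV map over the height axis and add it to the voxel grid."""
    same_channels = voxel_feat.shape[0] == bev_feat.shape[0]
    same_ground_plane = voxel_feat.shape[2:] == bev_feat.shape[1:]
    if not (same_channels and same_ground_plane):
        raise ValueError("channel and ground-plane dimensions must match")
    # bev_feat[:, None] has shape (C, 1, Y, X) and broadcasts over Z.
    return voxel_feat + bev_feat[:, None, :, :]

rng = np.random.default_rng(0)
voxel = rng.standard_normal((8, 4, 6, 6))   # stand-in for forward-projected features
bev = collapse_height(voxel)                # stand-in for refined BEV features
fused = fuse_voxel_and_bev(voxel, bev)
print(fused.shape)
```

In a real model both branches would be learned; the point here is only that a BEV map has no height axis, so it must be broadcast (or otherwise lifted) before it can be combined with voxel features.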

Model Scaling and Pre-Training

To scale up, FB-OCC employs the InternImage-H backbone, which has about one billion parameters, and relies on pre-training on large datasets such as Objects365. The backbone is further pre-trained jointly on depth estimation and semantic segmentation, which strengthens both its semantic perception and its geometric awareness.
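One common way to train on depth and semantics together is to sum two per-pixel losses. The sketch below shows that pattern for a single pixel, assuming depth is discretised into bins and both heads output probability vectors. The function names, the loss weights, and the use of cross-entropy for both terms are assumptions for illustration; the summary does not specify the paper's exact loss.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under a probability vector."""
    return -math.log(max(probs[target], 1e-12))  # clamp to avoid log(0)

def joint_pretrain_loss(depth_probs, depth_bin, sem_probs, sem_class,
                        depth_weight=1.0, sem_weight=1.0):
    """Weighted sum of a depth-bin loss and a semantic-class loss for one pixel."""
    depth_term = cross_entropy(depth_probs, depth_bin)
    sem_term = cross_entropy(sem_probs, sem_class)
    return depth_weight * depth_term + sem_weight * sem_term

# One pixel: 3 depth bins (true bin 1) and 2 semantic classes (true class 1).
loss = joint_pretrain_loss([0.1, 0.7, 0.2], 1, [0.25, 0.75], 1)
print(round(loss, 4))
```

Because both terms share the same backbone features, gradients from the depth term push the features toward geometry while the semantic term pushes them toward class information.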

Post-Processing Techniques

Test-time augmentation and model ensembling make up the post-processing phase. Averaging predictions over augmented inputs and combining several models helps counter the drop in accuracy at longer distances and raises the final mIoU.
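The averaging described above can be sketched in a simplified single-image, 2D setting. Here `toy_predict` is a hypothetical stand-in for a model, horizontal flipping is the only augmentation, and the ensemble weights are arbitrary; the multi-camera, 3D setup in the paper is more involved.

```python
import numpy as np

def toy_predict(image: np.ndarray) -> np.ndarray:
    """Hypothetical model: per-pixel probabilities for 2 classes, shape (2, H, W)."""
    p = 1.0 / (1.0 + np.exp(-image))
    return np.stack([p, 1.0 - p])

def flip_tta(predict, image: np.ndarray) -> np.ndarray:
    """Average predictions for the original and a horizontally flipped input."""
    plain = predict(image)
    # Flip the input, predict, then flip the output back so pixels line up.
    flipped_back = predict(image[..., ::-1])[..., ::-1]
    return (plain + flipped_back) / 2.0

def ensemble(prob_maps, weights=None) -> np.ndarray:
    """Weighted average of probability maps from several models."""
    return np.average(np.stack(prob_maps), axis=0, weights=weights)

image = np.linspace(-2.0, 2.0, 6).reshape(2, 3)
tta_probs = flip_tta(toy_predict, image)
second_model = np.full_like(tta_probs, 0.5)   # an uninformative second "model"
combined = ensemble([tta_probs, second_model], weights=[3.0, 1.0])
labels = combined.argmax(axis=0)              # final per-pixel class
print(labels)
```

Averaging probabilities rather than hard labels keeps each model's confidence in play, so a confident model can outvote an uncertain one.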

Experimental Outcomes

The method is evaluated on the nuScenes dataset. The proposed FB-OCC model achieved a leading mIoU score of 54.19%, outperforming existing models and securing the top position in the 3D Occupancy Prediction Challenge.
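For readers unfamiliar with the metric, mIoU (mean intersection over union) averages the per-class overlap between predicted and ground-truth labels. The sketch below computes it over flat label lists. Skipping classes that appear in neither list is a choice made for this sketch, and the challenge's official evaluation may differ in details such as which voxels are counted.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0

# Six voxels, three classes. Per-class IoU is 1/3, 2/3, and 1/2.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)  # 0.5
```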

Implications and Future Work

While the FB-OCC method illustrates the potential for enhanced AV perception, this work invites further exploration into scalable models that maintain efficiency without compromising on detail. The findings also underscore the growing importance of integrating large-scale 2D pre-training with 3D tasks, suggesting avenues for further advancements in semantic understanding and geometry consistency in AV systems.

Future developments could focus on refining model interpretation in complex scenarios and minimizing computational demands through optimized frameworks, potentially integrating multi-sensor data for enriched spatial understanding.

In conclusion, FB-OCC shows that forward-backward view transformation combined with extensive pre-training improves camera-based 3D occupancy prediction, and it offers a useful reference point for researchers and practitioners working on autonomous driving perception.

GitHub

NVlabs/FB-BEV: Official PyTorch implementation of FB-BEV & FB-OCC - Forward-backward view transformation for vision-centric autonomous driving perception