OccFeat: Self-supervised Occupancy Feature Prediction for Pretraining BEV Segmentation Networks (2404.14027v3)
Abstract: We introduce a self-supervised pretraining method, called OccFeat, for camera-only Bird's-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction provides a 3D geometric understanding of the scene to the model. However, the geometry learned is class-agnostic. Hence, we add semantic information to the model in the 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach. Repository: https://github.com/valeoai/Occfeat
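As an illustration of how the two pretraining tasks named in the abstract could be combined into one objective, here is a minimal sketch in plain Python. It is not the authors' implementation. It makes several assumptions: voxels are flattened into lists, binary occupancy targets are already available, the image foundation model's features have already been assigned to voxels, distillation is a cosine distance restricted to occupied voxels, and the two terms are summed with a hypothetical `weight`. All function and parameter names are hypothetical.

```python
import math


def occupancy_bce(logits, occupied):
    """Mean binary cross-entropy between per-voxel logits and 0/1 targets."""
    total = 0.0
    for x, y in zip(logits, occupied):
        # Numerically stable BCE-with-logits: max(x, 0) - x*y + log(1 + exp(-|x|))
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def feature_distillation(pred_feats, target_feats, occupied):
    """Mean cosine distance over occupied voxels only (assumed design choice)."""
    dists = [
        cosine_distance(p, t)
        for p, t, o in zip(pred_feats, target_feats, occupied)
        if o
    ]
    return sum(dists) / len(dists) if dists else 0.0


def pretraining_loss(logits, pred_feats, occupied, target_feats, weight=1.0):
    """Occupancy term plus weighted distillation term (hypothetical weighting)."""
    return occupancy_bce(logits, occupied) + weight * feature_distillation(
        pred_feats, target_feats, occupied
    )


# Two voxels: the first is occupied, the second is empty.
loss = pretraining_loss(
    logits=[0.0, 0.0],
    pred_feats=[[1.0, 0.0], [0.0, 1.0]],
    occupied=[1, 0],
    target_feats=[[1.0, 0.0], [1.0, 0.0]],
)
```

In this toy call the predicted feature of the only occupied voxel matches its target, so the distillation term is close to zero and the loss is dominated by the occupancy term. The feature mismatch on the empty voxel is ignored because it is masked out.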