
OccFeat: Self-supervised Occupancy Feature Prediction for Pretraining BEV Segmentation Networks (2404.14027v3)

Published 22 Apr 2024 in cs.CV and cs.LG

Abstract: We introduce a self-supervised pretraining method, called OccFeat, for camera-only Bird's-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction provides a 3D geometric understanding of the scene to the model. However, the geometry learned is class-agnostic. Hence, we add semantic information to the model in the 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach. Repository: https://github.com/valeoai/Occfeat
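
To make the two pretraining tasks concrete, the following is a minimal sketch of how an occupancy prediction term and a feature distillation term could be combined into one objective. It is not the authors' implementation: the abstract does not specify the loss functions, so binary cross-entropy for occupancy, cosine distance for distillation, restricting distillation to occupied voxels, and the `weight` parameter are all assumptions made for illustration, as are the function names.

```python
import math

def occupancy_loss(pred_probs, occupied):
    """Binary cross-entropy between predicted occupancy probabilities
    and 0/1 occupancy targets (assumed loss, one value per voxel)."""
    eps = 1e-7
    total = 0.0
    for p, y in zip(pred_probs, occupied):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(pred_probs)

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)

def distillation_loss(pred_feats, teacher_feats, occupied):
    """Mean cosine distance between predicted voxel features and
    teacher (image foundation model) features, computed over
    occupied voxels only (an assumption of this sketch)."""
    dists = [
        cosine_distance(f, t)
        for f, t, y in zip(pred_feats, teacher_feats, occupied)
        if y == 1
    ]
    return sum(dists) / len(dists) if dists else 0.0

def pretraining_loss(pred_probs, pred_feats, teacher_feats, occupied, weight=1.0):
    """Weighted sum of the geometric and semantic terms."""
    geometric = occupancy_loss(pred_probs, occupied)
    semantic = distillation_loss(pred_feats, teacher_feats, occupied)
    return geometric + weight * semantic
```

In this sketch the occupancy term carries the class-agnostic geometric signal, while the distillation term adds semantic information by pulling the predicted 3D features toward the teacher's features.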
