POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images (2401.09413v1)
Abstract: We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images, with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings, enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language and (iii) LiDAR point clouds. It enables training the proposed architecture using a strong pre-trained vision-language model without the need for any manual 3D language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: zero-shot 3D semantic segmentation using existing datasets, and 3D grounding and retrieval of free-form language queries using a small dataset that we propose as an extension of nuScenes. The project page is available at https://vobecant.github.io/POP3D.
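To make the idea of a "dense voxel map of 3D grounded language embeddings" concrete, the following toy sketch shows one way such a map could be queried for zero-shot segmentation: each voxel predicted as occupied receives the text query whose embedding is most similar to the voxel's embedding. This is an illustrative assumption, not the authors' implementation. The function names, the occupancy threshold of 0.5, and the hand-written 3-dimensional embeddings are all hypothetical; in practice the text embeddings would presumably come from the text encoder of the pre-trained vision-language model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def zero_shot_labels(voxel_embeddings, occupancy, text_embeddings, threshold=0.5):
    """Label each occupied voxel with its best-matching text query.

    voxel_embeddings: dict mapping (x, y, z) -> embedding (list of floats)
    occupancy:        dict mapping (x, y, z) -> predicted occupancy probability
    text_embeddings:  dict mapping query string -> embedding (list of floats)
    threshold:        hypothetical cut-off below which a voxel counts as free
    """
    labels = {}
    for voxel, emb in voxel_embeddings.items():
        if occupancy.get(voxel, 0.0) < threshold:
            continue  # voxel is treated as free space and gets no label
        labels[voxel] = max(
            text_embeddings, key=lambda q: cosine(emb, text_embeddings[q])
        )
    return labels


# Toy data: hand-written embeddings stand in for real model outputs.
text = {"car": [1.0, 0.0, 0.0], "road": [0.0, 1.0, 0.0]}
voxels = {
    (0, 0, 0): [0.9, 0.1, 0.0],
    (1, 0, 0): [0.1, 0.8, 0.1],
    (2, 0, 0): [0.7, 0.2, 0.1],
}
occ = {(0, 0, 0): 0.9, (1, 0, 0): 0.8, (2, 0, 0): 0.1}

labels = zero_shot_labels(voxels, occ, text)
print(labels)
```

Because the class names enter only as text embeddings, the same voxel map could in principle be queried with any free-form phrase without retraining, which is what makes the output open-vocabulary.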