POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images (2401.09413v1)
Abstract: We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images, with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, where obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D grounded language embeddings enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities: (i) images, (ii) language and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any manual 3D language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: zero-shot 3D semantic segmentation using existing datasets, and 3D grounding and retrieval of free-form language queries using a small dataset that we propose as an extension of nuScenes. The project page is available at https://vobecant.github.io/POP3D.
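To illustrate how a dense voxel map of language embeddings can support zero-shot segmentation, here is a minimal sketch, not the authors' implementation. It assumes each voxel carries an occupancy flag and an embedding vector, and that each class name has a text embedding in the same space. The toy 3-dimensional vectors and the names `cosine` and `zero_shot_labels` are hypothetical; in practice the embeddings would come from the model heads and a pre-trained vision-language text encoder, whose details the abstract does not give.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def zero_shot_labels(voxel_embeddings, occupied, text_embeddings):
    """Label each occupied voxel with the most similar text query.

    Free (unoccupied) voxels get None. All inputs are illustrative
    stand-ins for the outputs of the occupancy and 3D-language heads.
    """
    labels = []
    for embedding, is_occupied in zip(voxel_embeddings, occupied):
        if not is_occupied:
            labels.append(None)
            continue
        scores = {name: cosine(embedding, text)
                  for name, text in text_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels

# Toy data (assumed): two class queries and three voxels.
text_embeddings = {"car": [1.0, 0.0, 0.0], "tree": [0.0, 1.0, 0.0]}
voxel_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.5, 0.5, 0.0]]
occupied = [True, True, False]

labels = zero_shot_labels(voxel_embeddings, occupied, text_embeddings)
print(labels)  # ['car', 'tree', None]
```

The same similarity scores could be ranked instead of arg-maxed to retrieve the voxels that best match a single free-form query.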