3D Unsupervised Learning by Distilling 2D Open-Vocabulary Segmentation Models for Autonomous Driving (2405.15286v2)
Abstract: Labeling point cloud data is a time-consuming and expensive task in autonomous driving, whereas unsupervised learning can avoid it by learning point cloud representations from unannotated data. In this paper, we propose UOV, a novel 3D Unsupervised framework assisted by 2D Open-Vocabulary segmentation models. It consists of two stages: in the first stage, we integrate the high-quality textual and image features of 2D open-vocabulary models and propose Tri-Modal contrastive Pre-training (TMP); in the second stage, spatial mapping between point clouds and images is utilized to generate pseudo-labels, enabling cross-modal knowledge distillation. In addition, we introduce Approximate Flat Interaction (AFI) to address noise during alignment and label confusion. To validate the superiority of UOV, we conduct extensive experiments on multiple related datasets. UOV achieves a record-breaking 47.73% mIoU on the annotation-free point cloud segmentation task on nuScenes, surpassing the previous best model by 10.70% mIoU. Meanwhile, fine-tuning with 1% of the data on nuScenes and SemanticKITTI reaches a remarkable 51.75% mIoU and 48.14% mIoU respectively, outperforming all previous pre-trained models.
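To make the two stages concrete, below is a minimal sketch of the core mechanics the abstract describes: projecting LiDAR points into the camera image so each point can be paired with a 2D feature and an open-vocabulary pseudo-label, and an InfoNCE-style tri-modal (point/image/text) contrastive loss. All function names, tensor shapes, and the projection convention are assumptions for illustration, not the authors' exact implementation.

```python
# Hypothetical sketch of point-to-pixel mapping and tri-modal contrastive
# alignment, in the spirit of UOV's two stages. Not the authors' code.
import torch
import torch.nn.functional as F


def project_points_to_image(points_lidar, lidar_to_cam, cam_intrinsics, img_hw):
    """Project (N, 3) LiDAR points into pixel coordinates.

    Returns integer (u, v) coordinates and a validity mask selecting points
    with positive depth that land inside the image bounds.
    """
    n = points_lidar.shape[0]
    pts_h = torch.cat([points_lidar, points_lidar.new_ones(n, 1)], dim=1)  # (N, 4)
    pts_cam = (lidar_to_cam @ pts_h.T).T[:, :3]                            # (N, 3)
    depth = pts_cam[:, 2]
    uv = (cam_intrinsics @ pts_cam.T).T[:, :2] / depth.clamp(min=1e-6).unsqueeze(1)
    h, w = img_hw
    valid = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                        & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv.long(), valid


def tri_modal_contrastive_loss(point_feats, pixel_feats, text_feats, labels, tau=0.07):
    """InfoNCE-style alignment of point features with (a) their paired pixel
    features and (b) the text embedding of their pseudo-class.

    point_feats: (M, D) features of the points that project into the image.
    pixel_feats: (M, D) 2D features gathered at the projected pixel locations.
    text_feats:  (C, D) CLIP-style text embeddings, one per class prompt.
    labels:      (M,)   pseudo-labels taken from the 2D segmenter's output.
    """
    p = F.normalize(point_feats, dim=1)
    i = F.normalize(pixel_feats, dim=1)
    t = F.normalize(text_feats, dim=1)

    # Point-to-pixel term: each point matches its own pixel among all pixels.
    logits_pi = p @ i.T / tau                                   # (M, M)
    targets = torch.arange(p.shape[0], device=p.device)
    loss_pi = F.cross_entropy(logits_pi, targets)

    # Point-to-text term: each point matches its pseudo-class prompt.
    logits_pt = p @ t.T / tau                                   # (M, C)
    loss_pt = F.cross_entropy(logits_pt, labels)

    return loss_pi + loss_pt
```

In the second stage, the same projection can pair every visible point with a pseudo-label from the 2D open-vocabulary segmenter, which then serves as a cross-entropy target for distilling knowledge into the 3D backbone.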
Authors: Boyi Sun, Yuhang Liu, Xingxia Wang, Bin Tian, Long Chen, Fei-Yue Wang