ZeroPS: High-quality Cross-modal Knowledge Transfer for Zero-Shot 3D Part Segmentation (2311.14262v4)
Abstract: Zero-shot 3D part segmentation is a challenging and fundamental task. In this work, we propose a novel pipeline, ZeroPS, which achieves high-quality knowledge transfer from 2D pretrained foundation models (FMs), SAM and GLIP, to 3D object point clouds. We explore the natural relationship between multi-view correspondence and the FMs' prompt mechanisms and build bridges on it. In ZeroPS, the relationship manifests as follows: 1) lifting 2D to 3D by leveraging co-viewed regions and SAM's prompt mechanism, 2) relating 1D classes to 3D parts by leveraging 2D-3D view projection and GLIP's prompt mechanism, and 3) enhancing prediction performance by leveraging multi-view observations. Extensive evaluations on the PartNetE and AKBSeg benchmarks demonstrate that ZeroPS significantly outperforms the SOTA method across zero-shot unlabeled and instance segmentation tasks. ZeroPS requires no additional training or fine-tuning of the FMs, applies to both simulated and real-world data, and is hardly affected by domain shift. The project page is available at https://luis2088.github.io/ZeroPS_page.
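The abstract names 2D-3D view projection and multi-view observations but gives no algorithmic detail, so the following is only a generic illustration of that idea, not the ZeroPS procedure. It is a minimal sketch that assumes points normalized to the unit cube, axis-aligned orthographic views, and per-view 2D part masks that are already available (for example from a 2D segmenter). The names `project_orthographic` and `vote_labels` are hypothetical, and occlusion is deliberately ignored.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Mask = List[List[int]]  # mask[row][col] = part id, -1 = background/unlabeled


def project_orthographic(p: Point, drop_axis: int, size: int) -> Optional[Tuple[int, int]]:
    """Map a point in [0, 1)^3 to a (row, col) pixel by dropping one axis.

    Assumption: an axis-aligned orthographic camera; a real pipeline would
    use calibrated camera poses and a visibility (occlusion) test.
    """
    u, v = [c for i, c in enumerate(p) if i != drop_axis]
    row, col = math.floor(v * size), math.floor(u * size)
    if 0 <= row < size and 0 <= col < size:
        return row, col
    return None  # point falls outside this view's image


def vote_labels(points: Sequence[Point],
                views: Sequence[Tuple[int, Mask]],
                size: int) -> List[int]:
    """Assign each 3D point the part id that most views agree on.

    Each view is (drop_axis, mask). Points that receive no labeled
    observation in any view get -1.
    """
    labels = []
    for p in points:
        votes: Counter = Counter()
        for drop_axis, mask in views:
            pix = project_orthographic(p, drop_axis, size)
            if pix is None:
                continue
            label = mask[pix[0]][pix[1]]
            if label >= 0:  # skip background pixels
                votes[label] += 1
        labels.append(votes.most_common(1)[0][0] if votes else -1)
    return labels


if __name__ == "__main__":
    pts = [(0.25, 0.25, 0.25), (0.75, 0.75, 0.75)]
    views = [
        (2, [[0, 0], [0, 1]]),    # view along z
        (0, [[0, -1], [-1, 1]]),  # view along x
        (1, [[1, 0], [0, 0]]),    # view along y, deliberately noisy
    ]
    print(vote_labels(pts, views, size=2))
```

In this toy setup the third view mislabels both points, and the majority vote over the three views overrides it, which is the sense in which multiple observations can stabilize a per-point prediction.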