3D-GRES: Generalized 3D Referring Expression Segmentation (2407.20664v2)
Abstract: 3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description. However, current approaches are limited to segmenting a single target, restricting the versatility of the task. To overcome this limitation, we introduce Generalized 3D Referring Expression Segmentation (3D-GRES), which extends the capability to segment any number of instances based on natural language instructions. In addressing this broader task, we propose the Multi-Query Decoupled Interaction Network (MDIN), designed to break down multi-object segmentation tasks into simpler, individual segmentations. MDIN comprises two fundamental components: Text-driven Sparse Queries (TSQ) and Multi-object Decoupling Optimization (MDO). TSQ generates sparse point cloud features distributed over key targets as the initialization for queries. Meanwhile, MDO is tasked with assigning each target in multi-object scenarios to different queries while maintaining their semantic consistency. To adapt to this new task, we build a new dataset, namely Multi3DRes. Our comprehensive evaluations on this dataset demonstrate substantial enhancements over existing models, thus charting a new path for intricate multi-object 3D scene comprehension. The benchmark and code are available at https://github.com/sosppxo/MDIN.
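The idea of breaking multi-object segmentation into simpler, individual segmentations can be illustrated with a small sketch. It assumes each query yields one per-point mask and one confidence score, and that the final prediction is the union of the masks from confident queries. The function name `merge_query_masks` and both thresholds are hypothetical choices for illustration. This is not the MDIN implementation, which is available in the linked repository.

```python
from typing import List, Sequence


def merge_query_masks(
    query_masks: Sequence[Sequence[float]],
    query_scores: Sequence[float],
    score_thresh: float = 0.5,  # assumed confidence cut-off, not from the paper
    mask_thresh: float = 0.5,   # assumed per-point probability cut-off
) -> List[bool]:
    """Union the per-point masks of every query whose score passes score_thresh.

    query_masks[q][i] is the probability that point i belongs to the instance
    predicted by query q. The result marks each point covered by at least one
    confident query, so it can describe zero, one, or many target instances.
    """
    if len(query_masks) != len(query_scores):
        raise ValueError("query_masks and query_scores must have the same length")

    num_points = len(query_masks[0]) if query_masks else 0
    merged = [False] * num_points

    for mask, score in zip(query_masks, query_scores):
        if score < score_thresh:
            continue  # this query is treated as matching no target
        for i, prob in enumerate(mask):
            if prob >= mask_thresh:
                merged[i] = True
    return merged


# Toy scene with six points and three queries.
masks = [
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.1],  # query 0 covers points 0 and 1
    [0.0, 0.1, 0.2, 0.9, 0.7, 0.0],  # query 1 covers points 3 and 4
    [0.6, 0.6, 0.6, 0.6, 0.6, 0.6],  # query 2 has low confidence below
]
scores = [0.9, 0.8, 0.2]

multi_target = merge_query_masks(masks, scores)
zero_target = merge_query_masks(masks, [0.1, 0.2, 0.3])
```

With these toy inputs, `multi_target` marks points 0, 1, 3, and 4, while `zero_target` is empty because no query is confident. The empty result corresponds to an expression that matches nothing in the scene, a case that single-target 3D-RES cannot express.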