3D-GRES: Generalized 3D Referring Expression Segmentation (2407.20664v2)

Published 30 Jul 2024 in cs.CV

Abstract: 3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description. However, current approaches are limited to segmenting a single target, restricting the versatility of the task. To overcome this limitation, we introduce Generalized 3D Referring Expression Segmentation (3D-GRES), which extends the capability to segment any number of instances based on natural language instructions. In addressing this broader task, we propose the Multi-Query Decoupled Interaction Network (MDIN), designed to break down multi-object segmentation tasks into simpler, individual segmentations. MDIN comprises two fundamental components: Text-driven Sparse Queries (TSQ) and Multi-object Decoupling Optimization (MDO). TSQ generates sparse point cloud features distributed over key targets as the initialization for queries. Meanwhile, MDO is tasked with assigning each target in multi-object scenarios to different queries while maintaining their semantic consistency. To adapt to this new task, we build a new dataset, namely Multi3DRes. Our comprehensive evaluations on this dataset demonstrate substantial enhancements over existing models, thus charting a new path for intricate multi-object 3D scene comprehension. The benchmark and code are available at https://github.com/sosppxo/MDIN.
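The abstract describes TSQ only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes that "text-driven" initialization can be approximated by ranking per-point features by cosine similarity to a sentence embedding and keeping the top k as initial queries. The function names `cosine` and `select_sparse_queries`, the toy 2-D features, and the top-k selection rule are all hypothetical and do not come from the paper or the MDIN repository.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_sparse_queries(
    point_feats: Sequence[Sequence[float]],
    text_feat: Sequence[float],
    k: int,
) -> List[int]:
    """Return indices of the k point features most similar to the text feature.

    Hypothetical stand-in for text-driven query initialization: the selected
    features would seed the decoder queries. The paper's actual sampling
    rule may differ.
    """
    scores = [cosine(f, text_feat) for f in point_feats]
    ranked = sorted(range(len(point_feats)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: four "points" with 2-D features and one text embedding.
point_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text_feat = [1.0, 0.0]

idx = select_sparse_queries(point_feats, text_feat, k=2)
queries = [point_feats[i] for i in idx]  # initial query features
print(idx)  # indices of the selected points
```

In this toy setup the two points whose features align with the text embedding are selected, which mirrors the stated goal of concentrating queries on likely targets instead of spreading them uniformly over the scene.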
