ShapeLLM: Universal 3D Object Understanding for Embodied Interaction (2402.17766v3)

Published 27 Feb 2024 in cs.CV

Abstract: This paper presents ShapeLLM, the first 3D multimodal LLM designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built on an improved 3D encoder, ReCon++, which extends ReCon with multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point-cloud input encoder for the LLM, ShapeLLM is trained on constructed instruction-following data and tested on 3D MM-Vet, a newly human-curated benchmark. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and in language-unified 3D interaction tasks such as embodied visual grounding. Project page: https://qizekun.github.io/shapeLLM/
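
The abstract describes a pipeline in which a 3D point-cloud encoder (ReCon++) produces tokens that are projected into an LLM's embedding space and consumed alongside language. The following is a minimal sketch of that general multimodal-LLM pattern; the toy attention-pooling encoder, module names, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a ShapeLLM-style pipeline: a point-cloud encoder (ReCon++ in
# the paper; a stand-in here) yields tokens, a linear projector maps them
# into the LLM embedding space, and they are prepended to text embeddings.
# All names and sizes below are assumptions for illustration only.
import torch
import torch.nn as nn


class ToyPointEncoder(nn.Module):
    """Stand-in for ReCon++: maps N xyz points to a fixed set of tokens."""

    def __init__(self, num_tokens: int = 32, dim: int = 256):
        super().__init__()
        # Per-point MLP lifts raw xyz coordinates to feature space.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        # Learned queries attention-pool the point features into tokens.
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, points: torch.Tensor) -> torch.Tensor:  # (B, N, 3)
        feats = self.point_mlp(points)                  # (B, N, dim)
        q = self.queries.expand(points.size(0), -1, -1)  # (B, T, dim)
        tokens, _ = self.attn(q, feats, feats)           # (B, T, dim)
        return tokens


class PointCloudLLMStub(nn.Module):
    """Aligns 3D tokens with a (stub) LLM's embedding space."""

    def __init__(self, llm_dim: int = 512):
        super().__init__()
        self.encoder = ToyPointEncoder()
        self.projector = nn.Linear(256, llm_dim)  # 3D tokens -> LLM dim

    def forward(self, points: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        point_tokens = self.projector(self.encoder(points))  # (B, T, llm_dim)
        # Prepend point tokens to the instruction's text embeddings, as
        # multimodal LLMs commonly do before the transformer decoder.
        return torch.cat([point_tokens, text_embeds], dim=1)


if __name__ == "__main__":
    model = PointCloudLLMStub()
    pts = torch.randn(2, 1024, 3)   # two point clouds, 1024 points each
    txt = torch.randn(2, 16, 512)   # pretend instruction embeddings
    print(model(pts, txt).shape)    # torch.Size([2, 48, 512])
</parameter>```

Running the stub prints a sequence of 48 tokens per sample: the 32 pooled point tokens prepended to the 16 text embeddings, which is the shape of input a decoder-only LLM would then process.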

Authors (8)
  1. Zekun Qi (10 papers)
  2. Runpei Dong (21 papers)
  3. Shaochen Zhang (4 papers)
  4. Haoran Geng (30 papers)
  5. Chunrui Han (21 papers)
  6. Zheng Ge (60 papers)
  7. Li Yi (111 papers)
  8. Kaisheng Ma (46 papers)
Citations (27)
