ShapeLLM: Universal 3D Object Understanding for Embodied Interaction (2402.17766v3)
Abstract: This paper presents ShapeLLM, the first 3D multimodal LLM designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built upon an improved 3D encoder, ReCon++, which extends ReCon with multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point cloud encoder for the LLM, ShapeLLM is trained on constructed instruction-following data and evaluated on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapeLLM/
Authors: Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma