Honeybee: Locality-enhanced Projector for Multimodal LLM (2312.06742v2)
Abstract: In Multimodal LLMs (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual understanding while harnessing the LLMs' robust capabilities. Despite its importance, the visual projector has received relatively little attention. In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, crucial for MLLMs' overall efficiency, and (ii) preservation of local context from visual features, vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying the two desirable properties. Additionally, we present comprehensive strategies to effectively utilize multiple and multifaceted instruction datasets. Through extensive experiments, we examine the impact of individual design choices. Finally, our proposed MLLM, Honeybee, remarkably outperforms previous state-of-the-art methods across various benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench, while achieving significantly higher efficiency. Code and models are available at https://github.com/kakaobrain/honeybee.
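To make the two projector properties concrete, the PyTorch sketch below shows one way a projector can be both locality-preserving and flexible in its output token count: convolution mixes each visual token with its spatial neighbors, adaptive average pooling reduces the tokens to an arbitrary target number M, and a linear layer matches the LLM's hidden size. This is a hedged illustration, not the authors' exact module (see the linked repository for the official implementation); the class name, dimensions, and block structure are assumptions chosen for clarity.

```python
# Illustrative sketch only: a flexible, locality-enhanced visual projector.
# All names and hyperparameters here are assumptions, not the paper's code.
import math
import torch
import torch.nn as nn


class LocalityEnhancedProjector(nn.Module):
    """Maps N visual tokens (e.g., 24x24 ViT patches) to M tokens for the LLM."""

    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=144, depth=3):
        super().__init__()
        # Convolutional blocks preserve local context: each token is mixed
        # with its spatial neighbors before any token reduction happens.
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1, groups=vis_dim),
                nn.GELU(),
                nn.Conv2d(vis_dim, vis_dim, kernel_size=1),
            )
            for _ in range(depth)
        ])
        # Adaptive pooling gives flexibility: any target token count M works.
        out_hw = int(math.isqrt(num_queries))
        self.pool = nn.AdaptiveAvgPool2d(out_hw)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, x):  # x: (B, N, vis_dim), with N a perfect square
        b, n, c = x.shape
        hw = int(math.isqrt(n))
        x = x.transpose(1, 2).reshape(b, c, hw, hw)   # tokens -> 2-D feature map
        x = self.blocks(x) + x                        # local mixing (residual)
        x = self.pool(x)                              # N -> M tokens
        x = x.flatten(2).transpose(1, 2)              # back to (B, M, vis_dim)
        return self.proj(x)                           # align with LLM width


if __name__ == "__main__":
    feats = torch.randn(2, 576, 1024)                 # e.g., 24x24 CLIP patches
    tokens = LocalityEnhancedProjector()(feats)
    print(tokens.shape)                               # torch.Size([2, 144, 4096])
```

The design choice worth noting is that locality is handled before compression: because pooling averages spatially adjacent tokens, each output token summarizes a coherent image region rather than an arbitrary mixture, which is the property the abstract identifies as vital for spatial understanding.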
- Flamingo: a Visual Language Model for Few-Shot Learning. In NeurIPS, 2022.
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966, 2023a.
- TouchStone: Evaluating Vision-Language Models by Language Models. arXiv preprint arXiv:2308.16890, 2023b.
- Language Models are Few-Shot Learners. In NeurIPS, 2020.
- COYO-700M: Image-Text Pair Dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
- MiniGPT-v2: Large Language Model as A Unified Interface for Vision-Language Multi-task Learning. arXiv preprint arXiv:2310.09478, 2023a.
- Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. arXiv preprint arXiv:2306.15195, 2023b.
- UNITER: Universal Image-Text Representation Learning. In ECCV, 2020.
- Can Large Language Models Be an Alternative to Human Evaluations? arXiv preprint arXiv:2305.01937, 2023.
- Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, 2023.
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv preprint arXiv:2305.06500, 2023.
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691, 2023.
- Pengi: An Audio Language Model for Audio Tasks. arXiv preprint arXiv:2305.11834, 2023.
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394, 2023.
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arXiv preprint arXiv:2304.15010, 2023.
- Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In CVPR, 2017.
- 3D-LLM: Injecting the 3D World into Large Language Models. arXiv preprint arXiv:2307.12981, 2023.
- LoRA: Low-Rank Adaptation of Large Language Models. In ICLR, 2022.
- Squeeze-and-Excitation Networks. In CVPR, 2018.
- GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In CVPR, 2019.
- ReferItGame: Referring to Objects in Photographs of Natural Scenes. In EMNLP, 2014.
- Large Language Models are Temporal and Causal Reasoners for Video Question Answering. In EMNLP, 2023.
- Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV, 2017.
- Convolutional Networks for Images, Speech, and Time Series. The Handbook of Brain Theory and Neural Networks, 1995.
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. arXiv preprint arXiv:2307.16125, 2023a.
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML, 2022.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In ICML, 2023b.
- VideoChat: Chat-Centric Video Understanding. arXiv preprint arXiv:2305.06355, 2023c.
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation. arXiv preprint arXiv:2106.04632, 2021.
- Visual Spatial Reasoning. Transactions of the Association for Computational Linguistics, 2023a.
- Aligning Large Multi-Modal Model with Robust Instruction Tuning. arXiv preprint arXiv:2306.14565, 2023b.
- Improved Baselines with Visual Instruction Tuning. arXiv preprint arXiv:2310.03744, 2023c.
- Visual Instruction Tuning. In NeurIPS, 2023d.
- MMBench: Is Your Multi-modal Model an All-around Player? arXiv preprint arXiv:2307.06281, 2023e.
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. arXiv preprint arXiv:2303.16634, 2023f.
- A ConvNet for the 2020s. In CVPR, 2022.
- The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. In ICML, 2023.
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In NeurIPS, 2022.
- An Empirical Study of Scaling Instruct-tuned Large Multimodal Models. arXiv preprint arXiv:2309.09958, 2023.
- Generation and Comprehension of Unambiguous Object Descriptions. In CVPR, 2016.
- OCR-VQA: Visual Question Answering by Reading Text in Images. In ICDAR, 2019.
- OpenAI. ChatGPT, 2023a.
- OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023b.
- Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv preprint arXiv:2306.14824, 2023.
- Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In SC, 2020.
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. In ECCV, 2022.
- Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
- Finetuned Language Models are Zero-Shot Learners. In ICLR, 2022.
- Aggregated Residual Transformations for Deep Neural Networks. In CVPR, 2017.
- PointLLM: Empowering Large Language Models to Understand Point Clouds. arXiv preprint arXiv:2308.16911, 2023.
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs. arXiv preprint arXiv:2310.00582, 2023.
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv preprint arXiv:2304.14178, 2023.
- Ferret: Refer and Ground Anything Anywhere at Any Granularity. arXiv preprint arXiv:2310.07704, 2023.
- CoCa: Contrastive Captioners are Image-Text Foundation Models. Transactions on Machine Learning Research, 2022.
- Modeling Context in Referring Expressions. In ECCV, 2016.
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv preprint arXiv:2308.02490, 2023.
- Florence: A New Foundation Model for Computer Vision. arXiv preprint arXiv:2111.11432, 2021.
- What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? arXiv preprint arXiv:2307.02469, 2023.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. arXiv preprint arXiv:2303.16199, 2023a.
- LLaVAR: Enhanced Visual Instruction Tuning for Text-rich Image Understanding. arXiv preprint arXiv:2306.17107, 2023b.
- Multimodal Chain-of-Thought Reasoning in Language Models. arXiv preprint arXiv:2302.00923, 2023c.
- SVIT: Scaling up Visual Instruction Tuning. arXiv preprint arXiv:2307.04087, 2023.
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592, 2023.
- Deformable DETR: Deformable Transformers for End-to-End Object Detection. In ICLR, 2021.
Authors: Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh