TokenPacker: Efficient Visual Projector for Multimodal LLM (2407.02392v4)
Abstract: The visual projector serves as an essential bridge between the visual encoder and the LLM in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via a one-to-one transformation. However, the visual tokens are redundant and grow considerably when dealing with high-resolution images, significantly impairing the efficiency of MLLMs. Some recent works have introduced a resampler or abstractor to reduce the number of resulting visual tokens; unfortunately, these fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector that adopts a coarse-to-fine scheme to inject enriched characteristics and generate condensed visual tokens. Specifically, we first interpolate the visual features into a low-resolution point query, providing the overall visual representation as the foundation. We then introduce a region-to-point injection module that uses high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step effectively updates the coarse point query, transforming it into an enriched one for subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75% to 89% while achieving comparable or even better performance across diverse benchmarks, with significantly higher efficiency. The source code can be found at https://github.com/CircleRadon/TokenPacker.
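As a rough illustration of the coarse-to-fine scheme described above, the PyTorch sketch below downsamples the visual-encoder grid into coarse point queries and lets each query cross-attend to the high-resolution tokens inside its local region before projecting into the LLM space. The class, argument, and dimension names (`TokenPackerSketch`, `down_ratio`, `dim=1024`) are illustrative assumptions, and only a single feature level is used here for brevity, whereas the paper injects multi-level region cues; the authors' reference implementation is in the linked repository.

```python
# Minimal sketch of a coarse-to-fine visual projector (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenPackerSketch(nn.Module):
    """Compress visual tokens: coarse point queries refined by
    region-to-point cross-attention over high-resolution local cues."""

    def __init__(self, dim=1024, num_heads=8, down_ratio=2):
        super().__init__()
        self.down_ratio = down_ratio  # 2 -> 4x fewer tokens (75% compression)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # projection toward the LLM embedding space (dim assumed)

    def forward(self, feats):
        # feats: (B, H, W, C) grid features from the visual encoder
        B, H, W, C = feats.shape
        r = self.down_ratio
        h, w = H // r, W // r

        # 1) Coarse stage: bilinear interpolation yields a low-resolution point query.
        coarse = F.interpolate(
            feats.permute(0, 3, 1, 2), size=(h, w),
            mode="bilinear", align_corners=False,
        ).permute(0, 2, 3, 1)                      # (B, h, w, C)
        query = coarse.reshape(B * h * w, 1, C)    # one query per coarse point

        # 2) Fine stage: each coarse point attends to the high-resolution tokens
        #    of its corresponding local region (keys/values), injecting detail.
        regions = feats.reshape(B, h, r, w, r, C).permute(0, 1, 3, 2, 4, 5)
        regions = regions.reshape(B * h * w, r * r, C)
        updated, _ = self.attn(query, regions, regions)  # region-to-point injection

        # 3) Project the enriched, compressed tokens for the LLM.
        return self.proj(updated.reshape(B, h * w, C))   # (B, h*w, C)


# Usage: a 24x24 feature grid (e.g. CLIP ViT-L/14 at 336 px) becomes 144 tokens,
# i.e. a 75% reduction, matching the lower end of the compression range above.
x = torch.randn(2, 24, 24, 1024)
tokens = TokenPackerSketch()(x)   # -> (2, 144, 1024)
```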