Matryoshka Query Transformer for Large Vision-Language Models (2405.19315v2)
Abstract: Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a large language model (LLM). Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m <= M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and can then flexibly and drastically reduce the number of inference-time visual tokens while maintaining performance similar to or better than training independent models for each token count. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) sacrifices only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can go down to as few as 2 visual tokens with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and the computational cost governed by the number of visual tokens facilitates future research toward achieving the best of both worlds.
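To make the tail-truncation training concrete, here is a minimal PyTorch sketch of the idea described in the abstract: a query transformer holds M learnable latent queries, and each training step keeps only the first m of them (m drawn at random) when cross-attending to the visual embeddings. The class and parameter names (`QueryTransformer`, `num_latents`, the sampling schedule) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the Matryoshka Query Transformer training trick: at each step,
# keep only the first m of M latent query tokens and discard the rest.
import random
import torch
import torch.nn as nn


class QueryTransformer(nn.Module):
    """Compresses patch-level visual embeddings into at most M latent query tokens."""

    def __init__(self, dim: int, num_latents: int = 256, num_heads: int = 8):
        super().__init__()
        # M learnable latent queries; only a prefix of them is used per step.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_embeds: torch.Tensor, m: int) -> torch.Tensor:
        # Take only the first m latent queries and cross-attend to the visual embeddings.
        queries = self.latents[:m].unsqueeze(0).expand(visual_embeds.size(0), -1, -1)
        tokens, _ = self.cross_attn(queries, visual_embeds, visual_embeds)
        return tokens  # (batch, m, dim) visual tokens passed on to the LLM


# Training step: draw m <= M at random so one model learns every token budget.
M = 256
query_transformer = QueryTransformer(dim=1024, num_latents=M)
visual_embeds = torch.randn(2, 576, 1024)            # e.g. 576 CLIP patch embeddings
m = random.choice([2, 4, 8, 16, 32, 64, 128, 256])   # illustrative sampling schedule
visual_tokens = query_transformer(visual_embeds, m)
print(visual_tokens.shape)  # e.g. torch.Size([2, 16, 1024]) when m == 16
```

At inference, m is simply fixed to whatever budget the task or hardware allows, up to M, using the same trained weights.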
- Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, volume 35, pages 23716–23736. Curran Associates, Inc.
- Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint.
- Token merging: Your ViT but faster. In The Eleventh International Conference on Learning Representations.
- Making large multimodal models understand arbitrary visual prompts. In IEEE Conference on Computer Vision and Pattern Recognition.
- Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195.
- Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. arXiv preprint.
- MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886.
- MobileVLM V2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning.
- Adaptive token sampling for efficient vision transformers. In European Conference on Computer Vision.
- MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
- SparseFormer: Sparse visual recognition via limited latent tokens. In The Twelfth International Conference on Learning Representations.
- Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).
- VizWiz Grand Challenge: Answering visual questions from blind people. In CVPR.
- 3D-LLM: Injecting the 3D world into large language models. NeurIPS.
- BLIVA: A simple multimodal LLM for better handling of text-rich visual questions. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024.
- Language is not all you need: Aligning perception with language models.
- Drew A Hudson and Christopher D Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR.
- IDEFICS. 2023. Introducing IDEFICS: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics.
- Phi-2: The surprising power of small language models. Microsoft Research Blog.
- Video-LaVIT: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv preprint arXiv:2402.03161.
- Reformer: The efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
- MatFormer: Nested transformer for elastic inference. arXiv preprint arXiv:2310.07707.
- Matryoshka representation learning. In Advances in Neural Information Processing Systems, volume 35, pages 30233–30249. Curran Associates, Inc.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. of ICML.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
- LLaMA-VID: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043.
- Evaluating object hallucination in large vision-language models. In Proc. of EMNLP.
- EViT: Expediting vision transformers via token reorganizations. In International Conference on Learning Representations.
- Improved baselines with visual instruction tuning. arXiv preprint.
- Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc.
- MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS.
- OpenAI. 2023. GPT-4 technical report.
- Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824.
- Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
- DynamicViT: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems, volume 34, pages 13937–13949. Curran Associates, Inc.
- Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626.
- Large language models as generalizable policies for embodied tasks. ICLR.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint.
- A-ViT: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- MM-Vet: Evaluating large multimodal models for integrated capabilities. In International Conference on Machine Learning. PMLR.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore. Association for Computational Linguistics.
- Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2978–2988.
- TinyLLaVA: A framework of small-scale large multimodal models.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
- LLaVA-Phi: Efficient multi-modal assistant with small language model.