An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (2403.06764v3)
Abstract: In this study, we identify the inefficient attention phenomenon in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat, and Video-LLaVA. We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting a need for a sparser approach than is used for textual data. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance on a wide range of image and video understanding tasks. The trade-off between computational efficiency and performance in FastV is highly customizable and Pareto-efficient: it can compress the FLOPs of a 13B-parameter model below the budget of a 7B-parameter model while still maintaining superior performance. We believe FastV has practical value for deploying LVLMs on edge devices and in commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
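The pruning idea described above, namely ranking visual tokens by the attention they receive at an early layer and dropping the lowest-ranked ones for all subsequent layers, can be illustrated with a minimal sketch. The snippet below assumes a PyTorch-style decoder; the function name, signature, and the choice of averaging attention over heads and query positions are illustrative assumptions, not the released FastV implementation (see the linked repository for that).

```python
import torch

def prune_visual_tokens_by_attention(
    hidden_states: torch.Tensor,   # (batch, seq_len, hidden) activations entering the next layer
    attn_weights: torch.Tensor,    # (batch, heads, seq_len, seq_len) attention from the filtering layer
    visual_start: int,             # index of the first visual token in the sequence
    num_visual: int,               # number of visual tokens
    keep_ratio: float = 0.5,       # fraction of visual tokens to keep (the title suggests ~1/2)
) -> torch.Tensor:
    """Illustrative sketch: keep only the visual tokens that receive the most attention."""
    # Average attention each key position receives, over heads and query positions.
    received = attn_weights.mean(dim=1).mean(dim=1)                    # (batch, seq_len)
    visual_scores = received[:, visual_start:visual_start + num_visual]

    # Select the top-scoring visual tokens (indices relative to the visual span).
    num_keep = max(1, int(num_visual * keep_ratio))
    keep_rel = visual_scores.topk(num_keep, dim=-1).indices.sort(dim=-1).values
    keep_abs = keep_rel + visual_start                                 # absolute positions

    # Rebuild each sequence from all text tokens plus the selected visual tokens.
    batch, seq_len, _ = hidden_states.shape
    text_idx = torch.cat([
        torch.arange(0, visual_start),
        torch.arange(visual_start + num_visual, seq_len),
    ]).to(hidden_states.device)
    pruned = []
    for b in range(batch):
        idx = torch.cat([text_idx, keep_abs[b]]).sort().values
        pruned.append(hidden_states[b, idx])
    return torch.stack(pruned)
```

In this framing, the filtering layer and the keep ratio are the two tunable knobs; the title's "1/2 tokens after layer 2" corresponds to pruning roughly half of the visual tokens after the second decoder layer, with later layers operating on the shortened sequence.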