LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (2403.11703v1)
Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images at fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema that organizes slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports images at 6 times higher resolution (i.e., 672x1088) using only 94% of the inference computation, and achieves a 6.4-point accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
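To make the image modularization idea concrete, below is a minimal sketch, not the authors' exact algorithm, of how a slice grid might be chosen for an arbitrary-resolution image so that each slice stays close to the visual encoder's pretraining resolution (336x336 for the CLIP-ViT used by LLaVA-1.5). The function name `choose_grid`, the `max_slices` cap, and the specific scoring function are illustrative assumptions; the actual strategy and constraints are defined in the paper and repository.

```python
# Hedged sketch of native-resolution image modularization (assumed details,
# not the official LLaVA-UHD implementation): pick a (cols, rows) slice grid
# so that each slice's shape stays close to the encoder's 336x336 input.
import math

PATCH = 336  # assumed ViT input size, as in LLaVA-1.5's CLIP encoder


def choose_grid(width: int, height: int, max_slices: int = 6):
    """Return (cols, rows) minimizing slice aspect-ratio distortion and
    deviation from the 'ideal' slice count for this resolution."""
    n_ideal = math.ceil((width * height) / (PATCH * PATCH))
    best, best_cost = (1, 1), float("inf")
    for n in range(1, max_slices + 1):
        for cols in range(1, n + 1):
            if n % cols:
                continue  # only exact grids: cols * rows == n
            rows = n // cols
            slice_w, slice_h = width / cols, height / rows
            # Cost: how far each slice is from square (log aspect ratio)
            # plus how far the slice count is from the ideal count.
            cost = abs(math.log(slice_w / slice_h)) + abs(n - n_ideal)
            if cost < best_cost:
                best, best_cost = (cols, rows), cost
    return best


# Example: a 672x1088 image (the 6x-resolution case from the abstract)
# yields a 2x3 grid, i.e., slices of roughly 336x363 pixels.
print(choose_grid(672, 1088))
```

Each resulting slice would then be encoded by the ViT and its tokens condensed by the compression module, with the spatial schema telling the LLM how slices tile the original image.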