LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

(2403.11703)
Published Mar 18, 2024 in cs.CV and cs.AI

Abstract

Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) An image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6 times larger (i.e., 672x1088) resolution images using only 94% inference computation, and achieves 6.4 accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours of LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.

LLaVA-UHD framework splits high-resolution images into slices for efficient LLM processing using 2D interpolation.
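
The 2D interpolation mentioned in the caption refers to adapting the visual encoder's position embeddings to slices whose patch grids differ from the pretraining grid. The sketch below is not the authors' code; it is a minimal illustration of the standard bilinear resampling of ViT position embeddings, assuming a square pretraining grid with no CLS token, and the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """Bilinearly resample learned ViT patch position embeddings from their
    (assumed square) pretraining grid to a new (height, width) patch grid."""
    num_patches, dim = pos_embed.shape
    old = int(num_patches ** 0.5)  # e.g. 24 patches per side for 336 px / patch size 14
    grid = pos_embed.reshape(1, old, old, dim).permute(0, 3, 1, 2)       # (1, dim, 24, 24)
    grid = F.interpolate(grid, size=new_hw, mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_hw[0] * new_hw[1], dim)  # (h*w, dim)

# Hypothetical usage: adapt a 24x24 grid to a 26x24 slice grid (shapes are illustrative).
pe = torch.randn(24 * 24, 1024)
print(interpolate_pos_embed(pe, (26, 24)).shape)  # torch.Size([624, 1024])
```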

Overview

  • LLaVA-UHD introduces a novel strategy for Large Multimodal Models (LMMs) to efficiently process high-resolution images in any aspect ratio.

  • The model employs an innovative image modularization strategy, alongside a compression module, and a spatial schema for enhanced image understanding.

  • Experimental results show LLaVA-UHD outperforms existing models in accuracy across nine benchmarks, supporting larger resolution images with lower computational requirements.

  • This advancement promises significant improvements in real-world applications of AI, offering a more adaptable and efficient approach to image processing.

LLaVA-UHD: Efficiently Handling Any Aspect Ratio and High-Resolution Images in Large Multimodal Models

Introduction

The capabilities of multimodal understanding, reasoning, and interaction have advanced substantially, largely owing to the integration of visual signals into LLMs. This integration hinges on efficient and adaptive visual encoding strategies. Current Large Multimodal Models (LMMs), however, fall short in efficiently handling images of varying aspect ratios and high resolutions, which is paramount for real-world applications. This paper introduces LLaVA-UHD, a novel LMM that efficiently processes images in any aspect ratio and at high resolution. LLaVA-UHD addresses these shortcomings through an innovative image modularization strategy, a compression module, and a spatial schema for slice organization.

Systematic Flaws in Existing Models

Investigating GPT-4V and LLaVA-1.5, the study identifies systematic flaws in their visual encoding, particularly in correctly perceiving high-resolution images. The findings underscore a potential vulnerability to adversarial attacks, emphasizing the need for improved visual encoding strategies.

Core Components of LLaVA-UHD

  1. Image Modularization Strategy: This component divides native-resolution images into smaller, variable-sized slices, adapting efficiently to any aspect ratio and resolution. Unlike previous methods relying on fixed aspect ratios, LLaVA-UHD's approach ensures full adaptivity with minimal deviation from the visual encoder's pretraining settings (see the sketch after this list).
  2. Compression Module: To manage the processing demands of high-resolution images, a compression layer further condenses image tokens, reducing the computational load on LLMs.
  3. Spatial Schema: A novel spatial schema organizes slice tokens, providing LLMs with contextual information about slice positions within the image. This aids the model in understanding the global structure of the image from its parts.
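
To make the modularization and spatial schema concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a square 336x336 ViT input, scores candidate grids only by how far each slice's aspect ratio drifts from the encoder's 1:1 pretraining aspect ratio, and uses hypothetical names throughout (candidate_grids, choose_grid, spatial_schema, and the <slice_r_c> placeholders); the paper's actual scoring function and token schema may differ in detail.

```python
import math

def candidate_grids(n_slices: int, tol: int = 1):
    """Enumerate (cols, rows) factorizations for n_slices +/- tol total slices."""
    grids = []
    for n in range(max(1, n_slices - tol), n_slices + tol + 1):
        grids += [(c, n // c) for c in range(1, n + 1) if n % c == 0]
    return grids

def choose_grid(img_w: int, img_h: int, vit_res: int = 336):
    """Pick the slice grid whose per-slice aspect ratio deviates least from the
    encoder's (assumed square) pretraining aspect ratio."""
    # Ideal slice count: image area divided by the encoder's pretraining input area.
    ideal = max(1, math.ceil((img_w * img_h) / (vit_res * vit_res)))
    best, best_score = (1, 1), float("inf")
    for cols, rows in candidate_grids(ideal):
        slice_w, slice_h = img_w / cols, img_h / rows
        score = abs(math.log(slice_w / slice_h))  # 0 when the slice is square
        if score < best_score:
            best, best_score = (cols, rows), score
    return best

def spatial_schema(cols: int, rows: int) -> str:
    """Arrange slice placeholders row by row, separating slices within a row by ','
    and rows by a newline (the '<slice_r_c>' names are made up for illustration)."""
    return "\n".join(",".join(f"<slice_{r}_{c}>" for c in range(cols)) for r in range(rows))

cols, rows = choose_grid(img_w=1088, img_h=672)  # the abstract's 672x1088 example
print(cols, rows)                                # 3 2  -> six slices
print(spatial_schema(cols, rows))
```

For a 1088x672 image this selects a 3x2 grid, i.e., six roughly 336-pixel slices, consistent with the abstract's 672x1088 example.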

Experimental Findings

LLaVA-UHD demonstrates superior performance across nine benchmarks, outstripping existing models trained on significantly larger datasets. Noteworthy improvements include a 6.4-point increase in accuracy on TextVQA and a 3.2-point increase on POPE compared with LLaVA-1.5. Moreover, it supports images six times larger in resolution while requiring only 94% of LLaVA-1.5's inference computation.

Practical Implications and Theoretical Significance

LLaVA-UHD's approach contributes to the broader field of AI by offering an efficient solution for processing high-resolution images within LMMs, without sacrificing performance or computational efficiency. The model's adaptability to any aspect ratio and resolution reflects a significant step toward handling real-world images more effectively.

Future Directions

The paper hints at future exploration into encoding higher-resolution images and tasks such as small object detection, emphasizing the need for continued advancement in visual encoding strategies within multimodal systems.

Conclusion

LLaVA-UHD represents a critical advancement in the visual perception capabilities of LMMs. By addressing the fundamental limitations around aspect ratio adaptability and the processing of high-resolution images, the model sets a new benchmark for efficiency and accuracy in multimodal AI systems.
