LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (2403.11703v1)
Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images at fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema that organizes slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports images at 6 times higher resolution (i.e., 672x1088) using only 94% of the inference computation, and achieves a 6.4-point accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). We make the data and code publicly available at https://github.com/thunlp/LLaVA-UHD.
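To make the image modularization idea concrete, below is a minimal sketch, not the authors' exact algorithm, of how a slice grid might be chosen for an arbitrary-resolution image so that each slice stays close to the visual encoder's pretraining resolution (336x336 for the CLIP-ViT used by LLaVA-1.5). The function name `choose_grid`, the `max_slices` cap, and the specific scoring function are illustrative assumptions; the actual strategy and constraints are defined in the paper and repository.

```python
# Hedged sketch of native-resolution image modularization (assumed details,
# not the official LLaVA-UHD implementation): pick a (cols, rows) slice grid
# so that each slice's shape stays close to the encoder's 336x336 input.
import math

PATCH = 336  # assumed ViT input size, as in LLaVA-1.5's CLIP encoder


def choose_grid(width: int, height: int, max_slices: int = 6):
    """Return (cols, rows) minimizing slice aspect-ratio distortion and
    deviation from the 'ideal' slice count for this resolution."""
    n_ideal = math.ceil((width * height) / (PATCH * PATCH))
    best, best_cost = (1, 1), float("inf")
    for n in range(1, max_slices + 1):
        for cols in range(1, n + 1):
            if n % cols:
                continue  # only exact grids: cols * rows == n
            rows = n // cols
            slice_w, slice_h = width / cols, height / rows
            # Cost: how far each slice is from square (log aspect ratio)
            # plus how far the slice count is from the ideal count.
            cost = abs(math.log(slice_w / slice_h)) + abs(n - n_ideal)
            if cost < best_cost:
                best, best_cost = (cols, rows), cost
    return best


# Example: a 672x1088 image (the 6x-resolution case from the abstract)
# yields a 2x3 grid, i.e., slices of roughly 336x363 pixels.
print(choose_grid(672, 1088))
```

Each resulting slice would then be encoded by the ViT and its tokens condensed by the compression module, with the spatial schema telling the LLM how slices tile the original image.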