
Matryoshka Query Transformer for Large Vision-Language Models

(2405.19315)
Published May 29, 2024 in cs.CV, cs.CL, and cs.LG

Abstract

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m ≤ M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) sacrifices only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and the computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.

MQT-LLaVA achieves significant speed-ups with fewer visual tokens, maintaining competitive performance.

Overview

  • The paper introduces the Matryoshka Query Transformer (MQT) to enable flexible visual token budgets in Large Vision-Language Models (LVLMs), enhancing computational efficiency.

  • MQT employs a nested structure inspired by Matryoshka Representation Learning, allowing the number of visual tokens to be chosen at inference time, and has been integrated with the LLaVA model.

  • Empirical results show that MQT-LLaVA performs on par with or better than existing models, achieving significant computational savings, especially on tasks that do not require detailed visual understanding.

Flexibly Adapting Visual Token Budgets: An Analysis of Matryoshka Query Transformers in Vision-Language Models

Overview

The paper introduces the concept of the Matryoshka Query Transformer (MQT) to address the challenge of fixed visual token budgets in Large Vision-Language Models (LVLMs). Traditional LVLMs are often constrained by a fixed number of visual tokens, resulting in inefficiencies when adapting to varying computational constraints across different applications. The proposed MQT model allows for a flexible and adaptive number of visual tokens, considerably enhancing computational efficiency while maintaining robust performance.

Matryoshka Query Transformer (MQT)

Inspiration and Concept

Inspired by Matryoshka Representation Learning, MQT employs a query transformer whose number of output visual tokens can be chosen at inference time. During training, the model randomly samples m ≤ M at each step and keeps only the first m of the M latent query tokens, discarding the rest. This produces a Matryoshka-like nested structure in which a token's significance correlates with its position in the nesting: tokens that appear earlier are used under every budget and therefore carry the most broadly useful information. A minimal sketch of this training-time query dropping is given below.
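To make the mechanism concrete, here is a minimal PyTorch-style sketch of the training-time query dropping. This is not the authors' released code; the class name QueryTransformer, the single cross-attention layer, the projection layer, and the sampling grid for m are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of Matryoshka-style query dropping.
import random
import torch
import torch.nn as nn

class QueryTransformer(nn.Module):
    def __init__(self, dim: int, max_queries: int = 256, num_heads: int = 8):
        super().__init__()
        # M learnable latent query tokens; only a prefix of them is used per step.
        self.latent_queries = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # stand-in for the projection into the LLM space

    def forward(self, vision_features: torch.Tensor, m: int) -> torch.Tensor:
        # vision_features: (batch, num_patches, dim) from a frozen vision encoder.
        # Keep only the first m latent queries -- the Matryoshka "nesting".
        batch = vision_features.size(0)
        queries = self.latent_queries[:m].unsqueeze(0).expand(batch, -1, -1)
        visual_tokens, _ = self.cross_attn(queries, vision_features, vision_features)
        return self.proj(visual_tokens)  # (batch, m, dim) visual tokens for the LLM

# During training, m is re-sampled at every step so each prefix length is exercised.
qt = QueryTransformer(dim=1024, max_queries=256)
feats = torch.randn(2, 576, 1024)                    # e.g. ViT patch embeddings
m = random.choice([2, 4, 8, 16, 36, 64, 144, 256])   # assumed sampling grid
tokens = qt(feats, m)                                # (2, m, 1024)
```

Because every budget m reuses the same leading queries, the model is pushed to pack the most transferable visual information into the earliest tokens.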

Technical Implementation

The implementation integrates MQT with the Large Vision-Language Model LLaVA, referred to as MQT-LLaVA. The training process is conducted in two stages: initial alignment and subsequent adaptive training with varying numbers of visual tokens. Using this methodology, MQT-LLaVA can effectively encode images into a dynamically chosen number of visual tokens (up to a maximum of 256), as opposed to the fixed 576 tokens in LLaVA-1.5.
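As a hedged illustration of this inference-time flexibility, the same trained module can simply be queried with any prefix length. The snippet below reuses the hypothetical QueryTransformer, qt, and feats from the sketch above and is not the released MQT-LLaVA API.

```python
# Illustrative only: pick a visual-token budget per deployment target at inference time.
for m in (256, 64, 16, 2):
    visual_tokens = qt(feats, m)  # (batch, m, dim), reusing the sketch above
    # In an LLaVA-style pipeline, these m tokens would be prepended to the
    # text-token embeddings before the language model generates its answer.
    print(f"token budget {m:>3}: visual prefix of shape {tuple(visual_tokens.shape)}")
```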

Empirical Performance

Strong Numerical Results

MQT-LLaVA, with a maximum of 256 visual tokens, achieves performance on par with or better than LLaVA-1.5 across 11 benchmarks. Remarkably, reducing the token count to 16 (an 8x reduction in TFLOPs) results in only an approximate 2.4-point drop on MMBench. Tasks such as ScienceQA and MMMU show minimal degradation even with as few as 2 visual tokens, with drops of roughly 3% and 6%, respectively.

Performance-Efficiency Trade-Offs

The study finds that different tasks have varying dependencies on the number of visual tokens:

  • High Token Requirement: Tasks such as VQAv2, GQA, and MMBench require more tokens for optimal performance due to their need for detailed visual understanding.
  • Low Token Requirement: Other tasks, including ScienceQA and MME Cognition, maintain robust performance with significantly fewer tokens, suggesting that in these contexts, the language model's reasoning capabilities overshadow the need for detailed visual tokens.

The flexible adaptation of visual token budgets enables significant computational savings without notable performance trade-offs, particularly for tasks demanding less fine-grained visual detail.

Implications and Future Research

Practical Impact

The proposed MQT-LLaVA model is highly versatile, making it applicable across diverse computational environments, from resource-constrained mobile devices to high-performance servers. The ability to dynamically adjust visual token budgets allows for real-time processing in applications with varying computational constraints.

Theoretical Contributions

The nested Matryoshka-like structure presents a novel means of organizing and efficiently utilizing visual tokens in LVLMs. This approach could influence future LVLM architectures, encouraging ongoing research into adaptive token strategies that further optimize computational efficiency and performance.

Speculative Future Directions

Looking forward, the principles established by MQT could be applied to other modalities beyond images, potentially influencing video and 3D data processing. Further exploration into the balance between the information density of visual tokens and computational cost stands to benefit the development of more scalable and resource-efficient models.

Conclusion

The Matryoshka Query Transformer model presents a substantive step towards addressing the rigidity of fixed visual token budgets in LVLMs. By enabling adaptive visual token counts during inference, the MQT model delivers substantial computational efficiency gains while preserving robust performance across varied vision-language tasks. This advancement underscores the potential for even more adaptable and efficient vision-language models in the future.
