Dense Connector for MLLMs

arXiv:2405.13800
Published May 22, 2024 in cs.CV and cs.AI

Abstract

Do we fully leverage the potential of the visual encoder in Multimodal LLMs (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B→70B), and diverse architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development.

Figure: Dense Connector in an MLLM, with its three instantiations compared by token count, feature dimension, and downsampling ratio.

Overview

  • The paper introduces the Dense Connector (DC) to enhance Multimodal LLMs (MLLMs) by integrating multi-layer visual features, addressing the underutilization of visual signals in current models.

  • Three strategies (Sparse Token Integration, Sparse Channel Integration, and Dense Channel Integration) are proposed for incorporating visual features from different layers, each adding minimal computational overhead.

  • Experimental validation shows that the Dense Connector significantly improves image and video understanding across various encoders, resolutions, and datasets, achieving state-of-the-art performance and effective zero-shot video understanding.

Enhancing Multimodal LLMs with Dense Connector

The paper introduces a novel approach to improving Multimodal LLMs (MLLMs) by focusing on the integration of multi-layer visual features through a Dense Connector (DC). This method addresses the underutilization of visual signals in current MLLMs and aims to leverage the potential richness of pre-trained visual encoders. The approach is validated across a broad range of tasks and architectures, demonstrating strong improvements in both image and video understanding.

The paper begins by highlighting a gap in current MLLM research, which predominantly focuses on enhancing the linguistic capabilities of these models. Typically, the visual component is extracted by a frozen visual encoder, leading to the loss of potentially valuable intermediate features. In response, the authors propose the Dense Connector - a plug-and-play method that captures and integrates visual features from various layers of the visual encoder, thereby providing richer visual context to the language model with minimal computational overhead.
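
To make the plug-and-play framing concrete, the following sketch (not from the paper) shows that a frozen vision encoder already exposes every intermediate layer, so harvesting multi-layer features costs a single flag rather than extra forward passes. The checkpoint name and the specific layer indices are illustrative assumptions.

```python
import torch
from transformers import CLIPVisionModel

# Frozen vision tower, as in LLaVA-style MLLMs (checkpoint is illustrative).
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336").eval()

pixel_values = torch.randn(1, 3, 336, 336)  # stand-in for a preprocessed image
with torch.no_grad():
    out = encoder(pixel_values, output_hidden_states=True)

# Tuple of (num_layers + 1) tensors, each (batch, tokens, dim). Standard
# connectors keep only one late layer (LLaVA uses the penultimate one)
# and discard the rest; the Dense Connector feeds several of them forward.
features = out.hidden_states
shallow, deep = features[8], features[-2]  # illustrative layer choices
```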

Methodology

The Dense Connector employs three strategies to integrate multi-layer visual features into MLLMs (a code sketch of all three follows the list):

  1. Sparse Token Integration (STI): This method gathers visual tokens from a few selected layers, downsamples them where necessary, and concatenates them with the final-layer tokens before passing everything through a learnable projector.
  2. Sparse Channel Integration (SCI): Rather than increasing the number of tokens, this approach concatenates visual features from different layers along the channel dimension, then applies a projector to map them into the text embedding space.
  3. Dense Channel Integration (DCI): This extends SCI to all intermediate layers, grouping adjacent layers and fusing each group additively, which reduces redundancy while adding robustness.
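
As a concrete illustration, the sketch below implements plausible versions of all three strategies in PyTorch. The layer indices, the two-group split for DCI, and the 2-layer MLP projector are assumptions chosen for readability, not the authors' exact configuration; `hidden_states` is assumed to be a list of per-layer patch features with the [CLS] token already removed.

```python
import torch
import torch.nn as nn

class DenseConnector(nn.Module):
    """Minimal sketch of the three integration strategies. Assumes
    `hidden_states` is a list of per-layer features of shape
    (batch, tokens, vis_dim) with the [CLS] token already dropped.
    Layer indices, the two-group DCI split, and the 2-layer MLP
    projector are illustrative, not the paper's exact config."""

    def __init__(self, vis_dim, llm_dim, mode="dci"):
        super().__init__()
        self.mode = mode
        # Channel-level fusion triples the projector's input width:
        # final layer plus two extra feature groups.
        in_dim = vis_dim * 3 if mode in ("sci", "dci") else vis_dim
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, hidden_states):
        final = hidden_states[-1]
        if self.mode == "sti":
            # STI: downsample tokens from two earlier layers and append
            # them to the final-layer tokens (more tokens, same width).
            extras = [self._pool(hidden_states[i]) for i in (8, 16)]
            fused = torch.cat(extras + [final], dim=1)
        elif self.mode == "sci":
            # SCI: concatenate two earlier layers with the final layer
            # along the channel axis (same token count, wider features).
            fused = torch.cat([hidden_states[8], hidden_states[16], final], dim=-1)
        else:
            # DCI: split all earlier layers into two groups, additively
            # fuse each group, then channel-concat with the final layer.
            layers, mid = hidden_states[:-1], (len(hidden_states) - 1) // 2
            g1 = torch.stack(layers[:mid]).sum(0)
            g2 = torch.stack(layers[mid:]).sum(0)
            fused = torch.cat([g1, g2, final], dim=-1)
        return self.proj(fused)  # (batch, tokens, llm_dim)

    @staticmethod
    def _pool(x, ratio=2):
        # Average-pool tokens on their 2D patch grid by `ratio`.
        b, n, d = x.shape
        s = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, d, s, s)
        return nn.functional.avg_pool2d(x, ratio).flatten(2).transpose(1, 2)

# Usage: fuse 24 mock ViT layers (576 patch tokens, width 1024) into
# LLM-ready tokens for a hypothetical 4096-dim language model.
feats = [torch.randn(1, 576, 1024) for _ in range(24)]
tokens = DenseConnector(1024, 4096, mode="dci")(feats)
print(tokens.shape)  # torch.Size([1, 576, 4096])
```

Note that SCI and DCI leave the token count untouched, so the only extra cost is a wider first projector layer, which is consistent with the paper's claim of minimal overhead.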

The authors argue that these strategies constitute efficient and straightforward mechanisms for significantly enhancing visual representations used within MLLMs.

Experimental Results

The Dense Connector is extensively validated across variations in visual encoders, image resolutions, training datasets, and LLM architectures, with LLM sizes ranging from 2.7B to 70B parameters. Key findings include:

  • Scalability and Versatility: The approach remains effective across different visual encoders (e.g., CLIP-ViT-L and SigLIP-ViT-SO), resolutions (336px to 768px), and training datasets (such as LLaVA-1.5 and Mini-Gemini). The Dense Connector also integrates seamlessly with various LLMs, demonstrating robustness and generalizability.
  • Performance Improvements: Empirical results show that the Dense Connector achieves state-of-the-art performance on multiple image and video benchmarks. For example, the Dense Connector with a 13B LLM outperformed several existing models on benchmarks such as GQA, VQAv2, and MM-Vet.
  • Zero-Shot Video Understanding: Despite being trained solely on image data, the models equipped with the Dense Connector exhibit impressive zero-shot capabilities in video tasks, reinforcing the effectiveness of the multi-layer feature integration.

Implications and Future Directions

The paper's findings suggest significant implications for the development of MLLMs:

  • Enhanced Visual Embeddings: By leveraging multi-layer visual features, the Dense Connector can significantly improve the richness of visual embeddings fed into LLMs. This enhancement facilitates better alignment between visual and textual modalities, thereby improving overall model performance.
  • Computational Efficiency: The Dense Connector adds minimal computational overhead, making it a viable addition to existing MLLM architectures without introducing prohibitive costs.
  • Generalizability: The ability of the Dense Connector to enhance performance across a variety of tasks and models hints at its potential as a standard component in future MLLM architectures.

Conclusion

The paper presents a compelling case for rethinking the use of visual signals in MLLMs. By integrating multi-layer features through the Dense Connector, the authors provide a simple yet effective solution to enhance the visual understanding capabilities of these models. This method not only achieves state-of-the-art results across numerous benchmarks but also demonstrates significant potential for future advancements in MLLM research and applications. As the field progresses, the Dense Connector could become a foundational module, driving future innovations in multimodal AI.
