
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

arXiv:2407.08303 · Published Jul 11, 2024 in cs.CV and cs.AI

Abstract

Existing Multimodal LLMs (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their development for comprehensive visual perception hinges on the availability of high-quality image-text datasets that offer diverse visual elements and thorough image descriptions. However, the scarcity of such hyper-detailed datasets currently hinders progress within the MLLM community. The bottleneck stems from the limited perceptual capabilities of current caption engines, which fall short in providing complete and accurate annotations. To facilitate cutting-edge research of MLLMs on comprehensive vision perception, we thereby propose Perceptual Fusion, using a low-budget but highly effective caption engine for complete and accurate image descriptions. Specifically, Perceptual Fusion integrates diverse perception experts as image priors to provide explicit information on visual elements and adopts an efficient MLLM as a centric pivot to mimic advanced MLLMs' perception abilities. We carefully select 1M highly representative images from the uncurated LAION dataset and generate dense descriptions using our engine, dubbed DenseFusion-1M. Extensive experiments validate that our engine outperforms its counterparts, where the resulting dataset significantly improves the perception and cognition abilities of existing MLLMs across diverse vision-language benchmarks, especially with high-resolution images as inputs. The dataset and code are publicly available at https://github.com/baaivision/DenseFusion.

Figure: Perceptual Fusion pipeline for DenseFusion-1M, using multimodal models and a 100K meta dataset from GPT-4V.

Overview

  • The paper presents DenseFusion-1M, a dataset aimed at improving the perceptual capabilities of Multimodal LLMs (MLLMs) by providing hyper-detailed image descriptions.

  • The methodology involves selecting 1 million high-resolution images and integrating various vision experts for dense and accurate image descriptions, enhancing the richness and detail of the multimodal data.

  • Experiments show that models trained with DenseFusion-1M outperform state-of-the-art models in various vision-language benchmarks, demonstrating substantial improvements in detailed visual perception.

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

The paper "DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception," authored by Xiaotong Li et al., introduces a novel approach to enhance the perceptual capabilities of Multimodal LLMs (MLLMs) through hyper-detailed image descriptions. This approach is predicated on integrating various specialized vision experts to construct a high-quality image-text dataset named DenseFusion-1M. The work addresses the significant challenge of acquiring detailed and comprehensive multimodal datasets, which are essential for training MLLMs to accurately perceive and interpret diverse visual information.

Introduction and Motivation

Existing MLLMs have demonstrated considerable progress in multimodal understanding and reasoning by leveraging Large Vision Models (LVMs) and LLMs. However, their performance is constrained by the limited availability of high-quality, detailed image-text datasets: traditional caption engines fail to provide the detailed and accurate annotations necessary for comprehensive visual perception. This paper proposes a solution, Perceptual Fusion, a low-budget yet effective caption engine that integrates various vision experts to generate dense and accurate image descriptions.

Methodology

The methodology involves a two-step process: data pre-processing and perceptual fusion. The authors selected 1 million highly representative, high-resolution, and diverse images from the LAION dataset; once densely captioned by their engine, these form DenseFusion-1M. They then devised a Perceptual Fusion strategy that combines insights from multiple vision experts, covering image tagging, object detection, text recognition, and world knowledge.

Data Pre-Processing

The data pre-processing phase involves two steps, illustrated with a code sketch after the list:

  1. High-Resolution Image Selection: Filtering images with a minimum short-edge resolution of 448 pixels to ensure rich visual content.
  2. Semantic Clustering and De-duplication: Using k-means clustering on image features extracted via EVA-CLIP and removing semantic duplicates within clusters to maintain diverse and high-quality data.
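
The following is a minimal sketch of what such a pre-processing stage could look like, assuming images on disk and an EVA-CLIP checkpoint served through open_clip. The checkpoint tag, cluster count, and de-duplication threshold are illustrative assumptions, not the authors' released pipeline; only the 448-pixel short-edge filter comes from the paper.

```python
# Sketch of high-resolution filtering plus EVA-CLIP clustering and de-duplication.
from glob import glob

import numpy as np
import open_clip
import torch
from PIL import Image
from sklearn.cluster import KMeans

def short_edge_ok(path, min_short_edge=448):
    """Keep only images whose shorter side is at least 448 pixels."""
    with Image.open(path) as im:
        return min(im.size) >= min_short_edge

# 1) High-resolution image selection
candidate_paths = glob("laion_images/*.jpg")  # assumed local layout
paths = [p for p in candidate_paths if short_edge_ok(p)]

# 2) Semantic clustering and de-duplication on EVA-CLIP image features
#    (the checkpoint tag is an assumption; open_clip.list_pretrained() shows the options)
model, _, preprocess = open_clip.create_model_and_transforms(
    "EVA02-L-14", pretrained="merged2b_s4b_b131k")
model.eval()

@torch.no_grad()
def embed(path):
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    feat = model.encode_image(image)
    return torch.nn.functional.normalize(feat, dim=-1).squeeze(0).numpy()

feats = np.stack([embed(p) for p in paths])
labels = KMeans(n_clusters=1000, n_init="auto").fit_predict(feats)  # cluster count is a guess

# Within each cluster, greedily drop near-duplicates by cosine similarity
kept = []
for c in np.unique(labels):
    cluster = np.where(labels == c)[0]
    selected = []
    for i in cluster:
        if all(feats[i] @ feats[j] < 0.95 for j in selected):  # assumed threshold
            selected.append(i)
    kept.extend(paths[i] for i in selected)
```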

Perceptual Fusion

The Perceptual Fusion pipeline integrates multiple vision experts:

  • Image Tagging: Utilizing RAM++ for scene-level understanding.
  • Object Detection: Employing EVA02 for closed-set object detection and OWL-ViTv2 for open-set detection to recognize a wide range of objects.
  • Text Recognition: Leveraging OCR models to capture textual information within images.
  • World Knowledge: Incorporating context and background information from LAION's short captions.

The fusion strategy feeds these expert outputs as supplementary priors into an efficient MLLM that serves as the caption engine. GPT-4V is first used to generate detailed captions for a 100K meta dataset, which then supervise the training of an open caption engine built on LLaVA-1.6; this engine produces the dense descriptions for the full 1M images.
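
A rough sketch of how such a fusion prompt could be assembled is shown below. The data structure, prompt wording, and example hints are hypothetical; only the overall pattern of rendering expert outputs as textual priors for the caption MLLM follows the paper's description.

```python
# Illustrative fusion step: expert outputs become text "hints" combined with the
# original short caption into a single prompt for the caption engine.
from dataclasses import dataclass, field
from typing import List

@dataclass
class VisualHints:
    tags: List[str] = field(default_factory=list)        # e.g. from RAM++
    detections: List[str] = field(default_factory=list)  # e.g. "bus at (x1, y1, x2, y2)" from EVA02 / OWL-ViTv2
    ocr_text: List[str] = field(default_factory=list)    # from an OCR model
    world_caption: str = ""                               # LAION's original short caption

def build_fusion_prompt(hints: VisualHints) -> str:
    """Pack the expert priors into one instruction for the caption MLLM."""
    return (
        "Describe the image in detail, covering every object, text string, "
        "attribute, and spatial relation. Use the following hints, but only "
        "keep what is actually visible in the image:\n"
        f"Tags: {', '.join(hints.tags)}\n"
        f"Detected objects: {'; '.join(hints.detections)}\n"
        f"OCR text: {' | '.join(hints.ocr_text)}\n"
        f"Original caption: {hints.world_caption}\n"
    )

hints = VisualHints(
    tags=["street", "bus", "storefront"],
    detections=["bus at (12, 40, 610, 400)", "person at (420, 180, 470, 330)"],
    ocr_text=["CITY EXPRESS", "Route 42"],
    world_caption="a red bus driving down a busy street",
)
prompt = build_fusion_prompt(hints)
# `prompt` plus the image would then go to the caption engine
# (GPT-4V for the 100K meta set, the LLaVA-1.6-based engine for the full 1M set).
```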

Dataset Description

The DenseFusion-1M dataset comprises 1 million hyper-detailed image-text pairs, enhancing the semantic richness and detail of the image descriptions. The dataset offers comprehensive annotations that include object attributes, spatial relations, text information, and world knowledge. Each description averages 190 words and 11 sentences, significantly enriching the input data for multimodal training.
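
As a sanity check on those statistics, a short script like the one below could compute average caption length, assuming the annotations ship as JSON lines with a "caption" field; the file name and schema are assumptions, not the released format.

```python
# Compute average words and (naively split) sentences per caption.
import json
import re

total_words = total_sents = n = 0
with open("DenseFusion-1M.jsonl", encoding="utf-8") as f:
    for line in f:
        caption = json.loads(line)["caption"]
        total_words += len(caption.split())
        # naive sentence split on ., !, ? followed by whitespace
        total_sents += len([s for s in re.split(r"[.!?]+\s+", caption) if s.strip()])
        n += 1

print(f"avg words/caption: {total_words / n:.1f}")
print(f"avg sentences/caption: {total_sents / n:.1f}")
```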

Experiments and Results

The authors conducted extensive experiments to validate the effectiveness of DenseFusion-1M across various vision-language benchmarks such as VQAv2, GQA, and TextVQA. The results demonstrated that models trained with DenseFusion-1M outperformed state-of-the-art models, particularly in tasks requiring detailed visual perception. Using high-resolution inputs amplified these gains, with especially large improvements in text recognition and fine-grained perception.

Implications and Future Work

The DenseFusion-1M dataset sets a new standard for multimodal datasets, enabling MLLMs to achieve better vision-language alignment through detailed and accurate annotations. The implications of this work are significant for areas requiring meticulous visual understanding, such as autonomous driving, medical imaging, and advanced human-computer interaction.

Future work could explore the integration of additional vision experts and the application of DenseFusion-1M in broader contexts. The potential for enhancing conditional image generation tasks also warrants investigation, as demonstrated by the initial qualitative results in the paper.

Conclusion

The DenseFusion-1M dataset represents a substantial contribution to the field of multimodal perception. By integrating diverse vision experts and generating hyper-detailed image descriptions, this work provides a robust foundation for training advanced MLLMs. The detailed methodological approach and the significant improvements demonstrated across multiple benchmarks highlight the value and potential of this innovative dataset.
