xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

(2408.08872)
Published Aug 16, 2024 in cs.CV, cs.AI, and cs.CL

Abstract

This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.

Overview

  • xGen-MM (BLIP-3) presents a comprehensive framework for developing Large Multimodal Models (LMMs), with an emphasis on extensive dataset curation and a simplified training process.

  • Key innovations include enhanced vision-language integration using a perceiver resampler, unified training objectives focusing on auto-regressive loss, and significant performance improvements across multiple benchmarks.

  • The paper highlights the open-sourcing of models and datasets to encourage further research, with practical applications in areas like automated content generation and interactive AI, and outlines potential future developments in adaptive AI systems.

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

The xGen-MM (BLIP-3) paper introduces a comprehensive framework for developing Large Multimodal Models (LMMs). The framework encompasses carefully curated datasets, a detailed training recipe, and model architectures. The resulting suite of models, collectively referred to as xGen-MM, performs well across a range of tasks, including single-image and multi-image benchmarks.

Key Components and Innovations

The primary focus of the xGen-MM framework is to scale up LMM training by leveraging a mixture of curated interleaved and caption datasets. The framework builds on the foundation established by BLIP-2 and introduces several critical improvements:

  1. Expanded Dataset and Training Recipe:

    • The authors curated extensive datasets, including MINT-1T and BLIP3-KALE among others, to provide a broader and richer training base. The focus was on scaling both the quantity and quality of datasets to cover a wide range of multimodal interleaved data and high-quality dense captions.
  2. Enhanced Vision-Language Integration:

    • The Q-Former layers from BLIP-2 were replaced with a more effective and scalable vision token sampler. The sampler uses a perceiver resampler to downsample image patch embeddings into a compact, fixed-length set of vision tokens that the pre-trained Large Language Model (LLM) can consume (see the first sketch after this list).
  3. Simplified Training Objectives:

    • Training was simplified by unifying the objectives into a single auto-regressive loss on text tokens, replacing the multiple loss terms used in BLIP-2. This unification streamlines training and improves scalability (see the second sketch after this list).
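
Below is a minimal, illustrative sketch of a perceiver-resampler-style vision token sampler in PyTorch. The hidden size, layer count, and number of latent queries are assumptions for illustration and do not reflect the paper's exact configuration.

```python
# Minimal sketch of a perceiver-resampler-style vision token sampler
# (illustrative only; all dimensions below are assumptions, not the paper's values).
import torch
import torch.nn as nn


class PerceiverResampler(nn.Module):
    """Compress a variable number of image patch embeddings into a fixed
    number of vision tokens via cross-attention from learned latent queries."""

    def __init__(self, dim=768, num_latents=128, num_layers=6, num_heads=12):
        super().__init__()
        # One learned latent query per output vision token.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm_q": nn.LayerNorm(dim),
                "norm_kv": nn.LayerNorm(dim),
                "ffn": nn.Sequential(
                    nn.LayerNorm(dim),
                    nn.Linear(dim, 4 * dim),
                    nn.GELU(),
                    nn.Linear(4 * dim, dim),
                ),
            })
            for _ in range(num_layers)
        ])

    def forward(self, patch_embeds):  # (batch, num_patches, dim)
        batch = patch_embeds.size(0)
        x = self.latents.unsqueeze(0).expand(batch, -1, -1)
        for layer in self.layers:
            # Latent queries attend to the (normalized) image patch embeddings.
            kv = layer["norm_kv"](patch_embeds)
            attn_out, _ = layer["attn"](layer["norm_q"](x), kv, kv)
            x = x + attn_out
            x = x + layer["ffn"](x)
        return x  # (batch, num_latents, dim): fixed-length vision tokens


# Usage: 1024 ViT patch embeddings are downsampled to 128 vision tokens.
resampler = PerceiverResampler()
patches = torch.randn(2, 1024, 768)
print(resampler(patches).shape)  # torch.Size([2, 128, 768])
```

The key property is that the output length is fixed by the number of learned latents, so the LLM sees a constant number of vision tokens regardless of image resolution or patch count.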
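The unified objective can be illustrated as a standard next-token cross-entropy computed only over text positions. The token layout and ignore-index convention below are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of the unified objective: next-token prediction on text tokens
# only, with the loss masked out at vision-token positions (illustrative only).
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the cross-entropy loss


def text_only_autoregressive_loss(logits, labels, is_vision_token):
    """logits: (batch, seq, vocab); labels: (batch, seq);
    is_vision_token: (batch, seq) bool mask marking vision-token slots."""
    labels = labels.clone()
    labels[is_vision_token] = IGNORE_INDEX  # no loss on vision tokens
    # Shift so each position predicts the next token.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )


# Usage with dummy tensors: an 8-token sequence whose first 3 slots are vision tokens.
logits = torch.randn(1, 8, 32000)
labels = torch.randint(0, 32000, (1, 8))
vision_mask = torch.tensor([[1, 1, 1, 0, 0, 0, 0, 0]], dtype=torch.bool)
print(text_only_autoregressive_loss(logits, labels, vision_mask).item())
```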

Strong Numerical Results

The xGen-MM models achieved competitive performance across multiple benchmarks, maintaining robust performance in both single and multi-image scenarios. For instance:

Few-shot Learning Capabilities:

- On Visual Question Answering (VQA) datasets such as VQAv2 and TextVQA, the xGen-MM models show strong few-shot performance, scoring 66.9% on VQAv2 and 55.3% on TextVQA in the 8-shot setting and outperforming comparable models such as MM1-3B as well as larger models such as Idefics-9B.
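
For context, few-shot evaluation of interleaved LMMs is typically done by concatenating n (image, question, answer) exemplars before the query image and question. The sketch below shows one way such a prompt might be assembled; the "<image>" placeholder and the "Question/Short answer" template are assumptions, not the paper's exact format.

```python
# Minimal sketch of assembling an n-shot VQA prompt for in-context evaluation
# (the placeholder token and template are illustrative assumptions).
def build_few_shot_vqa_prompt(exemplars, query_question, image_token="<image>"):
    """exemplars: list of (question, answer) pairs, one per in-context image."""
    parts = []
    for question, answer in exemplars:
        parts.append(f"{image_token} Question: {question} Short answer: {answer}")
    # The query image and question come last; the model generates the answer.
    parts.append(f"{image_token} Question: {query_question} Short answer:")
    return "\n".join(parts)


# Usage: an 8-shot prompt uses 8 exemplars plus the query (9 images in total).
shots = [("What is shown in this image?", f"example answer {i}") for i in range(8)]
print(build_few_shot_vqa_prompt(shots, "What color is the bus?"))
```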

Captioning Tasks:

- The models also excel on captioning benchmarks such as COCO and NoCaps, reaching scores of 109.8 and 104.6, respectively, in the 8-shot setting. The gains from zero-shot to few-shot settings further underscore their in-context learning ability.
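
Captioning benchmarks such as COCO and NoCaps are commonly scored with CIDEr, usually reported scaled by 100. Below is a minimal sketch of computing CIDEr with the pycocoevalcap package; whether this exact toolkit matches the authors' evaluation pipeline is an assumption, and the PTB tokenization step used in official evaluations is omitted for brevity.

```python
# Minimal sketch of CIDEr scoring with pycocoevalcap (assumes the package is
# installed; tokenization preprocessing is omitted, which affects absolute scores).
from pycocoevalcap.cider.cider import Cider

# Both dicts map an image id to a list of caption strings.
references = {
    "img1": ["a man riding a bike down a street", "a person cycles on a road"],
    "img2": ["two dogs playing in the grass", "dogs running on a lawn"],
}
candidates = {
    "img1": ["a man rides a bike down the street"],
    "img2": ["two dogs play in the grass"],
}

corpus_score, per_image_scores = Cider().compute_score(references, candidates)
# Papers typically report CIDEr scaled by 100 (e.g., 1.098 -> 109.8).
print(round(corpus_score * 100, 1), per_image_scores)
```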

Implications and Future Developments

The contributions of xGen-MM (BLIP-3) extend beyond raw performance metrics. By open-sourcing their models and curated datasets, the authors provide valuable resources that foster further research and development within the community. Such transparency and accessibility address a noted gap between proprietary and open-source models, democratizing the ability to replicate, understand, and improve upon these foundational models.

Practical and Theoretical Implications: The practical applications of these advancements include enhanced capabilities in visual and textual data processing, potentially impacting areas such as automated content generation, interactive AI, and more nuanced multimodal understanding. On the theoretical side, the unified model architecture and training objective may stimulate further exploration into simplifying and scaling multimodal model training.

Speculative Future Developments: The potential for xGen-MM extends into the realm of adaptive AI systems capable of interpreting and generating multimodal content with higher fidelity and contextual accuracy. Future developments could explore the integration of bounding box information for OCR tasks, further refinement of the instruction-aware perception modules, and optimized token samplers for various encoded inputs.

Conclusion

The xGen-MM (BLIP-3) framework represents a significant step toward scalable and accessible LMMs. Its robust model architecture, coupled with extensive dataset curation, yields models capable of high performance across diverse benchmarks. By lowering barriers for the research community to access and utilize these advancements, xGen-MM (BLIP-3) lays the groundwork for accelerated innovation in large multimodal models.
