
Chameleon: Mixed-Modal Early-Fusion Foundation Models

(2405.09818)
Published May 16, 2024 in cs.CL

Abstract

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.

Chameleon processes images, text, and code using a unified transformer, generating mixed-modal documents.

Overview

  • Chameleon is a new multimodal AI model that processes both text and image data using a unified token-based transformer architecture, allowing for seamless integration of the two data modes.

  • The model uses innovative tokenization and training techniques, including a new image tokenizer and a two-stage training process to handle large-scale datasets, as well as architectural modifications to ensure stable and scalable performance.

  • Chameleon demonstrates strong performance across various tasks, including image-to-text, visual question answering, and text-only tasks, outperforming several benchmark models and showcasing its potential for content creation, visual question answering, and educational applications.

Chameleon: The New Contender in Multimodal AI

In the ever-evolving landscape of multimodal AI, a paper introduces Chameleon, a collection of foundation models that handle both image and text data using a unified, token-based architecture. Let's break down what this model brings to the table and why it's intriguing for data scientists who are keen on multimodal applications.

Overview

Chameleon stands out because it bridges the gap between text and image processing seamlessly. Traditional multimodal models often employ different encoders or decoders for each type of data, which can limit their ability to integrate information across both modes. Chameleon, however, adopts a fully token-based approach for both images and text. By quantizing images into discrete tokens, similar to how words are represented in text, Chameleon uses a single transformer architecture to process mixed sequences of text and image tokens.
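To make this concrete, here is a minimal sketch (hypothetical token ids, offsets, and function names, not the paper's actual interface) of how quantized image codes could be remapped into the shared vocabulary and spliced into the text token stream that a single transformer then consumes:

```python
# Hypothetical layout: image codebook ids are shifted into a reserved range of the
# shared vocabulary, bracketed by begin/end-of-image sentinels, and spliced into
# the text token stream. All ids below are illustrative.
TEXT_VOCAB_SIZE = 50_000          # assumed size of the text sub-vocabulary
BOI, EOI = 50_000, 50_001         # assumed sentinel tokens marking an image span
IMG_OFFSET = 50_002               # image code k becomes token IMG_OFFSET + k

def build_mixed_sequence(text_before, image_codes, text_after):
    """Interleave text tokens and quantized image codes into one token stream."""
    image_span = [BOI] + [IMG_OFFSET + c for c in image_codes] + [EOI]
    return text_before + image_span + text_after

# Example: a short prompt, one quantized image, then continuation text.
sequence = build_mixed_sequence(
    text_before=[11, 942, 7],         # e.g. "Describe this image:" (illustrative ids)
    image_codes=list(range(1024)),    # one image as 1024 discrete codes
    text_after=[388, 2],              # continuation text
)
print(len(sequence))                  # 3 + 1026 + 2 = 1031 tokens in one flat sequence
```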

But this early-fusion method doesn't come without its challenges. Ensuring stable and scalable training for such a model involves significant architectural innovations and training techniques, which we'll explore further.

Key Innovations

Tokenization & Training

One of Chameleon's major breakthroughs is its tokenization approach. Images are converted into tokens using a new image tokenizer, which encodes a $512\times512$ image as 1024 discrete tokens. For text, it employs a Byte-Pair Encoding (BPE) tokenizer with a vocabulary of 65,536 tokens that also covers the image codebook, so text and image tokens share a single vocabulary.
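For intuition, here is a toy NumPy sketch of VQ-style quantization. Random vectors stand in for the learned encoder outputs and codebook (the 8,192-entry codebook size is an assumption drawn from the paper); each latent in a 32×32 grid is mapped to the id of its nearest codebook entry, yielding 1,024 discrete tokens per image:

```python
# Illustrative VQ-style quantization. The real tokenizer is a learned image
# encoder plus codebook; random vectors stand in for both here, just to show how
# a 32x32 grid of latents becomes 1,024 discrete token ids.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 16))        # assumed: 8,192 entries, 16-dim latents

def quantize(latents, codebook):
    """Replace each latent vector with the index of its nearest codebook entry."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2ab + ||b||^2
    d = ((latents ** 2).sum(1, keepdims=True)
         - 2 * latents @ codebook.T
         + (codebook ** 2).sum(1))
    return d.argmin(axis=1)                   # one integer token id per latent

latents = rng.normal(size=(32 * 32, 16))      # stand-in for the encoder's 32x32 grid
image_tokens = quantize(latents, codebook)
print(image_tokens.shape)                     # (1024,) tokens for one 512x512 image
```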

Training is divided into two stages over an extensive dataset of both text and image data:

  • Stage 1 covers the bulk of training on large-scale text, image-text, and interleaved data.
  • Stage 2 lowers the weight of the first-stage data and mixes in higher-quality, more curated datasets (a rough sketch of such a schedule follows this list).
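As a toy illustration only, the schedule below shows how such a two-stage mixture could be expressed; the 80% stage boundary, source names, and sampling weights are assumptions for illustration, not the paper's actual configuration:

```python
# Toy two-stage data-mixture schedule. The stage boundary, source names, and
# weights are illustrative assumptions, not the paper's configuration.
def mixture_for_step(step: int, total_steps: int, stage1_frac: float = 0.8) -> dict:
    """Return sampling weights over data sources for the current training step."""
    if step < stage1_frac * total_steps:
        # Stage 1: large-scale text, image-text, and interleaved data.
        return {"text": 0.5, "image_text": 0.3, "interleaved": 0.2}
    # Stage 2: down-weight the stage-1 sources and mix in higher-quality data.
    return {"text": 0.25, "image_text": 0.15, "interleaved": 0.10, "high_quality": 0.50}

print(mixture_for_step(900_000, 1_000_000))   # stage-2 mixture near the end of training
```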

Architectural Solutions for Stability

Scaling Chameleon posed stability challenges, particularly when extending beyond 8 billion parameters and 1 trillion training tokens. The modifications the paper credits for keeping training stable include:

  • Query-Key Normalization (QK-Norm): layer normalization applied to the query and key projections within each attention block, keeping attention logits from growing without bound (sketched below).
  • Norm reordering: revised placement of the layer norms within the transformer block, in the spirit of Swin Transformer's normalization strategy, used for the largest model.
  • Dropout: applied after the attention and feed-forward layers, which helped the smaller models.
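As a rough illustration of QK-Norm, here is a minimal PyTorch sketch (an assumed implementation, not the released code) that normalizes the query and key projections before computing attention:

```python
# Minimal, illustrative QK-Norm attention block (not the released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Self-attention with layer norm applied to queries and keys (QK-Norm)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(self.head_dim)   # normalizes each query head
        self.k_norm = nn.LayerNorm(self.head_dim)   # normalizes each key head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)        # QK-Norm: bound attention logits
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, d))

# Tiny usage example on random inputs.
attn = QKNormAttention(dim=64, n_heads=4)
out = attn(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```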

Optimization Strategies

To further enhance stability, Chameleon employs several optimization techniques:

  • AdamW Optimizer: configured with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and $\epsilon = 10^{-5}$.
  • z-loss Regularization: Helps mitigate logit drift in the final softmax layer by regularizing the partition function (both techniques are sketched below).
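A hedged PyTorch sketch of both pieces follows; the stand-in model, learning rate, weight decay, and z-loss coefficient are illustrative assumptions, and only the AdamW betas and epsilon come from the description above:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 65_536)    # stand-in module; the real model is the full transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                     # assumed learning rate, for illustration only
    betas=(0.9, 0.95),           # beta_1, beta_2 from the description above
    eps=1e-5,                    # epsilon from the description above
    weight_decay=0.1,            # assumed weight decay
)

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Penalize the softmax partition function to curb final-layer logit drift."""
    log_z = torch.logsumexp(logits, dim=-1)   # log of the partition function Z
    return coeff * (log_z ** 2).mean()        # coeff is an assumed value

# Illustrative use: add the z-loss penalty to the usual cross-entropy loss.
logits = model(torch.randn(4, 16))
targets = torch.randint(0, 65_536, (4,))
loss = nn.functional.cross_entropy(logits, targets) + z_loss(logits)
loss.backward()
optimizer.step()
```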

Evaluation

Chameleon demonstrates impressive capabilities across an array of tasks.

Image-to-Text and Visual Question Answering

The model shows strong performance in image captioning on the COCO and Flickr30k datasets, as well as in visual question answering on the VQAv2 benchmark. Here are some notable results:

  • COCO Captioning: Outperformed Flamingo-80B and IDEFICS-80B models.
  • VQAv2: Achieved competitive scores with other fine-tuned models like Flamingo-80B-FT and IDEFICS-80B-Instruct.

Text-Only Tasks

Chameleon holds its ground on text-only tasks as well. It performs admirably on commonsense reasoning and reading comprehension benchmarks such as PIQA, SIQA, and HellaSwag. It also shows strong results on world knowledge and math benchmarks, especially GSM8k and MATH, rivaling or surpassing models like Llama-2 and Mixtral 8x7B.

Practical Implications

Chameleon's unified approach can be transformative in areas requiring seamless integration of text and imagery, such as:

  • Content Creation: Generation of mixed-modal content with coherent, interleaved text and images.
  • Visual Question Answering: Enhancing interactive AI systems that can answer queries about visual content.
  • Educational Tools: Improving educational applications that explain concepts using a combination of images and text.

Future Directions

Chameleon's architecture and training strategies offer a robust foundation, but there are areas ripe for further exploration:

  • Fine-tuning: More targeted fine-tuning could enhance performance on specific downstream tasks.
  • Expansion to Other Modalities: Incorporating additional data types such as audio or video tokens could make the models even more versatile.
  • Optimization for Real-World Applications: Fine-tuning to improve robustness and efficiency in real-world, multimodal applications.

In summary, Chameleon offers a promising glimpse into the future of multimodal AI, blending textual and visual data in ways that existing models haven't. Its token-based, unified architecture could be a step towards more intelligent and integrated AI systems.
