Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818v1)

Published 16 May 2024 in cs.CL

Abstract: We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.

Summary

The paper introduces a unified token-based architecture that seamlessly fuses image and text data for integrated multimodal processing.
It employs a shared image-and-text tokenization scheme and stabilization techniques, including QK-Norm and a revised layer-norm placement, to keep training stable at scale.
Empirical evaluations show competitive results in image captioning, visual question answering, and text reasoning compared to state-of-the-art models.

Chameleon: The New Contender in Multimodal AI

The paper introduces Chameleon, a family of foundation models that handle both image and text data with a unified, token-based architecture. Let's break down what the model brings to the table and why it matters for data scientists working on multimodal applications.

Overview

Chameleon stands out because it processes text and images in one model rather than stitching separate components together. Traditional multimodal models often employ different encoders or decoders for each modality, which can limit their ability to integrate information across modalities. Chameleon instead adopts a fully token-based approach for both images and text. By quantizing images into discrete tokens, much as text is split into subword tokens, it uses a single transformer architecture to process mixed sequences of text and image tokens.
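To make the idea of a mixed sequence concrete, here is a minimal sketch of how text tokens and quantized image codes could be flattened into one list of IDs for a single transformer. The vocabulary sizes, the ID layout, and the `BOI`/`EOI` sentinel tokens are all hypothetical choices for illustration and are not taken from the paper.

```python
# Toy sizes for illustration only; these are NOT Chameleon's real values.
TEXT_VOCAB_SIZE = 100       # hypothetical number of text token IDs
IMAGE_CODEBOOK_SIZE = 16    # hypothetical number of image codes

# Hypothetical sentinel IDs that mark where an image starts and ends.
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
EOI = BOI + 1


def image_code_to_token(code: int) -> int:
    """Map an image codebook index into the shared ID space (assumed layout)."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def build_mixed_sequence(segments):
    """Flatten ("text", ids) and ("image", codes) segments into one token list."""
    tokens = []
    for kind, values in segments:
        if kind == "text":
            tokens.extend(values)
        elif kind == "image":
            tokens.append(BOI)
            tokens.extend(image_code_to_token(c) for c in values)
            tokens.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return tokens


sequence = build_mixed_sequence([
    ("text", [5, 17, 42]),
    ("image", [3, 0, 15, 7]),
    ("text", [8]),
])
print(sequence)  # [5, 17, 42, 116, 103, 100, 115, 107, 117, 8]
```

Once everything is an integer in one ID space, the model never needs a separate image encoder or decoder: the same embedding table and output softmax cover both modalities.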

This early-fusion approach comes with challenges. Keeping training stable as the model scales required architectural changes and specific training techniques, which we'll explore below.

Key Innovations

Tokenization & Training

A key ingredient is Chameleon's tokenization approach. Images are converted into tokens by a new image tokenizer, which encodes a $512\times512$ image into 1024 discrete tokens drawn from a codebook of 8192 entries. For text, it employs a Byte-Pair Encoding (BPE) tokenizer with a vocabulary size of 65,536, which includes the 8192 image codebook tokens, so text and image tokens share a single vocabulary.
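The image token count implies a simple spatial budget. As a rough sketch, and assuming the 1024 tokens form a square grid (an assumption; only the counts are stated here), each token would summarize a fixed-size pixel patch. The helper name `token_grid` is hypothetical.

```python
import math


def token_grid(image_side: int, num_tokens: int):
    """Return (grid_side, patch_side), assuming a square grid of tokens."""
    grid_side = math.isqrt(num_tokens)
    if grid_side * grid_side != num_tokens:
        raise ValueError("num_tokens is not a perfect square")
    if image_side % grid_side != 0:
        raise ValueError("image side is not divisible by the grid side")
    return grid_side, image_side // grid_side


grid_side, patch_side = token_grid(image_side=512, num_tokens=1024)
pixels_per_token = patch_side * patch_side

print(grid_side, patch_side, pixels_per_token)  # 32 16 256
```

Under that assumption, a $512\times512$ image becomes a $32\times32$ grid in which each token stands for a $16\times16$ pixel region.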

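Pre-training is divided into two stages over an extensive dataset of both text and image data:

Stage 1 covers the first 80% of training and uses a mixture of large-scale text-only, text-image, and interleaved text/image datasets.
Stage 2 covers the final 20%; it down-weights the Stage 1 data and mixes in higher-quality datasets. This is still pre-training, not task-specific fine-tuning.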

Architectural Solutions for Stability

Scaling Chameleon posed stability challenges, particularly when extending beyond 8 billion parameters and 1 trillion tokens. Two architectural modifications were crucial:

Query-Key Normalization (QK-Norm): Applies layer normalization to the query and key vectors inside the attention mechanism, which controls the growth of the inputs to the softmax.
Revised Layer Norm Placement: Following the Swin Transformer, normalization is applied to the output of the attention and feed-forward sublayers before the residual addition, which limits norm growth in the Transformer blocks.
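Both modifications can be illustrated with a small pure-Python sketch. It assumes QK-Norm means layer-normalizing each query and key vector before the dot product, and that the Swin-style placement normalizes a sublayer's output before the residual add. The layer norm here has no learned gain or bias, and the "sublayers" are toy stand-ins with a large gain rather than real attention or feed-forward layers.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def attention_logits(queries, keys, qk_norm):
    """Scaled dot-product logits, optionally with QK-Norm on queries and keys."""
    dim = len(queries[0])
    if qk_norm:
        queries = [layer_norm(q) for q in queries]
        keys = [layer_norm(k) for k in keys]
    return [[sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
            for q in queries]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def pre_norm_block(x, attn, ffn):
    """Pre-norm ordering: normalize each sublayer's input."""
    h = add(x, attn(layer_norm(x)))
    return add(h, ffn(layer_norm(h)))


def swin_style_block(x, attn, ffn):
    """Assumed Swin-style ordering: normalize each sublayer's output."""
    h = add(x, layer_norm(attn(x)))
    return add(h, layer_norm(ffn(h)))


def big_gain(vec):
    """Toy sublayer that blows up activations by a factor of 1000."""
    return [1000.0 * t for t in vec]


# QK-Norm: large-magnitude queries and keys no longer produce huge logits.
queries = [[100.0, -250.0, 30.0, 400.0], [0.1, 0.2, -0.3, 0.05]]
keys = [[90.0, -200.0, 10.0, 350.0], [-5.0, 2.0, 1.0, -1.0]]
raw = attention_logits(queries, keys, qk_norm=False)
normed = attention_logits(queries, keys, qk_norm=True)
max_raw_logit = max(abs(v) for row in raw for v in row)
max_normed_logit = max(abs(v) for row in normed for v in row)

# Norm placement: compare how much a high-gain sublayer inflates the residual stream.
x = [1.0, -2.0, 0.5, 3.0]
pre_out = pre_norm_block(x, big_gain, big_gain)
swin_out = swin_style_block(x, big_gain, big_gain)

print(max_raw_logit, max_normed_logit)
print(max(abs(t) for t in pre_out), max(abs(t) for t in swin_out))
```

In this sketch the normalized logits cannot exceed $\sqrt{d}$ in magnitude (here $d = 4$), and each residual update in the Swin-style block has a bounded norm regardless of the sublayer's gain. A real implementation would add learned scale parameters, which relax these exact bounds.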

Optimization Strategies

To further enhance stability, Chameleon employs several optimization techniques:

AdamW Optimizer: Configured with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and $\epsilon = 10^{-5}$.
z-loss Regularization: Helps mitigate logit drift in the final softmax layer by regularizing the partition function.
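The two techniques above can be sketched for a single scalar parameter and a single softmax. The $\beta_1$, $\beta_2$, and $\epsilon$ defaults come from the text; the learning rate, weight decay, and z-loss coefficient used below are illustrative assumptions, and the z-loss form (a coefficient times $\log^2 Z$, where $Z$ is the softmax partition function) is the commonly used one.

```python
import math


def adamw_step(param, grad, m, v, step,
               lr=1e-4, beta1=0.9, beta2=0.95, eps=1e-5, weight_decay=0.1):
    """One AdamW update for a scalar parameter (lr and weight_decay are assumed)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** step)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** step)          # bias-corrected second moment
    param = param - lr * weight_decay * param            # decoupled weight decay
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def log_partition(logits):
    """Numerically stable log Z = log(sum(exp(logits)))."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(x - peak) for x in logits))


def cross_entropy_with_z_loss(logits, target, z_coeff=1e-5):
    """Return (total, cross_entropy, z_penalty); z_coeff is an assumed value."""
    log_z = log_partition(logits)
    cross_entropy = log_z - logits[target]
    z_penalty = z_coeff * log_z ** 2
    return cross_entropy + z_penalty, cross_entropy, z_penalty


new_param, m, v = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, step=1)

logits = [2.0, 1.0, 0.1]
shifted = [x + 50.0 for x in logits]   # same softmax, drifted logits
_, ce_a, pen_a = cross_entropy_with_z_loss(logits, target=0)
_, ce_b, pen_b = cross_entropy_with_z_loss(shifted, target=0)

print(new_param)
print(ce_a, ce_b)    # cross-entropy ignores the shift
print(pen_a, pen_b)  # the z-loss penalty grows with the shift
```

Adding a constant to every logit leaves the softmax, and therefore the cross-entropy, unchanged, so nothing in the plain loss stops the logits from drifting. The z-loss term penalizes that drift by pulling $\log Z$ toward zero.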

Evaluation

Chameleon is evaluated on image-to-text, visual question answering, and text-only benchmarks.

Image-to-Text and Visual Question Answering

The model shows strong performance in image captioning on the COCO and Flickr30k datasets, and in visual question answering on the VQAv2 benchmark. Here are some notable results:

COCO Captioning: Outperformed the Flamingo-80B and IDEFICS-80B models.
VQAv2: Achieved scores competitive with fine-tuned models such as Flamingo-80B-FT and IDEFICS-80B-Instruct.

Text-Only Tasks

Chameleon holds its ground on text-only tasks as well. It performs well on commonsense reasoning and reading comprehension benchmarks such as PIQA, SIQA, and HellaSwag. For world knowledge and math problems, it also shows strong results, especially on the GSM8k and MATH benchmarks, outperforming Llama-2 and remaining competitive with models such as Mixtral 8x7B.

Practical Implications

Chameleon's unified approach can be transformative in areas requiring seamless integration of text and imagery, such as:

Content Creation: Generation of mixed-modal content with coherent, interleaved text and images.
Visual Question Answering: Enhancing interactive AI systems that can answer queries about visual content.
Educational Tools: Improving educational applications that explain concepts using a combination of images and text.

Future Directions

Chameleon's architecture and training strategies offer a robust foundation, but there are areas ripe for further exploration:

Fine-tuning: More targeted fine-tuning could enhance performance on specific downstream tasks.
Expansion to Other Modalities: Incorporating additional data types such as audio or video tokens could make the models even more versatile.
Optimization for Real-World Applications: Fine-tuning to improve robustness and efficiency in real-world, multimodal applications.

In summary, Chameleon offers a promising glimpse into the future of multimodal AI, blending textual and visual data in ways that existing models haven't. Its token-based, unified architecture could be a step towards more intelligent and integrated AI systems.
