
Abstract

We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.

Transfusion: a unified transformer for perceiving, processing, and generating multi-modal data, spanning discrete text tokens and continuous image patches.

Overview

  • Transfusion is a multi-modal model that integrates text (discrete data) and image (continuous data) generation using a single transformer-based architecture, combining next-token prediction and diffusion processes.

  • The model scales better than discrete-token approaches such as Chameleon, and benefits from modality-specific components, such as U-Net encoder/decoder blocks and larger image patches, that reduce computation costs while maintaining high-quality output.

  • Experiments demonstrate Transfusion's efficiency across various benchmarks and its potential for large-scale applications, showcasing robustness in both text and image generation tasks and suggesting future research directions in generative modeling.

An Analysis of "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model"

The paper presents a method named "Transfusion" that enables training a multi-modal model for both discrete and continuous data using transformers. Generative modeling has traditionally been fragmented, with language models dominating discrete modalities such as text and diffusion models excelling in continuous modalities such as images. This research integrates the two by combining a language modeling objective for text with a diffusion objective for images within a single transformer-based architecture.
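
Concretely, the recipe reduces to a single training loss that sums the two objectives over a mixed-modality sequence. The sketch below is a minimal PyTorch illustration under the assumption that the model's outputs have already been split into text positions and noised image-patch positions; the function and argument names are ours, not the paper's code.

```python
import torch.nn.functional as F

def transfusion_style_loss(lm_logits, text_targets, pred_noise, true_noise, lam=1.0):
    """Combine next-token prediction over text with denoising over image patches.

    lm_logits:    (N_text, vocab_size) logits at text positions
    text_targets: (N_text,)            next-token labels for those positions
    pred_noise:   (N_img, d)           predicted noise at noised image-patch positions
    true_noise:   (N_img, d)           Gaussian noise added by the diffusion forward process
    lam:          balancing coefficient; the default here is illustrative, since the
                  paper treats it as a tunable hyperparameter
    """
    loss_lm = F.cross_entropy(lm_logits, text_targets)   # language-modeling term
    loss_ddpm = F.mse_loss(pred_noise, true_noise)        # epsilon-prediction (DDPM) term
    return loss_lm + lam * loss_ddpm
```

Because both terms come from the same transformer in one forward pass over the mixed sequence, a single backward pass updates all parameters jointly.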

Key Contributions

  1. Unified Multi-modal Framework: Transfusion integrates text and image generation by combining next-token prediction with diffusion in a single architecture. This method is designed to function seamlessly across discrete and continuous data types, accommodating various generation tasks in one model.
  2. Scalability and Performance Gains: The paper demonstrates that Transfusion scales efficiently, outperforming the Chameleon-style approach of quantizing images into discrete tokens across both discrete and continuous modalities. Notably, Transfusion exhibits better scaling behavior, reaching a given level of performance with less compute than Chameleon as data and model size grow.
  3. Modality-Specific Enhancements: Experiments reveal that incorporating modality-specific encoding and decoding layers, such as U-Net blocks for images, further boosts performance, and that larger image patches significantly reduce computation costs while maintaining or even improving output quality (a minimal patchification sketch follows this list).
  4. Evaluation on Multiple Benchmarks: Transfusion's capability is comprehensively evaluated across several benchmarks. These include perplexity on text corpora (Wikipedia and C4), accuracy on the Llama evaluation suite, and FID/CLIP scores for text-to-image tasks. The model also shows efficiency in image-to-text generation, evidenced by high CIDEr scores on the MS-COCO dataset.
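
As a companion to contribution 3, here is a minimal sketch of the simplest patch encoder: a VAE-style latent is cut into non-overlapping patches and each patch is linearly projected to the transformer width. All dimensions and names below are illustrative; the paper's stronger variant replaces the linear map with small U-Net blocks, which is not shown here.

```python
import torch
import torch.nn as nn

class LinearPatchifier(nn.Module):
    """Turn a continuous latent image (B, C, H, W) into a short sequence of
    patch embeddings (B, num_patches, d_model) that can be interleaved with
    text-token embeddings in the same transformer sequence."""

    def __init__(self, latent_channels=8, patch_size=2, d_model=512):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(latent_channels * patch_size * patch_size, d_model)

    def forward(self, latent):
        p = self.patch_size
        b, c, h, w = latent.shape
        # Cut into non-overlapping p x p patches, then flatten each patch.
        x = latent.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, (h // p) * (w // p), c * p * p)
        return self.proj(x)                                    # (B, num_patches, d_model)
```

Larger patch sizes shrink `num_patches` quadratically, which is the compute/quality trade-off studied in the ablations; the abstract's "16 patches per image" setting corresponds to choosing the patch size so that (H/p)·(W/p) = 16.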

Experimental Findings

Controlled Comparison with Chameleon:

  • Using equivalent compute and training setups, Transfusion surpassed Chameleon on all benchmarks. For instance, Transfusion achieved substantially better text-to-image FID scores while requiring significantly fewer training FLOPs.
  • Transfusion also exhibited superior text-only generation, suggesting that handling images with diffusion rather than discrete tokens makes more efficient use of model capacity for language modeling.

Architectural Ablations:

  • Enabling intra-image bidirectional attention yielded significant improvements, particularly in image quality (FID scores); a sketch of the corresponding attention mask follows this list.
  • Varying the patch size showed that larger patches can efficiently trade off computation while maintaining image quality, especially when U-Net layers are used.
  • The U-Net layer as a patch encoder/decoder consistently outperformed simple linear layers, suggesting that the inductive biases of U-Nets are beneficial for high-quality image generation.
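
A way to picture the attention ablation above: Transfusion keeps standard causal attention over the sequence but opens up full bidirectional attention among the patches of each individual image. The sketch below builds such a mask; the function name and the span representation are our own illustration, not the paper's code.

```python
import torch

def causal_plus_intra_image_mask(seq_len, image_spans):
    """Boolean attention mask (True = position may attend): causal everywhere,
    with full bidirectional attention inside each image span.

    image_spans: list of (start, end) index pairs, end exclusive.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()   # causal base
    for start, end in image_spans:
        mask[start:end, start:end] = True                    # patches of one image see each other
    return mask

# Example: a 10-position sequence where positions 3..6 hold one image's patches.
m = causal_plus_intra_image_mask(10, [(3, 7)])
assert m[3, 6].item()        # an earlier patch may attend to a later patch of the same image
assert not m[2, 5].item()    # text before the image still cannot attend forward
```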

Large-scale Model Evaluation:

  • A scaled-up 7B parameter model was trained on a mix of 2T tokens comprising both text and image data. This model outperformed several state-of-the-art image generation models (e.g., SDXL, DeepFloyd) on the GenEval benchmark.
  • The model retained robust text generation abilities, akin to Llama models, validating Transfusion's versatility in handling both text and image modalities effectively.

Practical and Theoretical Implications

Practical:

  • Transfusion is a significant step towards developing more integrated and efficient multi-modal models, reducing the need for separate systems for text and image generation.
  • It has potential applications in areas requiring detailed and coherent inter-modal generation, such as creative content creation, automated reporting, and sophisticated AI-assisted design.

Theoretical:

  • This work furthers our understanding of how different model architectures and training objectives can be harmonized to accommodate diverse data types within a unified framework.
  • Investigating the reasons behind the better scaling properties of Transfusion could offer insights into new methods for optimizing large-scale generative models across various data modalities.

Future Research Directions

The paper hints at several promising research directions stemming from its findings:

  • Combination with Other Generative Techniques: Exploring advanced generative modeling frameworks, such as flow matching, could further enhance the model's capabilities.
  • Parameter Sharing and Optimization: Understanding the intricacies of parameter utilization between modalities could lead to more efficient multi-modal architectures.
  • Extended Modalities: Expanding Transfusion to include audio and video data, incorporating multi-modal interactions beyond text and image, would be a natural progression.
  • Adaptive Training Objectives: Dynamically tuning the balancing coefficient ($\lambda$) between the language-modeling and diffusion losses during training (see the loss form below) could further optimize performance across tasks and datasets.
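
For reference, the balancing coefficient enters the objective as a simple weighted sum of the two per-modality losses (the notation below mirrors the description in this summary rather than quoting the paper verbatim):

$$\mathcal{L}_{\text{Transfusion}} = \mathcal{L}_{\text{LM}} + \lambda \, \mathcal{L}_{\text{DDPM}}$$

An adaptive objective would make $\lambda$ a function of the training step or the data mix rather than a fixed constant.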

In conclusion, Transfusion presents a significant advancement in multi-modal generative modeling, setting a new benchmark in the seamless integration of discrete and continuous data generation within a unified framework. This work lays the foundation for future innovations in multi-modal AI, promising more cohesive and versatile generative models.
