Abstract

We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Code and models are released at https://github.com/showlab/Show-o.

Overview

  • Show-o presents a novel unified transformer architecture that integrates multimodal understanding and generation within a single model, leveraging both autoregressive and discrete diffusion modeling for efficient and versatile performance.

  • The model employs innovative techniques such as discrete image tokenization and an omni-attention mechanism to handle diverse vision-language tasks seamlessly, demonstrating superior or competitive results across various benchmarks.

  • Show-o's ability to reduce sampling steps significantly and support diverse downstream applications without fine-tuning highlights its practical value and implications for future multimodal systems.

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

In recent years, significant strides have been made in multimodal understanding and generation. However, conventional approaches tend to tackle these two domains separately, using distinct models for tasks such as visual question answering (VQA) and text-to-image generation. This paper proposes a novel unified transformer, termed Show-o, that aims to integrate multimodal understanding and generation within a single model framework.

Key Contributions

  1. Unified Transformer Architecture: Show-o employs a unified transformer architecture that encapsulates both autoregressive and discrete diffusion modeling. The model integrates functionalities for image and text processing, thereby handling diverse vision-language tasks seamlessly.
  2. Discrete Representation Modeling: Instead of relying on continuous image representations, Show-o models discrete image tokens and generates them with a discrete denoising diffusion strategy. This hybrid approach capitalizes on the strengths of autoregressive modeling for text and parallel, diffusion-style decoding for images.
  3. Comprehensive Performance: Across various benchmarks, Show-o demonstrates performance metrics on par with, or exceeding, those achieved by specialized individual models. This includes tasks like VQA, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation.
  4. Accelerated Sampling: Notably, the model requires roughly 20 times fewer sampling steps than purely autoregressive models for image generation, underlining its efficiency (see the decoding sketch after this list).
  5. Versatility in Downstream Applications: Show-o natively supports diverse downstream applications without any fine-tuning, including text-guided inpainting and extrapolation, as well as mixed-modality generation such as interleaved video keyframe generation with text descriptions.
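
The speed-up in item 4 comes from decoding image tokens in parallel rather than one at a time. Below is a minimal sketch of such iterative mask-token decoding, assuming a hypothetical `model` that returns per-position logits and a placeholder `MASK_ID`; the confidence-based reveal schedule is illustrative, not the paper's exact procedure.

```python
import torch

MASK_ID = 8255  # placeholder id for the [MASK] image token (illustrative)

@torch.no_grad()
def sample_image_tokens(model, prompt_ids, num_image_tokens=1024, steps=50):
    """Iterative mask-token decoding: every image position starts as [MASK] and
    groups of positions are revealed in parallel over a small number of
    refinement steps, instead of one token per forward pass."""
    image_ids = torch.full((1, num_image_tokens), MASK_ID, device=prompt_ids.device)

    for step in range(steps):
        logits = model(torch.cat([prompt_ids, image_ids], dim=1))  # (1, L, vocab)
        probs = logits[:, -num_image_tokens:, :].softmax(dim=-1)   # image positions only
        confidence, candidates = probs.max(dim=-1)

        # Only still-masked positions are candidates for being revealed.
        still_masked = image_ids == MASK_ID
        num_masked = int(still_masked.sum())
        if num_masked == 0:
            break
        confidence = torch.where(still_masked, confidence, torch.full_like(confidence, -1.0))

        # Reveal the most confident slice of the remaining masked tokens
        # (a simple linear schedule; the paper's schedule may differ).
        num_to_reveal = max(1, num_masked // (steps - step))
        reveal_idx = confidence.topk(num_to_reveal, dim=-1).indices
        image_ids.scatter_(1, reveal_idx, candidates.gather(1, reveal_idx))

    return image_ids
```

Because every masked position is predicted in each forward pass, the number of model calls is governed by `steps` rather than by the number of image tokens, which is where the order-of-magnitude reduction in sampling steps comes from.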

Methodology

The methodology underpinning Show-o encompasses several critical components:

Tokenization and Input Formatting

The model utilizes both text and image tokenizers to convert input data into discrete tokens. This allows for a unified processing framework that coherently handles various modalities. A unified prompting strategy ensures that different types of input data are formatted into structured sequences.
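
As a concrete illustration of this prompting strategy, the sketch below packs discrete text and image tokens into one structured sequence using task and delimiter tokens. The token names and the exact ordering are assumptions for illustration rather than the paper's verbatim format.

```python
# Special tokens for the unified prompting sketch; names and ordering are
# illustrative assumptions, not the paper's exact specification.
MMU, T2I = "[MMU]", "[T2I]"  # task tokens: multimodal understanding vs. text-to-image
SOT, EOT = "[SOT]", "[EOT]"  # start/end of the text segment
SOI, EOI = "[SOI]", "[EOI]"  # start/end of the image segment

def format_sequence(task, text_tokens, image_tokens):
    """Pack discrete text and image tokens into one structured sequence so the
    same transformer can consume either task with a single input format."""
    if task == "t2i":  # text-to-image: prompt first, then (masked) image tokens
        return [T2I, SOT, *text_tokens, EOT, SOI, *image_tokens, EOI]
    if task == "mmu":  # understanding: image first, then the question text
        return [MMU, SOI, *image_tokens, EOI, SOT, *text_tokens, EOT]
    raise ValueError(f"unknown task: {task}")

# Toy usage with placeholder tokens:
seq = format_sequence("t2i", ["a", "red", "bicycle"], ["<img_17>", "<img_402>"])
```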

Omni-Attention Mechanism

Central to Show-o's architecture is the omni-attention mechanism, which adaptively employs causal attention for text tokens and full attention for image tokens within a unified sequence. This mixed-attention strategy enables efficient and coherent processing of multimodal data.
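
A minimal sketch of how such a mixed mask could be built is shown below, assuming a boolean vector that marks image-token positions; a real implementation would also handle batching, padding, and multiple image segments.

```python
import torch

def omni_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Build the mixed attention mask: text tokens attend causally (to earlier
    positions only), while image tokens additionally attend to every other
    image token in the sequence (full attention among image positions).

    is_image: bool tensor of shape (L,) marking image-token positions.
    Returns a bool mask of shape (L, L); True means attention is allowed.
    """
    L = is_image.shape[0]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))     # standard causal mask
    image_full = is_image.unsqueeze(0) & is_image.unsqueeze(1)  # image-to-image pairs
    return causal | image_full

# Example: three text tokens followed by four image tokens.
mask = omni_attention_mask(torch.tensor([False, False, False, True, True, True, True]))
```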

Training Objectives

Show-o is trained using a combination of Next Token Prediction (NTP) and Mask Token Prediction (MTP) objectives. This training regime ensures that the model can perform both autoregressive next-token generation and discrete diffusion for masked token prediction, thereby facilitating both multimodal understanding and generation tasks.
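
The sketch below combines the two objectives as a weighted sum of cross-entropy losses. The tensor layouts, the boolean position masks, and the weight `alpha` are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the cross-entropy loss

def training_loss(logits, input_ids, text_positions, image_positions, alpha=1.0):
    """Combined objective sketch: Next Token Prediction (NTP) on text positions
    plus Mask Token Prediction (MTP) on masked image positions.

    logits: (B, L, V) model outputs for a sequence whose image tokens were
    (partially) replaced by [MASK]; input_ids: (B, L) clean target tokens;
    text_positions / image_positions: bool masks of shape (B, L), where
    image_positions marks the positions that were masked out.
    """
    # NTP: predict token t+1 from position t, restricted to text targets.
    ntp_targets = input_ids[:, 1:].clone()
    ntp_targets[~text_positions[:, 1:]] = IGNORE_INDEX
    ntp_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        ntp_targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )

    # MTP: recover the original token at each masked image position.
    mtp_targets = input_ids.clone()
    mtp_targets[~image_positions] = IGNORE_INDEX
    mtp_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        mtp_targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
    return ntp_loss + alpha * mtp_loss
```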

Multi-Stage Training Pipeline

The training pipeline is designed to progressively and effectively align the model for multimodal tasks. The stages include:

  1. Initial pre-training to learn image token embeddings and pixel-level dependencies.
  2. Image-text alignment for multimodal tasks.
  3. Fine-tuning with high-quality datasets for specific multimodal tasks.
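
A plain configuration skeleton of this staged schedule might look as follows; the stage names and goals simply paraphrase the list above and are not exact settings from the paper.

```python
# Stage-wise training schedule as a configuration sketch; the fields paraphrase
# the three stages listed above, not exact settings from the paper.
TRAINING_STAGES = [
    {"name": "stage1_pretraining",
     "goal": "learn image token embeddings and pixel-level dependencies"},
    {"name": "stage2_alignment",
     "goal": "align image and text representations for multimodal tasks"},
    {"name": "stage3_finetuning",
     "goal": "fine-tune on high-quality data for specific multimodal tasks"},
]

def run_training(model, stages=TRAINING_STAGES):
    """Run the stages in order, reusing the same weights so each stage builds
    on the previous one; the actual train step is elided."""
    for stage in stages:
        print(f"{stage['name']}: {stage['goal']}")
        # train_one_stage(model, stage) would go here in a real pipeline.
    return model
```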

Experimental Results

Multimodal Understanding: When benchmarked against state-of-the-art models such as LLaVA-v1.5, InstructBLIP, and mPLUG-Owl2, Show-o achieves competitive or superior results across benchmarks including POPE, MME, Flickr30k, and VQAv2.

Visual Generation: On the MSCOCO benchmark, Show-o outperforms several larger models in terms of zero-shot FID, highlighting its efficacy in generation tasks. Similarly competitive performance is observed on the GenEval benchmark across dimensions covering object composition, color attribution, and spatial positioning.

Mixed-Modality Generation: Show-o extends its capabilities to mixed-modality generation, producing consistent video keyframes conditioned on textual descriptions. This provides a promising approach for long-form video generation applications.

Implications and Future Work

The integration of autoregressive and diffusion modeling within a unified transformer architecture paves the way for more versatile and efficient multimodal models. The ability to handle a wide array of vision-language tasks within a single model holds significant practical value, particularly for applications requiring consistent and coherent output across different modalities.

Future work could focus on scaling the model further to enhance performance and explore the potential of Show-o in more complex multimodal scenarios, such as continuous video generation and interactive multimodal systems.

Conclusion

Show-o represents a significant step towards unified multimodal models, demonstrating that a single transformer can effectively handle both understanding and generation tasks. The novel integration of autoregressive and discrete diffusion modeling within this framework paves the way for more efficient and versatile multimodal systems, potentially setting a new foundation for future advancements in this domain.
