Autoregressive Image Generation without Vector Quantization

(arXiv 2406.11838)
Published Jun 17, 2024 in cs.CV

Abstract

Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications.

Figure: Examples of class-conditional generation on ImageNet 256×256 using MAR-H with Diffusion Loss.

Overview

  • The paper introduces a novel approach to autoregressive image generation that bypasses the need for vector quantization by using a continuous-valued token space and a diffusion-based loss function called Diffusion Loss.

  • The authors detail their methodology: the autoregressive network predicts a conditioning vector z for each token, and a small denoising diffusion network models the per-token distribution, with the diffusion process applied during both training and inference.

  • Experimental results demonstrate substantial improvements in image quality metrics, such as Fréchet inception distance (FID), and highlight the flexibility of the approach across different tokenizer configurations.

An Analytical Overview of "Autoregressive Image Generation without Vector Quantization"

The paper "Autoregressive Image Generation without Vector Quantization" by Tianhong Li et al. proposes a novel methodology for image generation, challenging the conventional approach that relies on discrete-valued vector-quantized representations. The authors present a diffusion-based loss function, named Diffusion Loss, enabling autoregressive models to operate in continuous-valued token spaces. This paradigm shift facilitates the elimination of discrete tokenizers, potentially addressing several inherent limitations associated with vector quantization.

Theoretical Framework

Autoregressive models have predominantly been associated with discrete token spaces, as evidenced by their prolific use in NLP tasks. Extending these models to continuous domains, such as images, has commonly necessitated the use of vector quantization to discretize data. This paper questions the necessity of such an approach.

The authors propose using diffusion models to represent per-token probability distributions. Diffusion models, traditionally used to model the joint distribution of all pixels or all tokens of an image, are here applied at the per-token level: the autoregressive network predicts a conditioning vector z for each token, and a small denoising diffusion network models the distribution p(x|z). This replaces the categorical cross-entropy loss typically used for discrete-token prediction with a Diffusion Loss function.
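In the paper's notation, with x a continuous-valued token and z the vector predicted for it by the autoregressive network, Diffusion Loss is the standard denoising objective conditioned on z:

$$
\mathcal{L}(z, x) = \mathbb{E}_{\varepsilon, t}\left[\left\| \varepsilon - \varepsilon_\theta(x_t \mid t, z) \right\|^2\right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon,
\quad \varepsilon \sim \mathcal{N}(0, I),
$$

where the noise estimator $\varepsilon_\theta$ is a small network (an MLP in the paper) and $\bar{\alpha}_t$ defines the noise schedule.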

Methodology

The proposed Diffusion Loss is formulated using denoising diffusion principles. Specifically, a small denoising network is trained to predict the noise added to each continuous-valued token, conditioned on the vector z; at inference time, running the reverse diffusion process conditioned on z samples the token. Because the loss is differentiable, gradients backpropagate through z into the autoregressive backbone, and the diffusion head enables flexible sampling strategies, such as temperature-based controls for diversity.
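A minimal PyTorch sketch of this training objective follows. It is illustrative rather than the authors' implementation: the NoiseMLP head, the linear noise schedule, and all dimensions are assumptions.

```python
# Sketch of Diffusion Loss (illustrative, not the authors' code): a small
# MLP predicts the noise added to a continuous token x, conditioned on the
# vector z produced by the autoregressive backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000  # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)           # simple linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

class NoiseMLP(nn.Module):
    """Tiny denoising head: predicts epsilon from (x_t, t, z)."""
    def __init__(self, token_dim, cond_dim, hidden=1024):
        super().__init__()
        self.t_embed = nn.Embedding(T, hidden)
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x_t, t, z):
        h = torch.cat([x_t, z, self.t_embed(t)], dim=-1)
        return self.net(h)

def diffusion_loss(denoiser, x, z):
    """L(z, x) = E_{eps,t} || eps - eps_theta(x_t | t, z) ||^2."""
    b = x.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x)
    a = alpha_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps   # forward (noising) process
    return F.mse_loss(denoiser(x_t, t, z), eps)
```

Since the loss is a differentiable function of z, the diffusion head and the autoregressive backbone can be trained jointly end to end.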

The paper further explores an advanced framework that generalizes standard autoregressive (AR) models into masked autoregressive (MAR) models, which predict multiple tokens simultaneously in a randomized order. This framework capitalizes on bidirectional transformer architectures, offering richer communication pathways across tokens than the causal transformers conventionally used in autoregressive setups.
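The following sketch illustrates MAR-style iterative decoding under stated assumptions: the backbone and sample_from_diffusion callables, the cosine unmasking schedule, and the step count are hypothetical stand-ins, not the authors' code.

```python
# Illustrative sketch of masked autoregressive (MAR) decoding. A
# bidirectional transformer sees all positions at once (generated tokens
# plus mask placeholders), emits a conditioning vector z per position, and
# the diffusion head samples continuous tokens for a subset of masked
# positions at each step.
import math
import torch

def mar_decode(backbone, sample_from_diffusion, seq_len, token_dim, steps=64):
    tokens = torch.zeros(seq_len, token_dim)        # continuous token buffer
    known = torch.zeros(seq_len, dtype=torch.bool)  # which positions are set

    for step in range(steps):
        # Cosine schedule: fraction of tokens that should remain masked.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        n_keep_masked = int(frac * seq_len)
        n_unmask = (~known).sum().item() - n_keep_masked
        if n_unmask <= 0:
            continue

        # Bidirectional pass over all positions; the backbone is assumed
        # to substitute a learned mask embedding where known is False.
        z = backbone(tokens, known)                 # (seq_len, cond_dim)

        # Pick a random subset of still-masked positions to generate now.
        masked_idx = torch.nonzero(~known).squeeze(-1)
        pick = masked_idx[torch.randperm(len(masked_idx))[:n_unmask]]

        # Sample continuous tokens from p(x | z) via the diffusion head.
        tokens[pick] = sample_from_diffusion(z[pick])
        known[pick] = True

    return tokens
```

Predicting several tokens per bidirectional step, rather than one token per causal step, is what underlies the speed advantage reported below.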

Experimental Validation

A series of rigorous experiments demonstrates the efficacy of Diffusion Loss as applied to both AR and MAR models. Notable improvements are documented in Fréchet inception distance (FID) scores: with Diffusion Loss, FID improves from 8.75 to 3.43 when transitioning from causal to bidirectional attention, and from 13.07 to 3.43 when shifting from raster-order to random-order generation under bidirectional attention.

Moreover, the flexibility of Diffusion Loss allows the integration of tokenizers beyond vector-quantized formats, including continuous-valued tokenizers regularized with a KL-divergence term and tokenizers with larger strides. The experiments show consistent gains and flexibility across these tokenizer architectures and configurations.

Strong Numerical Results and Implications

The paper provides compelling numerical evidence of the model's performance:

  • The large masked autoregressive model (MAR-L) with Diffusion Loss achieves an FID of 2.60 without classifier-free guidance (CFG), improving to 1.78 with CFG, on ImageNet 256×256.
  • The method generates images rapidly, at under 0.3 seconds per image, while maintaining an FID below 2.0 in MAR settings.

The theoretical implications are significant: decoupling autoregressive models from discrete token spaces opens new avenues for using continuous representations. Practically, the approach promises more robust models and streamlined training by avoiding the complexities of discrete quantization, such as codebook learning and non-differentiable lookups.

Future Directions and Speculative Insights

The research highlights the uncharted potential of combining autoregressive token interdependency with token diffusion modeling. This framework suggests promising extensions in various continuous domains, ranging from enhanced image generation to potentially pioneering applications in video generation or other high-dimensional continuous data forms. Given the inherent flexibility and scalability demonstrated, future investigations could focus on improving tokenizers' efficiency, exploring larger model architectures, and applying this framework to broader datasets and tasks beyond image generation.

Conclusion

The paper "Autoregressive Image Generation without Vector Quantization" by Tianhong Li et al. lays a foundational framework that challenges existing paradigms in autoregressive image generation. By introducing Diffusion Loss and extending autoregressive models to continuous-valued domains, it provides a robust, scalable, and efficient methodology. This significant contribution is poised to stimulate further research and application development within the field of artificial intelligence, particularly in generative modeling.
