Autoregressive Image Generation using Residual Quantization (2203.01941v2)

Published 3 Mar 2022 in cs.CV and cs.LG

Abstract: For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence length is important for an AR model to reduce its computational costs to consider long-range interactions of codes. However, we postulate that previous VQ cannot shorten the code sequence and generate high-fidelity images together in terms of the rate-distortion trade-off. In this study, we propose the two-stage framework, which consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. Then, RQ-Transformer learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a 256$\times$256 image as 8$\times$8 resolution of the feature map, and RQ-Transformer can efficiently reduce the computational costs. Consequently, our framework outperforms the existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also has a significantly faster sampling speed than previous AR models to generate high-quality images.

Summary

  • The paper introduces a novel two-stage AR framework leveraging Residual-Quantized VAE to overcome traditional VQ limitations and enhance image fidelity.
  • The RQ-Transformer employs spatial and depth transformers to efficiently model and predict stacked discrete codes for high-resolution images.
  • Experimental results demonstrate state-of-the-art performance, with improved FID and IS scores and up to 7.3x faster sampling than existing methods.

Autoregressive Image Generation using Residual Quantization

Introduction

This paper presents a novel two-stage framework for autoregressive (AR) image generation leveraging Residual Quantization (RQ). The primary motivation is to address the limitations of traditional Vector Quantization (VQ), which poses a trade-off between reducing the sequence length of codes and maintaining image fidelity due to the rate-distortion constraint. The proposed framework consists of the Residual-Quantized VAE (RQ-VAE) and the RQ-Transformer, designed to generate high-resolution images efficiently.

Methodology

Residual-Quantized VAE

The RQ-VAE replaces traditional VQ with a residual quantization technique to avoid the need for a prohibitively large codebook. RQ quantizes the feature map of an image into a stack of discrete codes, allowing a precise approximation with a fixed codebook size. This is achieved by recursively quantizing the residual errors, yielding a coarse-to-fine approximation of the feature map.

The RQ process can be defined as follows:

  1. Start with the feature vector $z$ and initialize the residual $r_0 = z$.
  2. For each depth $d = 1, \dots, D$, select the code nearest to the current residual from a shared codebook, $k_d = \arg\min_k \lVert r_{d-1} - e(k) \rVert$, and update the residual, $r_d = r_{d-1} - e(k_d)$.
  3. Accumulate the embeddings of these codes to approximate the original vector: $\hat{z} = \sum_{d=1}^{D} e(k_d)$ (see the sketch below).
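
To make the recursion concrete, here is a minimal NumPy sketch of the quantization loop. The function name and the random codebook are illustrative only (RQ-VAE learns its codebook end-to-end); with a trained codebook, the reconstruction error shrinks as depth grows:

```python
import numpy as np

def residual_quantize(z, codebook, depth):
    """Quantize feature vector z into `depth` codes by recursively
    quantizing the residual against a shared (K, n) codebook."""
    codes = []
    residual = z.copy()                      # r_0 = z
    z_hat = np.zeros_like(z)
    for _ in range(depth):
        # k_d: index of the codebook embedding nearest to the residual
        k = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(k)
        z_hat = z_hat + codebook[k]          # accumulate e(k_d)
        residual = residual - codebook[k]    # r_d = r_{d-1} - e(k_d)
    return codes, z_hat                      # z_hat = sum_d e(k_d)

# A random codebook only illustrates the mechanics; a trained codebook
# makes the coarse-to-fine error reduction much more pronounced.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 16))        # K=256 codes of dim n=16
z = rng.normal(size=16)
for D in (1, 2, 4, 8):
    _, z_hat = residual_quantize(z, codebook, D)
    print(f"D={D}: error {np.linalg.norm(z - z_hat):.3f}")
```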

The RQ-VAE encodes an image into a reduced-resolution feature map, enabling more efficient AR modeling of images (Figure 1).

Figure 1: An overview of our two-stage image generation framework composed of RQ-VAE and RQ-Transformer. In stage 1, RQ-VAE uses the residual quantizer to represent an image as a stack of $D=4$ codes. After the stacked map of codes is reshaped, RQ-Transformer predicts the $D$ codes at the next position.

RQ-Transformer

The RQ-Transformer is designed to efficiently predict the quantized feature vectors provided by RQ-VAE. It utilizes a combination of spatial and depth transformers to handle the reduced sequence length efficiently.

  • Spatial Transformer: Processes the sequence of input feature maps to generate context vectors, capturing information from previous positions.
  • Depth Transformer: Autoregressively predicts the multiple depth codes at each position, utilizing the context vectors generated by the spatial transformer (sketched below).
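
The two components can be made concrete with a minimal PyTorch sketch. Everything here (class name, layer counts, dimensions, the learned start token) is an illustrative placeholder rather than the paper's released implementation; the point is the factorization into a spatial transformer over positions and a depth transformer over the code stack:

```python
import torch
import torch.nn as nn

class RQTransformerSketch(nn.Module):
    """Minimal sketch of the spatial/depth factorization. Layer counts,
    dimensions, and names are placeholders, not the paper's configuration."""

    def __init__(self, vocab=16384, dim=256, T=64, D=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.start = nn.Parameter(torch.zeros(1, 1, dim))   # start token
        self.pos_spatial = nn.Parameter(torch.zeros(T, dim))
        self.pos_depth = nn.Parameter(torch.zeros(D, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.spatial = nn.TransformerEncoder(layer, num_layers=4)
        self.depth = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, codes):                    # codes: (B, T, D) int64
        B, T, D = codes.shape
        causal = lambda n: torch.triu(torch.ones(n, n, dtype=torch.bool), 1)
        # Spatial transformer: sum the D code embeddings at each position
        # into a single token, so attention runs over T tokens, not T*D.
        tok = self.embed(codes).sum(dim=2)       # (B, T, dim)
        tok = torch.cat([self.start.expand(B, 1, -1), tok[:, :-1]], dim=1)
        ctx = self.spatial(tok + self.pos_spatial[:T], mask=causal(T))
        # Depth transformer: at each position, predict the D codes one by
        # one, conditioned on that position's context vector.
        emb = self.embed(codes)                  # (B, T, D, dim)
        inp = torch.cat([ctx.unsqueeze(2), emb[:, :, :-1]], dim=2)
        inp = (inp + self.pos_depth[:D]).reshape(B * T, D, -1)
        out = self.depth(inp, mask=causal(D)).reshape(B, T, D, -1)
        return self.head(out)                    # logits: (B, T, D, vocab)
```

During sampling, the two parts alternate: the spatial transformer is run once per position to produce a context vector, the depth transformer then samples the $D$ codes at that position, and their summed embeddings form the next spatial input.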

This architecture reduces computational complexity compared to a naïve transformer that flattens the code stack into one long sequence, since each sub-transformer attends over a much shorter sequence.
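
To see where the savings come from, consider a rough attention-cost count (ignoring per-layer constants and hidden dimensions). A naïve AR transformer flattens the stacked code map into a sequence of $T \cdot D$ tokens, so self-attention scales on the order of $(TD)^2$. The RQ-Transformer instead runs the spatial transformer over $T$ positions and the depth transformer over $D$ codes at each position, scaling on the order of $T^2 + T D^2$. With $T = 64$ (an $8\times 8$ code map) and $D = 4$, that is $65{,}536$ versus $5{,}120$ pairwise attention terms, roughly a $13\times$ reduction.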

Experimental Results

The proposed framework achieves superior performance on various benchmarks for both unconditional and conditional image generation tasks compared to existing AR models.

  • Unconditional Image Generation: Achieves lower FIDs on datasets such as LSUN and FFHQ, showcasing improved image quality and diversity (Figure 2).

    Figure 2: Examples of our conditional generation for 256×256 images. The images in the first row are generated from ImageNet class conditions; the images in the second row are generated from text conditions.

  • Conditional Image Generation: Demonstrates significant improvements over prior methods on ImageNet with notable FID and IS scores, establishing state-of-the-art performance when combined with techniques like rejection sampling (Figure 3; a sketch of this step follows below).

Figure 3: Additional examples of conditional image generation by the RQ-Transformer with 1.4B parameters trained on ImageNet.
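
The rejection-sampling step mentioned above can be sketched as follows. Here `generate` and `classifier` are hypothetical callables standing in for the trained generator and a pretrained ImageNet classifier; the paper does not publish this exact interface:

```python
import torch

@torch.no_grad()
def rejection_sample(generate, classifier, class_id, n_draw=100, n_keep=25):
    """Draw n_draw samples and keep the n_keep that the classifier
    rates most likely to belong to class_id."""
    imgs = generate(n_draw, class_id)                 # (n_draw, 3, H, W)
    scores = classifier(imgs).softmax(dim=-1)[:, class_id]
    keep = scores.argsort(descending=True)[:n_keep]
    return imgs[keep]
```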

Efficiency and Ablation Study

The computational efficiency of the RQ-Transformer is highlighted by achieving fast image generation speeds while maintaining high-quality output. Ablation studies reveal the impact of various architectural choices, such as the depth of residual quantization and using a shared codebook for all quantization steps.

  • Sampling Speed: Demonstrated up to 7.3x faster generation compared to previous AR models, especially as batch size increases (Figure 4).

    Figure 4: The sampling speed of RQ-Transformer with 1.4B parameters according to batch size and code map shape.

  • Coarse-to-Fine Approximation: Validates the hypothesis that increasing the quantization depth in RQ-VAE improves reconstruction quality, supporting high-fidelity generation (Figure 5).

    Figure 5: Examples of coarse-to-fine approximation by RQ-VAE.

Conclusion

The integration of RQ-VAE and RQ-Transformer offers a robust solution for high-resolution image synthesis with AR models, addressing the limitations of traditional VQ methods. The precise approximation of feature maps and efficient modeling of code sequences enable the framework to outperform existing models in quality and speed on major image generation benchmarks. Future work could further explore regularization techniques for small datasets and scaling models for text-conditioned generation tasks.
