Abstract

We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of LLMs to the visual generation domain. It is an affirmative answer to the question of whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaled properly. We reexamine the design space of image tokenizers, the scalability properties of image generation models, and the quality of their training data. This exploration yields: (1) an image tokenizer with a downsample ratio of 16, a reconstruction quality of 0.94 rFID, and 97% codebook usage on the ImageNet benchmark; (2) a series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256×256 benchmark and outperforming popular diffusion models such as LDM and DiT; (3) a text-conditional image generation model with 775M parameters, trained in two stages on LAION-COCO and on high-aesthetic-quality images, demonstrating competitive visual quality and text alignment; (4) verification that LLM serving frameworks are effective at optimizing the inference speed of image generation models, achieving a 326% to 414% speedup. We release all models and code to support the open-source community in visual generation and multimodal foundation models.

Figure: Class-conditional (top) and text-conditional (bottom) image generation samples using vanilla autoregressive models.

Overview

  • The paper explores the capabilities of vanilla autoregressive models, specifically those using the Llama architecture, in generating high-quality images, challenging the dominance of diffusion models.

  • Key contributions include the development of an advanced image tokenizer, scalable class- and text-conditional image generation models, and optimization techniques for inference speed, demonstrating superior performance across various benchmarks.

  • Extensive experimental evaluations reveal the strengths and potential limitations of the approach, underscoring the importance of high-quality data and optimized model scaling in achieving state-of-the-art results.

Analyzing the Potential of Vanilla Autoregressive Models in the Domain of Scalable Image Generation

The paper "Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation" by Peize Sun et al. investigates the capabilities of vanilla autoregressive models, specifically those using the Llama architecture, in generating high-quality images. The research answers the pivotal question of whether autoregressive models without inductive biases on visual signals can outperform the widely-used diffusion models in the image generation domain if appropriately scaled.

Key Contributions

The authors make several significant contributions, summarized as follows:

Image Tokenizer:

  • Developed an image tokenizer with a downsample ratio of 16, achieving a reconstruction quality of 0.94 rFID and 97% codebook usage on the ImageNet benchmark.
  • A variant with a downsample ratio of 8 is also competitive, showing that discrete representation is no longer a bottleneck in image reconstruction (see the quantization sketch after this list).
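
To make the codebook discussion concrete, here is a minimal PyTorch sketch of the quantization step in a VQ-style tokenizer. The code dimension of 8 matches the ablation discussed below and the codebook size of 16384 is the paper's reported choice, but the class itself, its names, and the normalized-lookup details are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of the quantization step in a VQ image tokenizer.
# Illustrative only: names and structure are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

class VectorQuantizer(torch.nn.Module):
    def __init__(self, codebook_size=16384, code_dim=8):
        super().__init__()
        # Low-dimensional codes (dim 8) are reported to improve both
        # reconstruction quality and codebook usage.
        self.codebook = torch.nn.Embedding(codebook_size, code_dim)

    def forward(self, z):
        # z: (B, N, code_dim) encoder features, one vector per image patch.
        # L2-normalize features and codes so nearest-neighbor lookup
        # reduces to maximum cosine similarity.
        z_n = F.normalize(z, dim=-1)
        c_n = F.normalize(self.codebook.weight, dim=-1)
        indices = (z_n @ c_n.t()).argmax(dim=-1)   # (B, N) discrete token ids
        z_q = self.codebook(indices)               # quantized features
        # Straight-through estimator so gradients flow back to the encoder
        # (commitment and codebook losses are omitted for brevity).
        z_q = z + (z_q - z).detach()
        return z_q, indices
```

With a downsample ratio of 16, a 256×256 image becomes a 16×16 grid, i.e. 256 token ids that the autoregressive model later predicts one at a time.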

Scalable Image Generation Models:

  • Introduced a series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving a FID of 2.18 on the ImageNet 256×256 benchmark, thereby outperforming popular diffusion models such as LDM and DiT (a minimal sampling sketch follows).
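
Class-conditional generation then reduces to standard next-token prediction over these token ids. The following sketch shows one plausible sampling loop; `model` (a Llama-style decoder over image-token ids) and the hyperparameters are hypothetical stand-ins, not the paper's exact interface.

```python
import torch

@torch.no_grad()
def sample_image_tokens(model, class_id, seq_len=256, temperature=1.0, top_k=1000):
    tokens = torch.tensor([[class_id]])          # class token as the prefix
    for _ in range(seq_len):                     # e.g. 16x16 = 256 image tokens
        logits = model(tokens)[:, -1, :] / temperature
        if top_k is not None:                    # keep only the top-k logits
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]                         # drop the class prefix
```

The sampled ids are then decoded back to pixels by the tokenizer's decoder.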

High-Quality Training Data:

  • Developed a text-conditional image generation model with 775M parameters. This model, trained on a subset of LAION-COCO and fine-tuned on high-aesthetic-quality images, demonstrated competitive performance in both visual quality and text alignment.

Optimized Inference Speed:

  • Verified the efficacy of LLM serving frameworks, such as vLLM, in optimizing inference speed for image generation, achieving a speedup of 326% to 414% (a toy illustration of the underlying mechanism follows).
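
The speedup comes largely from standard LLM serving optimizations such as KV caching, which apply unchanged because image tokens are decoded exactly like text tokens. The toy loop below illustrates the mechanism; `model` and its `past_kv` interface are hypothetical, and real frameworks such as vLLM add paged KV memory and continuous batching on top.

```python
import torch

@torch.no_grad()
def decode_with_kv_cache(model, prefix, steps):
    # `model` is a hypothetical callable returning (logits, new_cache).
    tokens, cache, out = prefix, None, []
    for _ in range(steps):
        # With a cache, each step feeds only the newest token(s): attention
        # reuses stored keys/values instead of recomputing the whole prefix.
        logits, cache = model(tokens, past_kv=cache)
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        out.append(next_tok)
        tokens = next_tok  # only the new token is fed on the next step
    return torch.cat(out, dim=1)
```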

Experimental Evaluation

The experimental evaluation was thorough and well-documented, elucidating the strengths and potential limitations of the approach.

Image Tokenizer Assessment:

  • The paper details extensive ablation studies on codebook design and on the number of tokens used to represent an image. For instance, reducing the codebook vector dimension from 256 to 8 consistently improved both reconstruction quality and codebook usage, highlighting the impact of codebook design on performance (a small snippet below shows how usage can be measured).
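
Codebook usage here means the fraction of codebook entries selected at least once when tokenizing an evaluation set. A minimal way to measure it (illustrative, not the paper's evaluation script):

```python
import torch

def codebook_usage(indices: torch.Tensor, codebook_size: int) -> float:
    # Fraction of distinct codes selected at least once across a dataset.
    return torch.unique(indices).numel() / codebook_size

# e.g., with token ids gathered over a validation split:
# usage = codebook_usage(all_indices, 16384)  # ~0.97 for the reported tokenizer
```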

Class-conditional Image Generation:

  • The scalability of model sizes was explored, showing consistent improvements in FID scores when scaling models up to 3.1B parameters.
  • The role of classifier-free guidance (CFG) in enhancing visual quality was analyzed, identifying a CFG scale of 2.0 as the best trade-off between diversity and fidelity (see the sketch after this list).
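
For autoregressive models, classifier-free guidance is commonly applied at the logit level at every decoding step: the sequence is processed once with the condition and once with a null condition, and the two predictions are combined. The sketch below shows this common formulation, which is consistent with the paper's description; treat the exact form as an assumption.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               scale: float = 2.0) -> torch.Tensor:
    # scale = 1.0 recovers the purely conditional model; larger scales push
    # samples toward the condition, trading diversity for fidelity.
    return uncond_logits + scale * (cond_logits - uncond_logits)

# At each decoding step: run the model with the class/text embedding and with
# a learned null embedding, combine the logits, then sample as usual.
```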

Text-conditional Image Generation:

  • The 775M-parameter text-conditional model, trained first on a LAION-COCO subset and then fine-tuned on high-aesthetic-quality images, demonstrated competitive visual quality and text alignment, consistent with the contribution noted above.

Implications and Future Prospects

The findings have significant implications for the community. By demonstrating that vanilla autoregressive models can not only serve as a basis for advanced image generation systems but also meet or surpass the performance of diffusion models, the research sets a precedent for revisiting older architectures such as autoregressive models under modern scaling practices.

Practical Implications:

  • The open-source release of the models and code fosters further research and development in visual generation and multimodal foundation models, potentially accelerating advancements in these areas.

Theoretical Implications:

  • The success in reducing inductive biases while achieving state-of-the-art performance suggests a potential shift in the paradigm for future research on unified models combining language and vision tasks.
  • The paper opens avenues for leveraging language model techniques in image generation, encouraging investigations into more sophisticated image tokenizers and larger training datasets to scale models beyond current limitations.

Conclusion

This research highlights the dormant potential of autoregressive models and presents a methodical approach to scaling, optimizing, and evaluating them for robust image generation. Though the initial results are promising, the paper underscores the need for larger datasets and computational resources to push the boundaries further. The work is a significant step toward unifying language and vision under a single modeling paradigm, paving the way for more versatile and scalable AI models. Its impact on both practical applications and theoretical exploration in AI research is likely to be substantial.
