
Zero-Shot Text-to-Image Generation (2102.12092v2)

Published 24 Feb 2021 in cs.CV and cs.LG

Abstract: Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

Citations (4,186)

Summary

  • The paper demonstrates a zero-shot approach where a 12B parameter autoregressive transformer generates images from text without task-specific training.
  • It employs a discrete VAE to efficiently encode 256x256 images into a 32x32 grid, reducing computational load while preserving visual quality.
  • Experiments reveal superior human-evaluated realism on MS-COCO with competitive FID scores, highlighting strong visual and semantic generalization.

Zero-Shot Text-to-Image Generation

The paper "Zero-Shot Text-to-Image Generation" introduces an approach based on a transformer model for generating images from text descriptions without the need for task-specific training data. The focus is on demonstrating competitive performance with existing models specifically trained on particular datasets, accomplishing tasks in a zero-shot manner using a large autoregressive transformer.

Methodology

Autoregressive Transformer Architecture

The approach uses a transformer to autoregressively model text and image tokens as a single stream of data. The core architecture is a 12-billion parameter transformer trained on a massive dataset of 250 million image-text pairs collected from the internet.
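The sketch below illustrates this single-stream setup in PyTorch. The sequence lengths (256 text tokens, 32×32 = 1024 image tokens) and vocabulary sizes (16,384 BPE codes, 8192 image codes) follow the paper; everything else, including the model dimensions, class names, and the offset trick for sharing one embedding table, is an illustrative assumption rather than the paper's implementation.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192
TEXT_LEN, IMAGE_LEN = 256, 1024  # 1024 = 32 x 32 image-token grid

class TinyAutoregressiveLM(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        # One embedding table for both modalities: image ids are offset by
        # TEXT_VOCAB so text and image tokens never collide.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        # tokens: (batch, seq), text ids first, then offset image ids.
        seq = tokens.size(1)
        pos = torch.arange(seq, device=tokens.device)
        x = self.embed(tokens) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        return self.head(self.blocks(x, mask=causal))

model = TinyAutoregressiveLM()
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
image = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN)) + TEXT_VOCAB
stream = torch.cat([text, image], dim=1)   # one combined token stream
logits = model(stream[:, :-1])             # next-token prediction
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), stream[:, 1:].reshape(-1)
)
```

Placing the caption tokens before the image tokens means the causal mask lets every image token attend to the full caption; at generation time the text positions are fixed and image tokens are sampled one at a time.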

To handle high-resolution images efficiently, a discrete VAE (dVAE) is utilized. The dVAE compresses each 256×256 RGB image into a 32×32 grid of image tokens, significantly reducing the computational demands on the transformer without severe loss of visual quality (Figure 1).

Figure 1: Comparison of original images (top) and reconstructions from the discrete VAE (bottom). The encoder downsamples the spatial resolution by a factor of 8.
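As a rough illustration of such a tokenizer, the sketch below pairs a stride-8 convolutional encoder emitting 8192-way logits per spatial position with a gumbel-softmax relaxation, which the paper also uses to make the discrete choice differentiable. The layer sizes and class names here are assumptions; the paper's actual dVAE architecture differs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192  # size of the image-token vocabulary

class TinyDVAE(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Three stride-2 convolutions: 256x256 -> 32x32 (factor-8 downsample).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, VOCAB, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(VOCAB, hidden)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 3, 4, stride=2, padding=1),
        )

    def forward(self, images, tau=1.0):
        logits = self.encoder(images)  # (B, 8192, 32, 32)
        # Gumbel-softmax keeps the categorical choice differentiable.
        soft_onehot = F.gumbel_softmax(logits, tau=tau, dim=1)
        z = torch.einsum("bvhw,vd->bdhw", soft_onehot, self.codebook.weight)
        return self.decoder(z), logits

model = TinyDVAE()
images = torch.randn(2, 3, 256, 256)
recon, logits = model(images)
tokens = logits.argmax(dim=1)  # (2, 32, 32) grid of discrete image tokens
```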

Training Strategy

Training is conducted in two stages:

  1. Stage One: A discrete VAE encodes images into a latent space using a vocabulary of 8192 possible tokens.
  2. Stage Two: An autoregressive transformer is trained to model the joint distribution of text and image tokens.

Both stages can be framed as maximizing an evidence lower bound (ELB) on the joint likelihood over images, captions, and image tokens, aligning the training objective with the system's generative use and enabling the model to generalize to previously unseen prompts and compositions.
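Following the paper's notation, with $x$ the image, $y$ the caption, and $z$ the image tokens, the bound takes the form

$$
\ln p_{\theta,\psi}(x, y) \;\geq\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\left[\ln p_\theta(x \mid y, z) - \beta\, D_{\mathrm{KL}}\!\big(q_\phi(y, z \mid x)\,\big\|\,p_\psi(y, z)\big)\right],
$$

where $q_\phi$ is the dVAE encoder, $p_\theta$ the dVAE decoder, and $p_\psi$ the transformer prior. Stage one tightens the bound with respect to $\phi$ and $\theta$; stage two with respect to $\psi$.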

Data and Scale

The 250-million-pair training dataset far exceeds standard benchmarks such as MS-COCO in both size and variability; it incorporates Conceptual Captions and a filtered subset of YFCC100M. This scale provides the semantic and visual diversity needed to train a general-purpose model that can perform tasks zero-shot.

Experiments and Results

Image Quality and Human Evaluation

The system is evaluated on zero-shot text-to-image synthesis against prior models, demonstrating superior visual realism and diversity as judged by human evaluators. In head-to-head comparisons with DF-GAN on MS-COCO captions, human evaluators preferred the transformer's outputs 90% of the time (Figure 2).

Figure 2: Human evaluation of our model vs prior work (DF-GAN) on captions from MS-COCO.

Quantitative Metrics

While the model achieves competitive Fréchet Inception Distance (FID) scores on MS-COCO, it trails on more specialized datasets such as CUB (Figure 3). These results suggest constraints tied to the composition of the training data and point to fine-tuning as a possible avenue for improvement.

Figure 3: Quantitative results on MS-COCO and CUB.
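For context, FID measures the distance between Inception-feature statistics of real and generated image sets (lower is better):

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big),
$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception activations over real and generated images, respectively.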

Implementation Considerations

Computational Efficiency

Training such a large-scale model demands significant computational resources. The implementation employs per-resblock gradient scaling and parameter sharding to enable mixed-precision training, managing memory and communication overhead across distributed systems (Figure 4).

Figure 4: Communication patterns used for distributed training.
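A minimal sketch of the per-resblock gradient-scaling idea follows, assuming one fixed scale per residual block: forward values are unchanged, while gradients inside the block are amplified to stay above the fp16 underflow threshold and the factor is divided back out before the optimizer step. The block structure and scale value are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GradScale(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by `scale` in backward."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None

class ScaledResBlock(nn.Module):
    def __init__(self, dim, scale=2.0 ** 13):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        # Backward order is reversed: gradients entering the branch are first
        # multiplied by `scale` (keeping small fp16 values above the underflow
        # threshold inside the block), then by 1/scale on the way out, so the
        # gradient reaching `x` is unscaled.
        h = GradScale.apply(x, 1.0 / self.scale)
        h = self.body(h)
        h = GradScale.apply(h, self.scale)
        return x + h

block = ScaledResBlock(dim=64)
out = block(torch.randn(8, 64, requires_grad=True))
out.sum().backward()
# Parameter gradients inside the block carry the factor `scale`; divide it
# back out before the optimizer step, as in ordinary loss scaling.
for p in block.body.parameters():
    p.grad /= block.scale
```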

Technical Challenges

Training a 12-billion parameter transformer in 16-bit precision raised technical challenges, most notably underflow in gradient computations, which the per-resblock gradient scaling described above addresses. To reduce communication costs across machines, gradients are compressed with PowerSGD, a low-rank compression scheme, without significantly sacrificing model performance. The robustness of the implementation to these issues can inform future large-scale model deployments.
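Below is a minimal single-process sketch of the PowerSGD idea: approximate a 2-D gradient $G$ as a rank-$r$ product $PQ^\top$ via one step of power iteration, so only the small factors need to be communicated. The rank, sizes, and function name are illustrative; the full algorithm also keeps an error-feedback buffer, and PyTorch ships a ready-made PowerSGD DDP communication hook.

```python
import torch

def compress_decompress(grad: torch.Tensor, q: torch.Tensor):
    """One PowerSGD round for a 2-D gradient. Returns the rank-r
    approximation and the updated Q (reused as a warm start next step)."""
    p = grad @ q               # (m, r): project onto the current basis
    p, _ = torch.linalg.qr(p)  # orthogonalize P
    q_new = grad.t() @ p       # (n, r): refine the basis
    approx = p @ q_new.t()     # rank-r approximation of grad
    return approx, q_new

m, n, rank = 512, 512, 4
grad = torch.randn(m, n)
q = torch.randn(n, rank)  # warm-start basis
approx, q = compress_decompress(grad, q)
# Communicating P (m x r) and Q (n x r) instead of G (m x n) cuts traffic by
# roughly m*n / (r*(m+n)); here 512*512 / (4*1024) = 64x.
print(torch.linalg.norm(grad - approx) / torch.linalg.norm(grad))
```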

Conclusion

The paper illustrates how scaling both model size and data, combined with advanced tokenization and training strategies, can achieve zero-shot capabilities in text-to-image generation. The model's adaptability to perform new tasks without additional specific training paves the way for further research into foundational models that can generalize across a wide range of tasks and domains. Future directions could involve fine-tuning for better task-specific performance or refining data handling techniques to further boost the model's zero-shot generalization efficacy.
