Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (2205.11487v1)

Published 23 May 2022 in cs.CV and cs.LG

Abstract: We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.

Citations (4,904)

Summary

The paper demonstrates that large frozen language models, like T5-XXL, improve image fidelity and image-text alignment when used as text encoders.
It introduces dynamic thresholding and an Efficient U-Net architecture to improve the diffusion process, and reports a zero-shot FID score of 7.27 on COCO.
The authors introduce DrawBench, a prompt benchmark that probes semantic properties of text-to-image models; human raters prefer Imagen's outputs over those of previous models.

Overview

The paper introduces "Imagen," a text-to-image diffusion model that excels in photorealism and language comprehension by leveraging large transformer language models. Imagen uses a frozen T5-XXL encoder to embed the text input and a cascade of diffusion models to generate high-fidelity images in stages: a base model generates a $64\times64$ image, which super-resolution diffusion models then upsample to $256\times256$ and $1024\times1024$. The paper highlights that increasing the size of the language model improves both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen sets a new state-of-the-art FID score of 7.27 on the COCO dataset without training on COCO, and human raters regard Imagen's samples as on par with COCO images in image-text alignment.
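The staged generation described above can be written down as a small data structure. This is only an illustrative sketch: the `Stage` class, the stage names, and the `upsampling_factors` helper are hypothetical and are not taken from the paper; only the three resolutions come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Stage:
    """One model in the cascade (hypothetical helper type)."""
    name: str
    in_res: Optional[int]  # None for the base model, which starts from noise
    out_res: int


# Resolutions as stated in the text; the names are illustrative labels.
IMAGEN_CASCADE = (
    Stage("base diffusion model", None, 64),
    Stage("super-resolution model 1", 64, 256),
    Stage("super-resolution model 2", 256, 1024),
)


def upsampling_factors(stages: Sequence[Stage]) -> List[int]:
    """Check that each stage consumes the previous stage's output
    and return the per-stage upsampling factor."""
    factors = []
    prev_out = None
    for stage in stages:
        if stage.in_res is None:
            if prev_out is not None:
                raise ValueError("only the first stage may lack an input image")
        else:
            if stage.in_res != prev_out:
                raise ValueError(f"{stage.name}: input does not match previous output")
            factors.append(stage.out_res // stage.in_res)
        prev_out = stage.out_res
    return factors


print(upsampling_factors(IMAGEN_CASCADE))  # [4, 4]
```

Each super-resolution stage therefore upsamples by a factor of 4 per side, or 16 times as many pixels.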

Key Contributions

Discovery of Effective Text Encoders: The authors demonstrate that large frozen LLMs, such as T5-XXL, are surprisingly effective text encoders for text-to-image synthesis. This finding emphasizes the advantage of scaling LLMs to improve the quality and alignment of generated images.
State-of-the-Art Performance: Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset. This is significant as it surpasses prior models including GLIDE and the concurrent DALL-E 2, even though Imagen was not trained on the COCO dataset.
Impact of Dynamic Thresholding: Dynamic thresholding during sampling allows the use of high guidance weights without degrading sample quality. This technique results in more photorealistic and detailed images, addressing the common issue of image oversaturation in models using high guidance weights.
Design of Efficient U-Net: The paper introduces the Efficient U-Net architecture for the super-resolution diffusion models, improving memory efficiency, convergence speed, and inference time. This architecture shifts parameters to the lower-resolution blocks and reverses the order of the downsampling and upsampling operations to speed up the forward pass.
Introduction of DrawBench: To evaluate text-to-image models comprehensively, the authors introduce DrawBench, a structured suite of prompts designed to probe various semantic properties such as compositionality, cardinality, and complex scene generation. According to human evaluations, Imagen outperforms other recent methods by a significant margin.
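To make the dynamic thresholding contribution concrete, here is a minimal dependency-free sketch of classifier-free guidance and of static versus dynamic thresholding. It makes several assumptions: images are flat lists of floats with a valid pixel range of [-1, 1], the percentile `p` and the guidance weight `w` are free parameters with no values taken from the paper, and all function names are hypothetical. A real implementation would operate on batched tensors inside the sampling loop.

```python
import math
from typing import List, Sequence


def guided_prediction(cond: Sequence[float], uncond: Sequence[float], w: float) -> List[float]:
    """Classifier-free guidance: uncond + w * (cond - uncond), elementwise.
    w = 1 returns the conditional prediction; w > 1 strengthens conditioning."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


def percentile_abs(values: Sequence[float], p: float) -> float:
    """p-th percentile (0 to 100) of |values|, with linear interpolation."""
    mags = sorted(abs(v) for v in values)
    if not mags:
        raise ValueError("values must be non-empty")
    rank = (p / 100.0) * (len(mags) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return mags[lo] + (rank - lo) * (mags[hi] - mags[lo])


def static_threshold(x0: Sequence[float]) -> List[float]:
    """Clip the predicted clean image to [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x0]


def dynamic_threshold(x0: Sequence[float], p: float) -> List[float]:
    """Clip to [-s, s], where s is the p-th percentile of |x0| (at least 1),
    then divide by s so the result lies in [-1, 1]."""
    s = max(percentile_abs(x0, p), 1.0)
    return [max(-s, min(s, v)) / s for v in x0]


# A predicted image pushed out of range, as can happen with a large guidance weight.
x0_pred = [0.2, -0.5, 3.0, -4.0]
print(static_threshold(x0_pred))
print(dynamic_threshold(x0_pred, p=100.0))
```

Static clipping maps both 3.0 and -4.0 onto the boundary values, which is the saturation effect described above. Dynamic thresholding rescales by `s` instead, so out-of-range pixels keep their relative ordering, and it reduces to static clipping whenever the percentile is at most 1.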

Results and Analysis

Performance on COCO Dataset

Imagen's zero-shot FID-30K score of 7.27 on COCO is substantially better (lower) than that of previous models such as GLIDE (12.24) and DALL-E 2 (10.39).
Human evaluations rate Imagen's samples highly on both fidelity and alignment with the text descriptions: they score 91.4% on image-text alignment, close to the 91.9% measured for the original COCO reference images.
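For readers unfamiliar with the metric, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The summary does not restate the formula, so the symbols below are assumptions following the standard definition: $\mu_r, \Sigma_r$ are the feature mean and covariance of the reference images, and $\mu_g, \Sigma_g$ those of the generated images.

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

Lower values indicate that the generated feature distribution is closer to the reference one, which is why 7.27 improves on 10.39.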

Evaluation with DrawBench

DrawBench evaluations show that human raters significantly prefer Imagen's outputs over those of other models like GLIDE and DALL-E 2.
Imagen demonstrated robustness across various categories such as colors, spatial relations, and handling complex and creative prompts.

Implications and Future Developments

The findings in this paper have several practical and theoretical implications. The effectiveness of large frozen language models as text encoders suggests that future research in text-to-image synthesis should focus on leveraging, and possibly further scaling, these models. Dynamic thresholding makes high guidance weights usable without the oversaturation they otherwise cause, which allows stronger text conditioning without sacrificing photorealism. The Efficient U-Net architecture shows the value of optimized model architectures that deliver better performance at reduced computational cost.

Looking ahead, the techniques and findings from Imagen could be extended to other domains such as video generation, multimodal understanding, and interactive AI systems. Further investigation of the ethical implications and the biases in training data, as mentioned in the paper, is critical to the responsible deployment of such generative technologies. Addressing these concerns will be crucial for integrating models like Imagen into user-facing applications.

Conclusions

The paper provides a comprehensive approach to enhancing text-to-image synthesis using diffusion models and LLMs. It sets a new benchmark in the field with significant improvements in sample fidelity and alignment. The innovations introduced, such as dynamic thresholding and Efficient U-Net, along with the rigorous evaluation via DrawBench, contribute valuable insights and methodologies that can drive future research and applications in AI-driven image generation.