Scaling up GANs for Text-to-Image Synthesis

Published 9 Mar 2023 in cs.CV, cs.GR, and cs.LG | (2303.05511v2)

Abstract: The recent success of text-to-image synthesis has taken the world by storm and captured the general public's imagination. From a technical standpoint, it also marked a drastic change in the favored architecture to design generative image models. GANs used to be the de facto choice, with techniques like StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new standard for large-scale generative models overnight. This rapid shift raises a fundamental question: can we scale up GANs to benefit from large datasets like LAION? We find that naïvely increasing the capacity of the StyleGAN architecture quickly becomes unstable. We introduce GigaGAN, a new GAN architecture that far exceeds this limit, demonstrating GANs as a viable option for text-to-image synthesis. GigaGAN offers three major advantages. First, it is orders of magnitude faster at inference time, taking only 0.13 seconds to synthesize a 512px image. Second, it can synthesize high-resolution images, for example, 16-megapixel images in 3.66 seconds. Finally, GigaGAN supports various latent space editing applications such as latent interpolation, style mixing, and vector arithmetic operations.

Summary

The paper introduces GigaGAN, a scalable GAN that advances text-to-image synthesis by reducing inference times compared to diffusion models.
It employs a sample-adaptive convolution with interleaved self- and cross-attention to enhance image-text alignment and capture fine details.
GigaGAN generates 512px images in 0.13 seconds and 16-megapixel images in under 4 seconds, and achieves a zero-shot FID of 9.09 on the COCO2014 dataset.

The paper "Scaling up GANs for Text-to-Image Synthesis" sets out to reestablish Generative Adversarial Networks (GANs) as a competitive alternative to the diffusion and autoregressive models that currently dominate high-quality text-to-image synthesis. Its new architecture, GigaGAN, tests how far GANs can scale against these newer model families, with a focus on computational efficiency, scalability, and latent space manipulation.

The headline result is speed: GigaGAN synthesizes a high-quality 512px image in 0.13 seconds, far faster than diffusion models, which typically need many iterative denoising steps per image. That speed makes it well suited to interactive applications. GigaGAN also generates 16-megapixel images in 3.66 seconds, which makes it a practical option for applications that need high-resolution output.
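To put the two reported timings on a common scale, the following minimal sketch converts them to pixels per second. It assumes that "512px" means a 512×512 image and that "16-megapixel" means 4096×4096; both sizes are assumptions made for the arithmetic, and `pixels_per_second` is a hypothetical helper, not something from the paper.

```python
def pixels_per_second(width, height, seconds):
    """Return raw pixel throughput for one generated image."""
    return (width * height) / seconds

# Assumed image sizes; only the timings (0.13 s and 3.66 s) come from the paper.
base_rate = pixels_per_second(512, 512, 0.13)
highres_rate = pixels_per_second(4096, 4096, 3.66)

print(f"512px image:  {base_rate / 1e6:.2f} megapixels/s")
print(f"16 MP image:  {highres_rate / 1e6:.2f} megapixels/s")
```

Under these assumptions the high-resolution path delivers more pixels per second than the base path, which is consistent with upsampling being cheaper per pixel than base generation, though the summary does not break the timings down further.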

The GigaGAN architecture addresses several obstacles to scaling traditional GANs. It uses sample-adaptive convolution filters, which adjust the convolution kernels dynamically based on the input conditioning and so give the generator the capacity to handle the diversity of large-scale image datasets such as LAION2B-en. Self-attention and cross-attention layers, interleaved with the convolutional layers, further help the generator synthesize context-aware images with fine detail and strong image-text alignment.
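One way to picture sample-adaptive filtering is as a soft selection over a bank of candidate kernels, with the selection weights predicted from the conditioning vector. The sketch below is a simplified illustration under that assumption: the names `select_kernel`, `proj_w`, and `proj_b` are hypothetical, kernels are flattened to plain lists, and the convolution itself and any further weight modulation are omitted. It is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_kernel(bank, cond, proj_w, proj_b):
    """Blend a bank of kernels using weights predicted from `cond`.

    bank   : list of N kernels, each a flat list of floats (same length)
    cond   : conditioning vector for this sample (list of floats)
    proj_w : N rows, each the same length as `cond` (hypothetical affine layer)
    proj_b : N biases
    """
    # One logit per kernel in the bank, from an affine map of the conditioning.
    logits = [sum(w * c for w, c in zip(row, cond)) + b
              for row, b in zip(proj_w, proj_b)]
    weights = softmax(logits)
    # Weighted sum of the bank: a different effective kernel for each sample.
    size = len(bank[0])
    kernel = [sum(wt * k[j] for wt, k in zip(weights, bank))
              for j in range(size)]
    return kernel, weights

# Two toy 1x2 "kernels" and a 2-dimensional conditioning vector.
bank = [[1.0, 0.0], [0.0, 1.0]]
kernel, weights = select_kernel(bank, cond=[0.3, -0.2],
                                proj_w=[[0.0, 0.0], [0.0, 0.0]],
                                proj_b=[0.0, 0.0])
```

With a zero projection the weights are uniform and the result is the average of the bank; a trained projection would instead favor different kernels for different prompts, which is the sense in which the filter adapts per sample.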

For comparison, the paper evaluates GigaGAN against state-of-the-art diffusion models such as DALL·E 2 and Stable Diffusion. GigaGAN reaches a zero-shot Fréchet Inception Distance (FID) of 9.09 on the COCO2014 dataset, lower than those models; a lower FID means the feature statistics of generated images are closer to those of real images. Diffusion models remain stronger in flexibility and alignment with text inputs, but this result suggests that GANs can reach comparable image quality at considerably lower computational cost.
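For reference, the standard definition of FID, which is not restated in the summary above, compares Gaussian fits to Inception features of real and generated images. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance for real and generated images; these symbols are introduced only for this formula.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The score is zero when the two feature distributions have identical means and covariances, which is why lower values are read as better.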

Beyond image synthesis itself, GigaGAN retains the latent space editing capabilities that are characteristic of GANs. The model supports latent interpolation, style mixing, and prompt-based manipulation through text embeddings, which gives users direct control over the style and content of synthesized images.
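The two simplest of these editing operations can be sketched on plain vectors. This is a generic illustration of latent interpolation and StyleGAN-style layer-wise style mixing, not GigaGAN's actual interface: `lerp`, `style_mix`, and the `crossover` parameter are hypothetical names, and the generator that would consume these codes is left out.

```python
def lerp(z_a, z_b, t):
    """Linearly interpolate between two latent codes, t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

def style_mix(w_a, w_b, num_layers, crossover):
    """Build a per-layer list of style codes.

    Layers before `crossover` (coarse layers) use w_a;
    the remaining (finer) layers use w_b.
    """
    return [w_a if layer < crossover else w_b for layer in range(num_layers)]

z_a = [0.0, 0.0]
z_b = [2.0, 4.0]

# Walking t from 0 to 1 gives a smooth path between the two codes.
path = [lerp(z_a, z_b, t) for t in (0.0, 0.5, 1.0)]

# Coarse structure from code A, fine detail from code B.
per_layer_styles = style_mix(z_a, z_b, num_layers=4, crossover=2)
```

Feeding each code along `path` to a generator would produce a gradual transition between two images, while `per_layer_styles` would combine the layout of one sample with the fine appearance of another.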

The paper also explores GigaGAN for super-resolution. Using a GAN-based upsampler, it competes with supervised counterparts such as Real-ESRGAN and produces high-resolution images that go well beyond what traditional pixel-based methods achieve.

The implications of this research are both theoretical and practical: it prompts a reevaluation of GANs for large-scale text-to-image tasks that have been dominated by more computationally intensive models. The architectural changes address earlier stability and scaling problems and open the way for further work on larger architectures, larger datasets, and improved training protocols. Now that GigaGAN has shown that GANs can scale effectively, researchers have renewed incentive to invest in and optimize GAN-based methods for generative modeling.

In conclusion, GigaGAN is a methodological advance for GANs in large-scale, high-quality text-to-image synthesis. The model scales up effectively and brings back the speed and style manipulation benefits inherent in GAN architectures. Future research can build on these findings to explore broader datasets, higher-resolution synthesis, and more efficient architectures, and may further narrow the gap between GANs and today's prevailing diffusion and autoregressive models.
