Distilling Diffusion Models into Conditional GANs

(2405.05967)
Published May 9, 2024 in cs.CV, cs.GR, and cs.LG

Abstract

We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models -- DMD, SDXL-Turbo, and SDXL-Lightning -- on the zero-shot COCO benchmark.

Diffusion2GAN uses different random seeds to produce a variety of images matching specific prompts.

Overview

  • The paper presents a method to distill complex diffusion models into more efficient Conditional Generative Adversarial Networks (Conditional GANs), aiming to preserve image quality while significantly accelerating the image generation process.

  • New techniques, including a perceptual loss called E-LatentLPIPS that operates in the diffusion model's latent space, enable efficient training of the distilled model, which the authors name Diffusion2GAN.

  • Diffusion2GAN delivers large speedups while preserving image detail and diversity, achieves strong results on benchmarks such as zero-shot COCO, and could potentially extend to other domains such as video and 3D content creation.

Simplifying AI: Distilling Diffusion Models into Conditional GANs

Overview of Distillation Process

Diffusion models are well-regarded for their ability to create high-quality images, but they often suffer from slow generation times due to their multi-step sampling process. This paper introduces a technique that distills a complex diffusion model into a more streamlined Conditional Generative Adversarial Network (Conditional GAN), aiming not only to maintain image quality but also to significantly speed up image generation.

Key Concepts and Methods

Distillation as Image-to-Image Translation

  • The process begins by treating diffusion distillation as a paired image-to-image translation task, where each input noise is paired with the teacher's corresponding output along its ODE trajectory.
  • This approach leverages the strengths of both model types: the diffusion model's ability to establish high-quality noise-to-image correspondences and the GAN's fast, single-step generation (see the sketch below).
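
To make the pairing concrete, here is a minimal PyTorch-style sketch under assumed interfaces: a frozen teacher's deterministic ODE sampler turns a fixed noise latent into a target latent, and the one-step student is regressed onto that target for the same noise and prompt. The names teacher_ode_sample, student, and e_latent_lpips are hypothetical placeholders rather than the paper's actual API; e_latent_lpips stands in for the latent-space perceptual loss described in the next subsection.

```python
import torch

@torch.no_grad()
def make_pair(teacher_ode_sample, prompt_emb, latent_shape, device="cuda"):
    """Build one (noise, target-latent) pair from the teacher's ODE trajectory.
    teacher_ode_sample is a placeholder for the frozen teacher's deterministic
    multi-step sampler (e.g., a DDIM/ODE solver)."""
    z = torch.randn(latent_shape, device=device)   # input noise latent
    x_target = teacher_ode_sample(z, prompt_emb)   # teacher's clean output latent
    return z, x_target

def regression_step(student, e_latent_lpips, z, prompt_emb, x_target):
    """One distillation step: the one-step generator maps noise to a latent in a
    single forward pass and is pulled toward the teacher's paired output."""
    x_pred = student(z, prompt_emb)                # single-step generation
    loss = e_latent_lpips(x_pred, x_target)        # latent-space perceptual loss
    loss.backward()
    return loss
```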

Innovation with E-LatentLPIPS

  • A key contribution is E-LatentLPIPS, a perceptual loss that operates directly in the diffusion model's latent space, avoiding the cost of decoding latents to pixels before computing the loss.
  • By ensembling random augmentations (the "E" in E-LatentLPIPS), the loss converges more efficiently, so the distillation trains effectively at lower computational cost than pixel-space perceptual losses (a minimal sketch follows below).
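
As a rough illustration of the ensembling idea, the sketch below applies the same random, differentiable augmentation to both the student's latent and the teacher's target latent, evaluates a latent-space LPIPS-style metric on each augmented pair, and averages the results. latent_lpips and paired_augment are assumed placeholders; the paper trains its own LatentLPIPS feature network on latents, which is not reproduced here.

```python
def make_e_latent_lpips(latent_lpips, paired_augment, n_aug=4):
    """Build an ensembled LatentLPIPS loss (hedged sketch).
    latent_lpips:   an LPIPS-style metric whose feature network runs on
                    diffusion latents (placeholder for the paper's own network).
    paired_augment: a differentiable augmentation that applies the SAME random
                    transform (crop, flip, cutout, ...) to both latents."""
    def loss_fn(x_latent, y_latent):
        total = 0.0
        for _ in range(n_aug):
            xa, ya = paired_augment(x_latent, y_latent)   # shared random parameters
            total = total + latent_lpips(xa, ya).mean()   # perceptual distance on latents
        return total / n_aug
    return loss_fn
```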

A Two-Pronged Approach with GANs

  • Beyond the paired regression loss, a conditional GAN objective further refines the distillation: the teacher diffusion model is adapted into a multi-scale discriminator.
  • This discriminator, combined with a text-alignment loss, improves both image fidelity and text-image correspondence, which is essential for text-to-image synthesis (a sketch of the combined objective follows below).
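
The sketch below combines the two terms into a single generator objective, assuming a discriminator that takes the generated latent and the text condition and returns multi-scale logits. The non-saturating GAN loss and the lambda_gan weight are illustrative choices, not necessarily the paper's exact formulation, and the discriminator-side text-alignment loss is not shown.

```python
import torch.nn.functional as F

def generator_loss(student, discriminator, e_latent_lpips,
                   z, prompt_emb, x_target, lambda_gan=0.5):
    """Hedged sketch of the combined objective: paired regression toward the
    teacher's output plus a text-conditioned adversarial term.
    lambda_gan is an illustrative weight, not the paper's reported value."""
    x_pred = student(z, prompt_emb)                    # one-step generation
    rec = e_latent_lpips(x_pred, x_target)             # E-LatentLPIPS regression loss
    logits = discriminator(x_pred, prompt_emb)         # multi-scale, text-conditioned logits
    adv = sum(F.softplus(-l).mean() for l in logits)   # non-saturating GAN loss
    return rec + lambda_gan * adv
```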

Results and Implications

The proposed model, termed Diffusion2GAN, outperforms several state-of-the-art one-step distilled diffusion models in both image quality and generation speed. It cuts generation time from several seconds down to approximately 0.09 seconds per image while preserving the fine details and diversity of the generated images.

  • Quantitative Success: On the zero-shot COCO benchmark, Diffusion2GAN achieves better FID and CLIP scores than other advanced one-step models, indicating superior image quality and text relevance.
  • Quality and Performance: The model not only matches but in some cases surpasses the image quality of the original multi-step diffusion models, striking a balance between speed and quality that could enable real-time image generation applications.

Future Potential

Considering the impressive results achieved, the technique opens up exciting possibilities:

  • Expansion to Other Areas: Beyond image generation, this technique could lead the way in video and 3D content creation, where speed and computational efficiency are crucial.
  • Further Optimization: With ongoing advancements in GAN architectures and training methodologies, future iterations could see even greater efficiency and quality improvements.

Conclusion

This study offers a substantial step forward in efficient image synthesis, providing a blueprint for integrating the quality of diffusion models with the speed of GANs through innovative distillation methods. As AI continues to evolve, such fusion techniques will likely become fundamental in overcoming the challenges of model efficiency and quality in generative tasks.
