Abstract

Diffusion models are the main driver of progress in image and video synthesis, but they suffer from slow inference. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-step to single-step inference, albeit at the cost of expensive and difficult optimization due to ADD's reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect-ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.

Overview

  • Diffusion models have become a pivotal technology in image and video synthesis, offering an alternative to GANs, but they are hindered by a slow, iterative inference process.

  • Adversarial Diffusion Distillation (ADD) aimed to accelerate diffusion models but was limited by expensive optimization and low-resolution constraints.

  • Latent Adversarial Diffusion Distillation (LADD) overcomes these limitations by working in the latent space, enhancing efficiency, and supporting high-resolution, diverse image synthesis.

  • LADD has been applied to Stable Diffusion 3 to create SD3-Turbo, achieving high-quality image generation from text prompts in fewer steps, and setting a benchmark for future research and practical applications in real-time image synthesis.

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Introduction

The advent of diffusion models has marked a significant advancement in image and video synthesis, offering an alternative to GANs for generating realistic and diverse samples. However, these models are not without drawbacks, the most notable being their need for multiple network evaluations during inference, which considerably slows down generation. This limitation obstructs real-time applications and has spurred research into methods for accelerating diffusion models. Among these, adversarial diffusion distillation (ADD) emerged as a promising approach for single-step image synthesis, but it faced obstacles related to expensive optimization, pixel-based operations, and a restricted discriminator training resolution.

Advancements in Diffusion Distillation

Enter Latent Adversarial Diffusion Distillation (LADD), a novel methodology that addresses the shortcomings of ADD by employing latent space distillation. Unlike its predecessor, which relied on pixel-based operations, LADD operates within a model's latent space. This adjustment not only simplifies the training setup but also extends the capability of the distillation process to accommodate high-resolution and multi-aspect ratio image synthesis.
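To make the efficiency argument concrete, here is a minimal back-of-the-envelope sketch of the dimensionality involved. The VAE configuration (8× spatial downsampling, 16 latent channels, as commonly cited for SD3) is an assumption for illustration, not a figure stated in this summary.

```python
# Rough element-count comparison: operating in latent space vs. pixel space.
# Assumed setup: a 1024x1024 RGB image and a VAE with 8x spatial
# downsampling and 16 latent channels (illustrative SD3-like values).
def pixel_elements(h, w, channels=3):
    # Number of scalar values in the raw RGB image.
    return h * w * channels

def latent_elements(h, w, downsample=8, channels=16):
    # Number of scalar values in the corresponding latent tensor.
    return (h // downsample) * (w // downsample) * channels

pixels = pixel_elements(1024, 1024)    # 3,145,728 values
latents = latent_elements(1024, 1024)  # 262,144 values
print(pixels // latents)               # prints 12: latents are 12x smaller here
```

Under these assumptions the discriminator never touches the full-resolution pixel tensor, which is where the resource savings over pixel-based ADD come from.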

LADD employs a two-pronged approach: unifying the discriminator and teacher model roles and utilizing synthetic data for training. This strategy results in several benefits:

  • Efficiency & Simplification: By bypassing the need for pixel space decoding, LADD introduces a more resource-efficient approach that simplifies the overall system architecture.
  • Control Over Discriminator Features: It offers a natural way to adjust the feedback provided by the discriminator, influencing whether more global or local image features are emphasized during training.
  • Improved Performance: LADD demonstrates superior performance to ADD and other single-step approaches across various metrics and applications, from high-resolution image generation to tasks like image editing and inpainting.
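As an illustration of the adversarial side of this setup, the sketch below implements hinge losses of the kind used in adversarial distillation. This is a generic sketch, not LADD's exact objective as the summary does not specify it; the function names are illustrative. The discriminator scores latent features from teacher-derived samples ("real") and from the student's one-step predictions ("fake").

```python
import numpy as np

def discriminator_hinge_loss(real_logits, fake_logits):
    # Push logits on teacher-derived latents above +1 and
    # logits on student outputs below -1.
    real_term = np.mean(np.maximum(0.0, 1.0 - real_logits))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_logits))
    return real_term + fake_term

def student_adversarial_loss(fake_logits):
    # The student (generator) tries to raise the discriminator's
    # logits on its own one-step samples.
    return -np.mean(fake_logits)

# Toy check: a well-separated discriminator incurs zero hinge loss.
real = np.array([1.5, 2.0])    # logits on teacher latents
fake = np.array([-1.2, -3.0])  # logits on student latents
print(discriminator_hinge_loss(real, fake))  # prints 0.0
print(student_adversarial_loss(fake))        # prints 2.1
```

In LADD the key twist is where these logits come from: discriminator heads operate on the teacher model's own latent features, so no separate pretrained feature network (such as DINOv2 in ADD) is needed.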

Practical Applications and Results

The application of LADD to Stable Diffusion 3 (SD3), dubbed SD3-Turbo, encapsulates the method's potential. SD3-Turbo can match the image quality of its teacher model in merely four unguided sampling steps, showcasing the efficacy of LADD in generating high-resolution, multi-aspect-ratio images from text prompts. The paper also presents systematic studies of LADD's scaling behavior and its adaptability to various practical applications, confirming its versatility and effectiveness.

Future Implications and Research Directions

The development and implementation of LADD signify a substantial step forward in the distillation of diffusion models, enabling the generation of high-quality images in a fraction of the time previously required. This breakthrough could have notable implications in fields requiring rapid image synthesis, such as real-time image editing, video game development, and augmented reality applications.

Moreover, the success of LADD points toward fertile ground for future research, particularly in exploring the scalability of adversarial models within the constraints of current hardware and further refining the synthetic data generation process to enhance text-image alignment in generated outputs.

Conclusion

Latent Adversarial Diffusion Distillation represents a significant advancement in the field of image synthesis. By resolving key limitations associated with predecessor methods, LADD stands as a testament to the potential of leveraging latent spaces for efficient, high-quality image generation. As the community continues to build upon these findings, the horizon looks promising for the future development of faster, more versatile diffusion models capable of meeting the increasing demand for real-time, high-resolution image synthesis across various domains.
