
SDXL-Lightning: Progressive Adversarial Diffusion Distillation

(arXiv:2402.13929)
Published Feb 21, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.

Figure: Models of varying capacity learn multiple flows; lower-capacity student models yield blurrier results.

Overview

  • The paper introduces a distillation method named SDXL-Lightning that improves text-to-image generation speed and quality by combining progressive and adversarial techniques.

  • SDXL-Lightning utilizes progressive distillation to maintain the model's original behavior and adversarial distillation for high-quality image generation, surpassing previous methods.

  • The results show that SDXL-Lightning models achieve state-of-the-art efficiency and quality in one-step/few-step 1024px text-to-image generation, with significant improvements over existing distillation methods.

  • The research opens new pathways for optimizing generative models for few-step generation processes and indicates the potential for extending this method to other domains.

Progressive Adversarial Diffusion Distillation for Efficient Text-to-Image Generation

Introduction

Generative models, particularly diffusion models, have shown remarkable capabilities in domains such as text-to-image and text-to-video generation. However, their slow, iterative sampling process poses significant computational challenges. This paper introduces a distillation method that combines progressive and adversarial techniques to strike a balance between image quality and mode coverage in one-step or few-step generation. The proposed approach, termed SDXL-Lightning, reduces generation to as few as one step while maintaining, and in some cases surpassing, the quality of state-of-the-art multi-step models.

Theoretical Foundations and Methodology

At the core of our method lies the fusion of progressive and adversarial distillation strategies applied to diffusion models. Prior approaches to reducing inference steps either incur unacceptable quality loss or still require an impractically high number of steps to produce acceptable results. Our method, by contrast, leverages the strengths of both progressive and adversarial distillation to directly predict points farther along the generation flow, surpassing previous methods in producing high-quality images in fewer steps.
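To make "predicting farther along the flow" concrete, the sketch below shows the basic progressive-distillation recipe in which a student is trained to cover two teacher ODE steps with a single jump. This is a minimal illustration under assumed interfaces (a velocity-predicting `model(x, t, cond)` and Euler stepping), not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

def euler_step(model, x, t, t_next, cond):
    # One deterministic ODE (Euler) step with a velocity-predicting model.
    # `model(x, t, cond)` is a stand-in signature, not the actual SDXL interface.
    v = model(x, t, cond)
    return x + (t_next - t) * v

@torch.no_grad()
def two_step_teacher_target(teacher, x_t, t, dt, cond):
    # Progressive distillation: the target for the student's single jump from t
    # to t - 2*dt is the point the teacher reaches with two smaller steps.
    x_mid = euler_step(teacher, x_t, t, t - dt, cond)
    return euler_step(teacher, x_mid, t - dt, t - 2 * dt, cond)

def student_distillation_loss(student, teacher, x_t, t, dt, cond):
    # Plain MSE matching of the two-step teacher target. The paper augments or
    # replaces this objective with an adversarial loss, since MSE alone yields
    # blurry results at very few inference steps.
    target = two_step_teacher_target(teacher, x_t, t, dt, cond)
    pred = euler_step(student, x_t, t, t - 2 * dt, cond)
    return F.mse_loss(pred, target)
```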

  • Progressive Distillation: We detail how progressive distillation preserves the original ODE flow and mode coverage but, on its own, yields insufficiently sharp images at very few inference steps. Preserving the original model behavior keeps our distilled models compatible with existing LoRA modules and control plugins.
  • Adversarial Distillation: An adversarial loss applied at each distillation stage plays a crucial role in enhancing image quality. Instead of relying solely on mean squared error (MSE), which tends to produce blurry images, our method uses a pre-trained diffusion U-Net encoder as the discriminator backbone, operating entirely in latent space (a sketch of this discriminator design follows this list). This allows for efficient distillation at high resolution while providing flexibility to balance sample quality against mode coverage.
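The following sketch illustrates how a discriminator can be built on top of a pre-trained UNet encoder and trained with a non-saturating adversarial objective in latent space. It is a hedged approximation of the design described above: the `unet_encoder(latents, t, cond)` interface, the pooling and head choices, and the loss form are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDiscriminator(nn.Module):
    """Discriminator head on top of a pre-trained diffusion UNet encoder,
    operating entirely in latent space. The `unet_encoder(latents, t, cond)`
    call signature is an assumed stand-in, not the actual SDXL module API."""

    def __init__(self, unet_encoder, feat_dim):
        super().__init__()
        self.encoder = unet_encoder  # pre-trained UNet encoder backbone
        self.head = nn.Sequential(   # small classification head on pooled features
            nn.Linear(feat_dim, feat_dim),
            nn.SiLU(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, latents, t, cond):
        feats = self.encoder(latents, t, cond)  # bottleneck feature map (B, C, H, W)
        pooled = feats.mean(dim=(-2, -1))       # global average pool -> (B, C)
        return self.head(pooled)                # real/fake logit per sample


def adversarial_losses(disc, student_latents, teacher_latents, t, cond):
    """Non-saturating GAN losses; the paper balances the adversarial term against
    the distillation objective to trade off sharpness and mode coverage."""
    logits_fake = disc(student_latents, t, cond)
    logits_real = disc(teacher_latents.detach(), t, cond)
    g_loss = F.softplus(-logits_fake).mean()  # student (generator) term
    d_loss = (F.softplus(-logits_real).mean()
              + F.softplus(disc(student_latents.detach(), t, cond)).mean())
    return g_loss, d_loss
```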

Model Distillation and Results

Our distilled models, named SDXL-Lightning, deliver state-of-the-art efficiency and quality in text-to-image generation at 1024px resolution. The models, open-sourced as both LoRA and full UNet weights, show significant improvements over existing distillation methods (a minimal usage sketch follows the list below):

  • Efficiency and Quality: Our distillation procedure reduces the required inference steps to as few as one or two while achieving new state-of-the-art results in quality, as measured by established metrics such as Fréchet Inception Distance (FID) and CLIP score.
  • Discriminator Design and Training Techniques: The innovative discriminator design, leveraging the pre-trained diffusion model’s encoder, along with strategic training techniques, ensures stable training and high-quality image generation.
  • Adaptability and Compatibility: The distilled models remain compatible with existing LoRA modules and control plugins, making them easy to integrate into downstream applications and a useful basis for further research in generative AI.
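As a concrete illustration of the open-sourced checkpoints in use, the sketch below loads a Lightning LoRA into a standard SDXL pipeline with diffusers and samples in four steps. The Hugging Face repository path, checkpoint file name, and scheduler settings are assumptions based on the typical diffusers LoRA workflow and should be checked against the official release.

```python
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler
from huggingface_hub import hf_hub_download

# Repository and file name are assumed; verify against the official SDXL-Lightning release.
base = "stabilityai/stable-diffusion-xl-base-1.0"
repo = "ByteDance/SDXL-Lightning"
ckpt = "sdxl_lightning_4step_lora.safetensors"

# Load the SDXL base pipeline and fuse the distilled LoRA weights into its UNet.
pipe = StableDiffusionXLPipeline.from_pretrained(
    base, torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.load_lora_weights(hf_hub_download(repo, ckpt))
pipe.fuse_lora()

# Few-step sampling: trailing timestep spacing so the first step starts from pure noise,
# and no classifier-free guidance, as is typical for few-step distilled models.
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
image = pipe(
    "A cat wearing a space suit, studio lighting",
    num_inference_steps=4,
    guidance_scale=0,
).images[0]
image.save("sdxl_lightning_4step.png")
```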

Future Directions

While SDXL-Lightning sets a new benchmark in efficient text-to-image generation, future work will explore optimizing the architecture for few-step generation processes and extending the method’s applicability across different domains and modalities. The open sourcing of these distilled models is anticipated to catalyze further advancements in the field.

Concluding Remarks

The proposed progressive adversarial diffusion distillation method represents a significant step forward in efficient, high-quality text-to-image generation. By combining progressive and adversarial distillation with careful discriminator design and training techniques, the resulting SDXL-Lightning models strike a practical balance between quality, efficiency, and mode coverage, offering broad potential for real-world applications and further research.
