
SDXL-Lightning: Progressive Adversarial Diffusion Distillation

(arXiv:2402.13929)
Published Feb 21, 2024 in cs.CV, cs.AI, and cs.LG

Abstract

We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.

Figure: Models of varying capacity learn multiple flows; lower-capacity student models yield blurrier results.

Overview

  • The paper introduces a distillation method named SDXL-Lightning that improves text-to-image generation speed and quality by combining progressive and adversarial techniques.

  • SDXL-Lightning utilizes progressive distillation to maintain the model's original behavior and adversarial distillation for high-quality image generation, surpassing previous methods.

  • The results show that SDXL-Lightning models achieve state-of-the-art efficiency and quality in one-step/few-step 1024px text-to-image generation, with significant improvements over existing distillation methods.

  • The research opens new pathways for optimizing generative models for few-step generation processes and indicates the potential for extending this method to other domains.

Progressive Adversarial Diffusion Distillation for Efficient Text-to-Image Generation

Introduction

Generative models, particularly diffusion models, have shown remarkable capabilities in domains such as text-to-image and text-to-video generation. However, their slow, iterative sampling process poses significant computational challenges. This paper introduces a distillation method that combines progressive and adversarial techniques to strike a balance between image quality and mode coverage in one-step or few-step generation. The proposed approach, termed SDXL-Lightning, reduces generation to as few as one step while maintaining, and in some cases surpassing, the quality of state-of-the-art multi-step models.

Theoretical Foundations and Methodology

At the core of our method lies the fusion of progressive and adversarial distillation strategies applied to diffusion models. Prior approaches to reducing inference steps either incur unacceptable quality loss or still require an impractically high number of steps to produce acceptable results. Our method, by contrast, leverages the strengths of both progressive and adversarial distillation to directly predict points farther along the generation flow, surpassing previous methods in producing high-quality images in fewer steps.
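To make "predicting farther along the flow" concrete, the sketch below shows the basic progressive-distillation recipe in which a student is trained to cover two teacher ODE steps with a single jump. This is a minimal illustration under assumed interfaces (a velocity-predicting `model(x, t, cond)` and Euler stepping), not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

def euler_step(model, x, t, t_next, cond):
    # One deterministic ODE (Euler) step with a velocity-predicting model.
    # `model(x, t, cond)` is a stand-in signature, not the actual SDXL interface.
    v = model(x, t, cond)
    return x + (t_next - t) * v

@torch.no_grad()
def two_step_teacher_target(teacher, x_t, t, dt, cond):
    # Progressive distillation: the target for the student's single jump from t
    # to t - 2*dt is the point the teacher reaches with two smaller steps.
    x_mid = euler_step(teacher, x_t, t, t - dt, cond)
    return euler_step(teacher, x_mid, t - dt, t - 2 * dt, cond)

def student_distillation_loss(student, teacher, x_t, t, dt, cond):
    # Plain MSE matching of the two-step teacher target. The paper augments or
    # replaces this objective with an adversarial loss, since MSE alone yields
    # blurry results at very few inference steps.
    target = two_step_teacher_target(teacher, x_t, t, dt, cond)
    pred = euler_step(student, x_t, t, t - 2 * dt, cond)
    return F.mse_loss(pred, target)
```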

  • Progressive Distillation: We detail how progressive distillation preserves the original ODE flow and mode coverage but, on its own, yields insufficiently sharp images at very few inference steps. Preserving the original model behavior keeps our distilled models compatible with existing LoRA modules and control plugins.
  • Adversarial Distillation: An adversarial loss applied at each distillation stage plays a crucial role in enhancing image quality. Instead of relying solely on mean squared error (MSE), which tends to produce blurry images, our method uses a pre-trained diffusion U-Net encoder as the discriminator backbone, operating entirely in latent space (a sketch of this discriminator design follows this list). This allows for efficient distillation at high resolution while providing flexibility to balance sample quality against mode coverage.
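The following sketch illustrates how a discriminator can be built on top of a pre-trained UNet encoder and trained with a non-saturating adversarial objective in latent space. It is a hedged approximation of the design described above: the `unet_encoder(latents, t, cond)` interface, the pooling and head choices, and the loss form are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDiscriminator(nn.Module):
    """Discriminator head on top of a pre-trained diffusion UNet encoder,
    operating entirely in latent space. The `unet_encoder(latents, t, cond)`
    call signature is an assumed stand-in, not the actual SDXL module API."""

    def __init__(self, unet_encoder, feat_dim):
        super().__init__()
        self.encoder = unet_encoder  # pre-trained UNet encoder backbone
        self.head = nn.Sequential(   # small classification head on pooled features
            nn.Linear(feat_dim, feat_dim),
            nn.SiLU(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, latents, t, cond):
        feats = self.encoder(latents, t, cond)  # bottleneck feature map (B, C, H, W)
        pooled = feats.mean(dim=(-2, -1))       # global average pool -> (B, C)
        return self.head(pooled)                # real/fake logit per sample


def adversarial_losses(disc, student_latents, teacher_latents, t, cond):
    """Non-saturating GAN losses; the paper balances the adversarial term against
    the distillation objective to trade off sharpness and mode coverage."""
    logits_fake = disc(student_latents, t, cond)
    logits_real = disc(teacher_latents.detach(), t, cond)
    g_loss = F.softplus(-logits_fake).mean()  # student (generator) term
    d_loss = (F.softplus(-logits_real).mean()
              + F.softplus(disc(student_latents.detach(), t, cond)).mean())
    return g_loss, d_loss
```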

Model Distillation and Results

Our distilled models, named SDXL-Lightning, deliver state-of-the-art efficiency and quality in text-to-image generation at 1024px resolution. The models, open-sourced as both LoRA and full UNet weights, show significant improvements over existing distillation methods (a minimal usage sketch follows the list below):

  • Efficiency and Quality: Our distillation procedure reduces the required inference steps to as few as one or two while achieving new state-of-the-art results in quality, as measured by established metrics such as Fréchet Inception Distance (FID) and CLIP score.
  • Discriminator Design and Training Techniques: The innovative discriminator design, leveraging the pre-trained diffusion model’s encoder, along with strategic training techniques, ensures stable training and high-quality image generation.
  • Adaptability and Compatibility: The distilled models remain compatible with existing LoRA modules and control plugins, making them easy to integrate into downstream applications and a useful basis for further research in generative AI.
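As a concrete illustration of the open-sourced checkpoints in use, the sketch below loads a Lightning LoRA into a standard SDXL pipeline with diffusers and samples in four steps. The Hugging Face repository path, checkpoint file name, and scheduler settings are assumptions based on the typical diffusers LoRA workflow and should be checked against the official release.

```python
import torch
from diffusers import StableDiffusionXLPipeline, EulerDiscreteScheduler
from huggingface_hub import hf_hub_download

# Repository and file name are assumed; verify against the official SDXL-Lightning release.
base = "stabilityai/stable-diffusion-xl-base-1.0"
repo = "ByteDance/SDXL-Lightning"
ckpt = "sdxl_lightning_4step_lora.safetensors"

# Load the SDXL base pipeline and fuse the distilled LoRA weights into its UNet.
pipe = StableDiffusionXLPipeline.from_pretrained(
    base, torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pipe.load_lora_weights(hf_hub_download(repo, ckpt))
pipe.fuse_lora()

# Few-step sampling: trailing timestep spacing so the first step starts from pure noise,
# and no classifier-free guidance, as is typical for few-step distilled models.
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
image = pipe(
    "A cat wearing a space suit, studio lighting",
    num_inference_steps=4,
    guidance_scale=0,
).images[0]
image.save("sdxl_lightning_4step.png")
```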

Future Directions

While SDXL-Lightning sets a new benchmark in efficient text-to-image generation, future work will explore optimizing the architecture for few-step generation processes and extending the method’s applicability across different domains and modalities. The open sourcing of these distilled models is anticipated to catalyze further advancements in the field.

Concluding Remarks

The proposed progressive adversarial diffusion distillation method represents a significant step forward in efficient, high-quality text-to-image generation. By combining progressive and adversarial distillation with careful discriminator design and training techniques, the resulting SDXL-Lightning models strike a practical balance between quality, efficiency, and mode coverage, offering broad potential for real-world applications and further research.
