
Abstract

Recent advancements in diffusion models have positioned them at the forefront of image generation. Despite their superior performance, diffusion models are not without drawbacks; they are characterized by complex architectures and substantial computational demands, resulting in significant latency due to their iterative sampling process. To mitigate these limitations, we introduce a dual approach involving model miniaturization and a reduction in sampling steps, aimed at significantly decreasing model latency. Our methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature matching and score distillation. We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU, respectively. Moreover, our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.

Figure: Within a one-second budget, SDXL (16 NFEs) produces blurry images, while SDXS-1024 generates 30 clear images; the distilled model can additionally be used to train ControlNet.

Overview

  • The paper introduces SDXS-512 and SDXS-1024, enhanced latent diffusion models designed for high-speed, real-time image generation.

  • It addresses computational and operational efficiency through model miniaturization and a reduction in sampling steps, achieving up to 60× faster inference.

  • Methodological innovations include knowledge distillation and one-step diffusion model training, preserving image quality while minimizing computational demand.

  • Experimental results show these models maintain high image fidelity and coherence, with potential for future applications in real-time, interactive technologies.

SDXS: Accelerating Latent Diffusion Models for Real-Time Image Generation with Image Conditions

Introduction to Latent Diffusion Models and Existing Challenges

Latent diffusion models have recently emerged as a leading technology for image generation, demonstrating exceptional capability in producing high-quality images. Applied to tasks such as text-to-image generation, they have significantly advanced the field. Foundational models such as SD v1.5 and SDXL set the benchmark for quality; however, their intricate architectures and iterative sampling mechanisms impose substantial computational demands and operational latency.

Addressing the Challenges

To address these limitations, this work pursues a dual strategy of model miniaturization and a reduction in sampling steps, aiming not only to retain image-generation quality but also to significantly improve operational efficiency. The study introduces SDXS-512 and SDXS-1024, two models that reach inference speeds of approximately 100 FPS and 30 FPS on a single GPU for generating $512\times512$ and $1024\times1024$ images, respectively, a $30\times$ and $60\times$ speedup over their predecessors SD v1.5 and SDXL.
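As a rough illustration of how such a one-step model could be benchmarked with the diffusers library (the checkpoint id, sampler settings, and timing loop below are assumptions for exposition, not part of the paper):

```python
import time
import torch
from diffusers import StableDiffusionPipeline

# The repository id is an assumption; substitute the released SDXS checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "IDKiro/sdxs-512-0.9", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of a red fox standing in fresh snow"
pipe(prompt, num_inference_steps=1, guidance_scale=0.0)  # warm-up pass

torch.cuda.synchronize()
start = time.time()
n = 30
for _ in range(n):
    pipe(prompt, num_inference_steps=1, guidance_scale=0.0)
torch.cuda.synchronize()
print(f"throughput: {n / (time.time() - start):.1f} images/s")
```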

Methodological Insights

Model Miniaturization

A significant portion of the methodology centers on distilling the U-Net and VAE decoder within the latent diffusion framework. Through knowledge distillation, these components are streamlined, preserving high-quality output while markedly reducing computational overhead. In particular, a lightweight image decoder is trained to closely mimic the original VAE decoder's output, using a specially curated training loss that combines output distillation and a GAN loss, as sketched below.
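A minimal sketch of what such a decoder-distillation objective could look like in PyTorch; `tiny_decoder`, `discriminator`, and the loss weighting are illustrative placeholders rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def decoder_distillation_loss(latents, vae, tiny_decoder, discriminator):
    """Distill the frozen VAE decoder into a lightweight image decoder.

    Combines an output-distillation term (match the teacher's pixels) with a
    non-saturating GAN term (fool a discriminator), as described above.
    """
    with torch.no_grad():
        target = vae.decode(latents).sample   # frozen teacher (assumes a diffusers-style AutoencoderKL)
    pred = tiny_decoder(latents)              # lightweight student decoder

    distill_loss = F.l1_loss(pred, target)                 # output distillation
    gan_loss = F.softplus(-discriminator(pred)).mean()     # adversarial term

    return distill_loss + 0.1 * gan_loss      # weighting is illustrative
```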

Reduction in Sampling Steps

To circumvent the heavy computational cost of iterative sampling, the work introduces a one-step diffusion model (DM) training technique. This approach streamlines the sampling process, substantially reducing the latency of image generation. By incorporating feature matching and score distillation into the training regimen, the model transitions from multi-step to efficient one-step operation.
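An illustrative skeleton of how score distillation and feature matching might be combined into a one-step training objective; the noising coefficients, the `extract_features` helper, and the loss weights are assumptions for exposition, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def one_step_loss(student, teacher, noise, text_emb, t, ref_latent,
                  extract_features, lam_fm=1.0):
    """Combine score distillation with feature matching for a one-step generator."""
    # The student maps noise to a clean latent in a single forward pass.
    x0 = student(noise, t, text_emb)

    # Score distillation: re-noise the student sample and let the frozen
    # teacher denoise it; the mismatch supplies a gradient that pulls x0
    # toward the teacher's data distribution.
    eps = torch.randn_like(x0)
    alpha, sigma = 0.7, 0.7                   # placeholder schedule coefficients
    xt = alpha * x0 + sigma * eps
    with torch.no_grad():
        teacher_eps = teacher(xt, t, text_emb)
    sds_loss = F.mse_loss(x0, (x0 - (teacher_eps - eps)).detach())

    # Feature matching: compare intermediate teacher-network features of the
    # student sample against those of a reference latent from the multi-step teacher.
    f_student = extract_features(teacher, x0, t, text_emb)
    with torch.no_grad():
        f_ref = extract_features(teacher, ref_latent, t, text_emb)
    fm_loss = sum(F.mse_loss(a, b) for a, b in zip(f_student, f_ref))

    return sds_loss + lam_fm * fm_loss
```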

Experimental Validation and Outcomes

The efficiency of the SDXS models is demonstrated through comprehensive experimentation. Benchmarking against existing models such as SD v1.5 and SDXL highlights the substantial speedups achieved without compromising image quality. The models' efficacy holds across resolutions, showing latency improvements while maintaining competitive FID and CLIP scores, indicators of image fidelity and coherence with textual prompts.
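One common way to compute these two metrics is with the torchmetrics package; this is a generic recipe, not necessarily the evaluation pipeline used in the paper:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# Both metrics expect uint8 image tensors of shape (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def evaluate(real_images, generated_images, prompts):
    """Accumulate FID over real/generated batches and CLIP score over
    generated images paired with their text prompts."""
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    clip_score.update(generated_images, prompts)
    return fid.compute(), clip_score.compute()
```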

Further Application in Image-Conditioned Control

Building on these contributions, the paper also applies the optimized model to image-conditioned generation. By adapting the distilled model to work with ControlNet for efficient image-to-image translation, it opens avenues for deploying these capabilities on edge devices, highlighting the model's versatility and practical utility.
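A hedged sketch of single-step, edge-conditioned generation with the diffusers ControlNet pipeline; both checkpoint ids and the input edge map are placeholders, not the paper's released assets:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Checkpoint ids are assumptions; swap in the SDXS base model and a matching ControlNet.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "IDKiro/sdxs-512-0.9", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A precomputed edge map conditions the single-step generation.
edge_map = load_image("canny_edges.png")
image = pipe("a futuristic city at dusk", image=edge_map,
             num_inference_steps=1, guidance_scale=0.0).images[0]
image.save("controlled_output.png")
```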

Future Perspectives and Conclusion

The paper concludes by reflecting on promising future directions. The possibility of deploying such efficient, high-quality image generation models on low-power devices presents an exciting frontier for real-time, interactive applications across various sectors. By enabling real-time, efficient image generation with latent diffusion models, this work lays the groundwork for extending these advancements to broader AI-driven image and video generation tasks.
