
Abstract

Stable Diffusion XL (SDXL) has become the best open-source text-to-image (T2I) model for its versatility and top-notch image quality. Efficiently addressing the computational demands of SDXL models is crucial for wider reach and applicability. In this work, we introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively, achieved through progressive removal of layers using layer-level losses, focusing on reducing the model size while preserving generative quality. We release these models' weights at https://hf.co/Segmind. Our methodology involves the elimination of residual networks and transformer blocks from the U-Net structure of SDXL, resulting in significant reductions in parameters and latency. Our compact models effectively emulate the original SDXL by capitalizing on transferred knowledge, achieving competitive results against the larger multi-billion-parameter SDXL. Our work underscores the efficacy of knowledge distillation coupled with layer-level losses in reducing model size while preserving the high-quality generative capabilities of SDXL, thus facilitating more accessible deployment in resource-constrained environments.


Overview

  • The paper introduces Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, scaled-down versions of Stable Diffusion XL with fewer parameters but similar performance.

  • Knowledge distillation is used: a smaller model learns from a larger one, with the student's U-Net architecture refined by eliminating less-impactful layers.

  • A methodical pruning strategy helps identify which layers to remove, significantly reducing training steps and computational resources required.

  • Comparative evaluations show the compressed models perform close to SDXL with faster inference times, confirmed by a human preference study.

  • The paper emphasizes the role of the original large models in distillation and suggests future work could apply this approach to other AI models.

Introduction to Model Compression

Stable Diffusion XL (SDXL) is a state-of-the-art text-to-image model greatly admired for its image generation capabilities. However, due to its large size, the model demands considerable computational resources, which can be a barrier for many users. The paper presents an innovative approach to model compression that introduces scaled-down variants of SDXL, called Segmind Stable Diffusion (SSD-1B) and Segmind-Vega. These variants are designed with fewer parameters, aiming to deliver similar performance while enhancing accessibility and reducing computational load.

Knowledge Distillation Approach

The core of this model compression lies in knowledge distillation, a process in which a smaller model (the student) learns to replicate the performance of a larger model (the teacher). The authors achieved the size reduction by eliminating certain layers within SDXL's U-Net architecture, focusing on residual networks and transformer blocks that account for a substantial share of the parameters. This eliminates redundancy without compromising image quality. The paper also showcases how this technique preserves the high-quality generative capabilities of the original SDXL. The reduced-size models, released on popular machine learning platforms, illustrate the successful application of knowledge distillation at the layer level.
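To make the layer-level distillation concrete, below is a minimal PyTorch sketch of the kind of training objective described: the student's predicted noise is matched against both the true noise and the teacher's prediction, and activations at corresponding U-Net blocks are matched as well. The function name, loss weights, and feature-matching scheme are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of output-level plus layer-level (feature) distillation.
# The feature lists and loss weights are illustrative placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(noise_target, student_pred, teacher_pred,
                      student_feats, teacher_feats,
                      w_task=1.0, w_out=0.5, w_feat=0.5):
    # Ordinary denoising objective against the true added noise.
    task_loss = F.mse_loss(student_pred, noise_target)
    # Output-level distillation: mimic the teacher's predicted noise.
    out_loss = F.mse_loss(student_pred, teacher_pred)
    # Layer-level distillation: match activations at retained U-Net blocks.
    feat_loss = sum(F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))
    return w_task * task_loss + w_out * out_loss + w_feat * feat_loss
```

In practice the teacher would run under torch.no_grad(), and the feature lists would typically be collected with forward hooks on the blocks the student retains.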

Efficient Diffusion Models and Training

Investigating the efficient adaptation of diffusion models, the researchers adopted a methodical pruning strategy, rigorously evaluating which layers can be omitted. They chose layers whose absence had minimal impact on image generation quality, confirmed through both human evaluation and heuristic methods. Training details reveal that the models were optimized for high-resolution imagery and were trained using mixed precision on powerful GPUs, showcasing the intensive computational effort involved. Even so, the compression methods employed dramatically decreased both the training steps and the resources needed.
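As a rough illustration of such a pruning heuristic, the sketch below scores candidate blocks by how much skipping each one changes the model's output on a small evaluation batch. The module names, identity substitution, and MSE-based scoring are assumptions made for illustration, not the authors' exact procedure.

```python
# Illustrative heuristic for ranking blocks by removal impact (assumed, not
# the paper's exact method). Blocks with the lowest scores are the best
# candidates for elimination.
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_blocks_by_impact(model, block_names, eval_batch):
    baseline = model(**eval_batch)
    scores = {}
    for name in block_names:
        pruned = copy.deepcopy(model)
        # Swap the block for an identity so the forward pass still runs.
        parent, _, attr = name.rpartition(".")
        target = pruned.get_submodule(parent) if parent else pruned
        setattr(target, attr, torch.nn.Identity())
        scores[name] = F.mse_loss(pruned(**eval_batch), baseline).item()
    # Ascending order: least impactful (most removable) blocks first.
    return sorted(block_names, key=scores.get)
```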

Evaluation and Implications

Comparative evaluations highlight the potential of model compression. SSD-1B and Segmind-Vega performed impressively, benchmarking close to the larger SDXL’s output with significantly faster inference times. The validity of these findings was reinforced by a comprehensive human preference study, where the distilled SSD-1B model was even slightly favored over SDXL. These conclusions not only underscore the feasibility of compressing complex generative models but also hint at the applicability of such methods across other large machine learning models. The paper concludes by recognizing the importance of the parent models in distillation and suggests possible future explorations into distilling other major AI models.
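A quick way to sanity-check the latency gap is to time both pipelines with the diffusers library, as in the sketch below. The Hub repository ids and sampler settings are assumptions about where the released weights live (the abstract points to https://hf.co/Segmind), not the paper's exact benchmark setup.

```python
# Rough latency comparison between SDXL and a distilled variant via diffusers.
# Repository ids are assumed; requires a CUDA GPU and the `diffusers` library.
import time
import torch
from diffusers import StableDiffusionXLPipeline

def time_pipeline(repo_id, prompt, steps=25):
    pipe = StableDiffusionXLPipeline.from_pretrained(
        repo_id, torch_dtype=torch.float16, use_safetensors=True
    ).to("cuda")
    start = time.perf_counter()
    pipe(prompt, num_inference_steps=steps)
    return time.perf_counter() - start

prompt = "a photograph of an astronaut riding a horse"
for repo in ("stabilityai/stable-diffusion-xl-base-1.0", "segmind/SSD-1B"):
    print(repo, f"{time_pipeline(repo, prompt):.1f} s")
```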
