
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

(2307.01952)
Published Jul 4, 2023 in cs.CV and cs.AI

Abstract

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models

Figure: Varying the size conditioning improves image quality, demonstrated with four samples from a $512^2$ model.

Overview

  • SDXL builds on Stable Diffusion with a UNet backbone roughly three times larger, owing to additional attention blocks and a larger cross-attention context provided by a second text encoder.

  • The model omits transformer blocks at the highest feature level for efficiency and combines two text encoders, CLIP ViT-L and OpenCLIP ViT-bigG, for improved text conditioning.

  • SDXL introduces size conditioning and crop conditioning to better utilize training data and align generated images with object-centered aesthetics.

  • The paper emphasizes multi-aspect training to handle different image aspect ratios, promoting diversity in output that matches real-world distributions (a bucketing sketch follows this list).

  • SDXL emphasizes openness by sharing its source code and model weights, contrasting the typical 'black-box' nature of advanced image generators.
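Multi-aspect training, as described in the paper, partitions the data into buckets of different aspect ratios whose pixel counts stay close to $1024^2$, and draws each training batch from a single bucket. Below is a minimal sketch of such bucket assignment; the specific bucket table and the nearest_bucket helper are illustrative assumptions, not taken from the paper.

```python
# Illustrative bucket list: (height, width) pairs with roughly 1024*1024 pixels.
BUCKETS = [(640, 1536), (768, 1344), (832, 1216), (896, 1152),
           (1024, 1024), (1152, 896), (1216, 832), (1344, 768), (1536, 640)]

def nearest_bucket(height: int, width: int) -> tuple[int, int]:
    """Assign an image to the bucket whose aspect ratio is closest to its own."""
    ratio = height / width
    return min(BUCKETS, key=lambda hw: abs(hw[0] / hw[1] - ratio))

# Each training batch is drawn from a single bucket, and the model is
# additionally conditioned on the bucket's (height, width) as described below.
print(nearest_bucket(720, 1280))  # -> (768, 1344)
```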

Introduction to SDXL

In the rapidly evolving landscape of text-to-image synthesis, SDXL emerges as a remarkable enhancement of the widely known Stable Diffusion framework. Its predecessor has already established itself as a foundational tool for a myriad of applications, ranging from entertainment to scientific visualization. SDXL takes a leap forward with an expanded UNet backbone roughly three times the size of its predecessors', achieved through a denser arrangement of attention blocks and an enlarged cross-attention context enabled by a dual text encoder. The architecture also introduces conditioning techniques that require no additional supervision during training, as well as a distinct refinement model that post-processes generated samples with an image-to-image pass to improve visual fidelity.
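The paper does not prescribe an inference API, but the released weights can be exercised through the Hugging Face diffusers library. The following is a minimal sketch of the base-plus-refiner workflow; the model identifiers and pipeline classes come from diffusers, not from the paper itself.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Base model generates an initial 1024x1024 sample from the text prompt.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
prompt = "a photograph of an astronaut riding a horse"
image = base(prompt=prompt).images[0]

# The refiner applies a post-hoc image-to-image pass to improve local detail.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")
refined = refiner(prompt=prompt, image=image).images[0]
refined.save("astronaut.png")
```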

Architectural Enrichments in SDXL

The architectural advances manifest in several dimensions. The model drops transformer blocks at the highest feature level for efficiency and instead deploys them extensively at lower levels; this heterogeneous distribution of transformer blocks shifts computation toward lower-resolution features within the UNet. A salient departure from the original architecture is the adoption of a combined text encoder: the penultimate-layer outputs of CLIP ViT-L and OpenCLIP ViT-bigG are concatenated along the channel axis to strengthen text conditioning. The pooled text embedding from OpenCLIP further fortifies the text-based conditioning, resulting in a UNet with 2.6 billion parameters.
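Concretely, the cross-attention context is formed by concatenating the two encoders' token-level outputs along the channel axis. The sketch below uses random tensors purely to illustrate the shapes (77 tokens, 768-dim CLIP ViT-L features, 1280-dim OpenCLIP ViT-bigG features); real usage would take these from the encoders' penultimate layers.

```python
import torch

batch, tokens = 1, 77

# Token-level embeddings from the penultimate layer of each text encoder
# (random stand-ins here; shapes match CLIP ViT-L and OpenCLIP ViT-bigG).
clip_l_hidden = torch.randn(batch, tokens, 768)
openclip_g_hidden = torch.randn(batch, tokens, 1280)

# Concatenate along the channel axis -> (batch, 77, 2048) cross-attention context.
context = torch.cat([clip_l_hidden, openclip_g_hidden], dim=-1)

# The pooled OpenCLIP embedding serves as an extra global conditioning vector.
pooled = torch.randn(batch, 1280)
print(context.shape)  # torch.Size([1, 77, 2048])
```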

Innovations in Conditioning Techniques

SDXL introduces two ingenious conditioning mechanisms. The first, size conditioning, feeds the original spatial dimensions of each training image to the model, mitigating the need to upscale or discard images below a preset resolution threshold, an issue that has historically handicapped LDMs. This ensures a more thorough utilization of the available data without sacrificing generalization capability. The second, crop conditioning, informs the model of the amount of cropping applied during training, neutralizing the negative artifacts that random cropping can induce and aligning the model's outputs with aesthetically appealing, object-centered framing. Furthermore, multi-aspect training prepares SDXL for handling multiple aspect ratios, a significant step towards creating diverse and naturally appealing images in line with the real-world distribution of aspect ratios.
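Both conditioning signals are embedded with the same kind of sinusoidal Fourier feature encoding used for diffusion timesteps, concatenated, and added to the timestep embedding. A minimal sketch follows; the embedding width (dim=256) and the fourier_embed helper are illustrative assumptions.

```python
import math
import torch

def fourier_embed(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    # Sinusoidal (Fourier) feature encoding of a scalar, analogous to the
    # encoding used for diffusion timesteps.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = x[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

# Micro-conditioning inputs: original image size (h, w) and crop offsets (top, left).
h_orig = torch.tensor([512.0])
w_orig = torch.tensor([384.0])
c_top = torch.tensor([0.0])
c_left = torch.tensor([64.0])

# Embed each scalar independently, concatenate into a single vector, and add
# the result to the UNet's timestep embedding (projection omitted here).
cond = torch.cat([fourier_embed(v) for v in (h_orig, w_orig, c_top, c_left)], dim=-1)
print(cond.shape)  # torch.Size([1, 1024])
```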

Unified Improvement and Transparency

Ultimately, SDXL goes beyond an incremental improvement by providing a cohesive model that combines structural and conditioning advances. In stark contrast to the 'black-box' approach often characteristic of cutting-edge image generators, SDXL stands out by releasing its code and model weights to the community, fostering an environment of open research and methodological transparency. This approach addresses concerns regarding reproducibility, innovation, and the assessment of biases in AI image-generation models, all while achieving superior performance and more visually compelling outputs than past iterations of Stable Diffusion.

The presented work marks a significant progression in the text-to-image domain, offering several trajectories for further enhancement in model performance and architecture, and for distillation to reduce computational demands. The transparency and open nature of SDXL serve as a catalyst for ongoing research and may pave the way for future breakthroughs in generative modeling.
