SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (2307.01952v1)
Abstract: We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
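As a rough illustration of why a second text encoder enlarges the cross-attention context, the sketch below concatenates per-token features from two encoders along the channel axis. The abstract does not say how the two outputs are combined, so channel-wise concatenation is an assumption here, and the token count and feature widths (77, 768, 1280) are illustrative placeholders rather than values taken from the abstract. The function name `concat_text_contexts` is hypothetical and is not part of the released code.

```python
from typing import List

Matrix = List[List[float]]  # shape: [num_tokens][channels]


def concat_text_contexts(ctx_a: Matrix, ctx_b: Matrix) -> Matrix:
    """Join per-token features from two text encoders along the channel axis.

    Hypothetical helper: assumes both encoders emit one feature vector per
    token and that the token sequences are aligned.
    """
    if len(ctx_a) != len(ctx_b):
        raise ValueError("both encoders must emit the same number of tokens")
    # Each output row is the first encoder's features followed by the second's.
    return [row_a + row_b for row_a, row_b in zip(ctx_a, ctx_b)]


# Illustrative sizes only (assumptions, not stated in the abstract).
NUM_TOKENS = 77
DIM_A = 768    # feature width of the first text encoder
DIM_B = 1280   # feature width of the second text encoder

ctx_a = [[0.0] * DIM_A for _ in range(NUM_TOKENS)]  # stand-in encoder output
ctx_b = [[1.0] * DIM_B for _ in range(NUM_TOKENS)]  # stand-in encoder output

context = concat_text_contexts(ctx_a, ctx_b)
print(len(context), len(context[0]))  # 77 2048
```

Under these assumptions, each cross-attention layer in the UNet would attend over the same number of tokens but with wider key/value inputs, which is one way the parameter count can grow alongside the additional attention blocks.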