
High-Resolution Image Synthesis with Latent Diffusion Models (2112.10752v2)

Published 20 Dec 2021 in cs.CV

Abstract: By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .

Citations (11,929)

Summary

  • The paper introduces Latent Diffusion Models (LDMs) that decouple perceptual compression from generative modeling for efficient high-resolution image synthesis.
  • It employs a time-conditional UNet in latent space to achieve faster convergence and superior FID scores compared to traditional pixel-based diffusion models.
  • The approach supports multi-modal conditioning, delivering state-of-the-art results in text-to-image synthesis, super-resolution, and inpainting tasks.

High-Resolution Image Synthesis with Latent Diffusion Models

Introduction and Motivation

The paper introduces Latent Diffusion Models (LDMs), a framework for high-resolution image synthesis that leverages diffusion models in the latent space of pretrained autoencoders. Traditional diffusion models (DMs) operate in pixel space, which incurs substantial computational costs for both training and inference due to the high dimensionality and the need to model imperceptible details. LDMs address this by decoupling perceptual compression from generative modeling: an autoencoder first compresses images into a lower-dimensional latent space that preserves perceptual fidelity, and a diffusion model is then trained in this space. This approach enables efficient training and sampling, democratizing access to high-quality generative models.

Methodology

Perceptual Compression via Autoencoders

LDMs begin with a perceptual compression stage using an autoencoder trained with a combination of perceptual and adversarial losses. The encoder $E$ maps an image $x \in \mathbb{R}^{H \times W \times 3}$ to a latent $z = E(x)$, and the decoder $D$ reconstructs $x$ from $z$. The latent space is regularized either via a KL penalty (VAE-style) or vector quantization (VQGAN-style), with the latter often yielding better scaling properties for downstream generative modeling. The compression factor $f$ (the downsampling ratio) is a critical hyperparameter, with $f=4$ or $f=8$ providing a favorable trade-off between efficiency and reconstruction fidelity.

Figure 1: Perceptual and semantic compression: LDMs suppress semantically meaningless information, enabling efficient training and inference in latent space.
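
As a rough illustration of how the compression factor translates into latent dimensions, the sketch below (an illustrative assumption, not the authors' code) computes the latent shape implied by $f$; the channel count is a free design choice here.

```python
# Minimal sketch (not the authors' code): the latent shape implied by a
# downsampling factor f for an H x W RGB image.
def latent_shape(height, width, f=8, latent_channels=3):
    """Shape of z = E(x) when the autoencoder downsamples by factor f."""
    assert height % f == 0 and width % f == 0, "image size must be divisible by f"
    return (latent_channels, height // f, width // f)

# For a 256x256 image with f=4 and a 3-channel latent this gives (3, 64, 64),
# matching the 64x64x3 latent noted under Implementation Considerations.
print(latent_shape(256, 256, f=4, latent_channels=3))  # (3, 64, 64)
print(latent_shape(256, 256, f=8, latent_channels=3))  # (3, 32, 32)
```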

Latent Diffusion Modeling

The diffusion model is trained in the latent space using a time-conditional UNet backbone. The objective is a denoising score-matching loss:

$$L_{LDM} := \mathbb{E}_{E(x),\, \epsilon \sim \mathcal{N}(0, 1),\, t}\left[ \| \epsilon - \epsilon_\theta(z_t, t) \|_2^2 \right]$$

where $z_t$ is a noisy version of the latent $z$ produced by the forward diffusion process and $\epsilon_\theta$ is the time-conditional UNet. This formulation allows the model to focus on semantic content rather than pixel-level noise, resulting in faster convergence and reduced compute requirements.
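
To make the objective concrete, here is a minimal PyTorch sketch of one training step under the standard DDPM noise parameterization. The encoder `E`, the denoiser `eps_model`, and the 1-D noise schedule `alphas_cumprod` are assumed to be defined elsewhere; this is a sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(E, eps_model, x, alphas_cumprod, optimizer):
    """One denoising training step in latent space (standard DDPM objective)."""
    with torch.no_grad():
        z = E(x)                                         # compress the image to a latent
    t = torch.randint(0, len(alphas_cumprod), (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)                            # target noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)          # \bar{alpha}_t per sample
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * eps  # forward diffusion of the latent
    loss = F.mse_loss(eps_model(z_t, t), eps)            # || eps - eps_theta(z_t, t) ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```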

Conditioning Mechanisms

LDMs generalize to conditional generation tasks by augmenting the UNet with cross-attention layers. Conditioning inputs $y$ (e.g., text, semantic maps, bounding boxes) are encoded via domain-specific encoders (e.g., transformers for text) and injected into the UNet via cross-attention. This enables flexible multi-modal conditioning and supports tasks such as text-to-image, layout-to-image, and semantic synthesis.

Figure 2: Conditioning LDMs via concatenation or cross-attention enables flexible multi-modal generation.
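
A hedged sketch of such a cross-attention block is shown below; it uses PyTorch's built-in multi-head attention for brevity, and the tensor layout and module names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries come from UNet features; keys/values come from the conditioning encoder."""
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)

    def forward(self, h, cond):
        # h:    (B, H*W, dim)    flattened intermediate UNet feature map
        # cond: (B, M, cond_dim) tokens from a domain-specific encoder, e.g. a text transformer
        out, _ = self.attn(self.norm(h), cond, cond)  # queries attend to the conditioning tokens
        return h + out                                # residual connection back into the UNet
```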

Experimental Results

Efficiency and Compression Trade-offs

Extensive experiments demonstrate that LDMs with moderate compression ($f=4$ or $f=8$) achieve superior sample quality and throughput compared to pixel-based DMs and highly compressed autoregressive (AR) models. For example, on ImageNet, LDM-4 achieves a 38-point FID improvement over a pixel-based DM after 2M training steps, with a $2.7\times$ speedup in sampling.

Figure 3: FID and IS on LSUN-Churches vs. training iterations, showing rapid convergence for LDMs with moderate compression.

Unconditional and Conditional Generation

LDMs set new state-of-the-art FID scores for unconditional synthesis on CelebA-HQ (FID=5.11) and competitive results on FFHQ, LSUN-Churches, and LSUN-Bedrooms. Precision and recall metrics confirm improved mode coverage over GANs. For class-conditional ImageNet generation, LDMs outperform previous DMs (ADM) with fewer parameters and lower compute.

Figure 4: LDM samples on CelebA-HQ, FFHQ, LSUN-Churches, LSUN-Bedrooms, and ImageNet, demonstrating high-fidelity synthesis.

Text-to-Image and Layout-to-Image Synthesis

The cross-attention mechanism enables LDMs to excel in text-to-image synthesis. On MS-COCO, LDM-KL-8-G achieves FID=12.61 and IS=26.62, matching or surpassing recent AR and DM baselines with significantly fewer parameters. Layout-to-image models trained on OpenImages and COCO outperform prior works in FID and qualitative diversity.

Figure 5: Text-to-image samples for user-defined prompts, illustrating semantic alignment and visual diversity.

Super-Resolution and Inpainting

LDMs are effective for super-resolution and inpainting tasks. On ImageNet $64 \rightarrow 256$ super-resolution, LDM-SR achieves FID=2.4, outperforming SR3 in FID while maintaining competitive IS. Inpainting experiments on Places show LDMs surpass LaMa and CoModGAN in FID, with user studies confirming human preference for LDM outputs.

Figure 6: Super-resolution comparison: LDM-SR renders realistic textures and generalizes to arbitrary inputs.
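
For super-resolution, conditioning via simple concatenation (as indicated in Figure 2) can be sketched as follows; the function name, shapes, and the bicubic resizing choice are assumptions for illustration and are not taken from the paper's code.

```python
# Hedged sketch of concatenation-based conditioning for super-resolution:
# the low-resolution image is resized to the latent's spatial size (a no-op
# when they already match, e.g. 64x64 with f=4) and concatenated channel-wise
# with the noisy latent before being passed to the denoiser.
import torch
import torch.nn.functional as F

def sr_denoiser_input(z_t, x_lowres):
    # z_t:      (B, C, h, w)      noisy latent at timestep t
    # x_lowres: (B, 3, H_lr, W_lr) low-resolution conditioning image
    if x_lowres.shape[-2:] != z_t.shape[-2:]:
        x_lowres = F.interpolate(x_lowres, size=z_t.shape[-2:], mode="bicubic",
                                 align_corners=False)
    return torch.cat([z_t, x_lowres], dim=1)  # channel-wise concatenation
```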

Implementation Considerations

  • Autoencoder Training: Use perceptual and adversarial losses; regularize the latent space with KL or VQ. For $f=4$, the latent dimensionality is $64 \times 64 \times 3$ for $256 \times 256$ images.
  • Diffusion Model: UNet backbone with time embedding; train in latent space using a reweighted denoising loss. For conditional tasks, inject conditioning via cross-attention.
  • Sampling: DDIM or DDPM samplers; 50–200 steps suffice for high-quality synthesis in latent space (see the sketch after this list).
  • Compute: LDMs require 1–2 orders of magnitude less compute than pixel-based DMs or AR models. Training on a single A100 GPU is feasible for $256^2$ images.
  • Scaling: Convolutional sampling enables generalization to resolutions $>256^2$ (e.g., $1024^2$) for spatially conditioned tasks.
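
The sampling bullet above can be fleshed out with a minimal deterministic DDIM loop in latent space ($\eta = 0$) followed by decoding. The denoiser `eps_model`, the decoder `D`, and a 1-D `alphas_cumprod` schedule on the same device are assumed to exist elsewhere; this is a sketch under those assumptions, not the authors' implementation.

```python
import torch

@torch.no_grad()
def sample_ldm(eps_model, D, alphas_cumprod, shape, num_steps=50, device="cuda"):
    """Deterministic DDIM sampling (eta = 0) in latent space, then decode with D."""
    z = torch.randn(shape, device=device)                     # start from Gaussian noise
    T = len(alphas_cumprod)
    timesteps = torch.linspace(T - 1, 0, num_steps, device=device).long()
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]                               # \bar{alpha}_t
        a_prev = (alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps
                  else torch.tensor(1.0, device=device))      # \bar{alpha} of the next (earlier) step
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = eps_model(z, t_batch)                           # predict the added noise
        z0 = (z - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        z = a_prev.sqrt() * z0 + (1.0 - a_prev).sqrt() * eps  # DDIM update, eta = 0
    return D(z)                                               # decode latents to pixels
```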

Limitations and Future Directions

While LDMs dramatically reduce computational requirements, sequential sampling remains slower than GANs. Reconstruction fidelity of the autoencoder can be a bottleneck for tasks requiring pixel-level accuracy. Further research is needed on scaling to even higher resolutions, improving autoencoder architectures, and exploring new conditioning modalities. Societal impacts include democratization of generative modeling but also risks of misuse and data leakage.

Conclusion

Latent Diffusion Models represent a significant advance in efficient, high-fidelity image synthesis. By decoupling perceptual compression from generative modeling and leveraging cross-attention for flexible conditioning, LDMs achieve state-of-the-art results across unconditional, conditional, super-resolution, and inpainting tasks, all while drastically reducing computational requirements. This framework opens new avenues for scalable, accessible generative modeling in both research and real-world applications.
