Consistency Models (2303.01469v2)

Published 2 Mar 2023 in cs.LG, cs.CV, and stat.ML

Abstract: Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.


Summary

The paper introduces Consistency Models, which bypass iterative diffusion sampling by learning a function that maps any point on the PF ODE trajectory directly back to the original data.
It details two training methods—Consistency Distillation and Consistency Training—that enforce self-consistency using parameterized functions and EMA-based target networks.
Practical applications include fast one-step generation and versatile image editing tasks, achieving state-of-the-art FID scores on datasets like CIFAR-10 and ImageNet.

This paper introduces Consistency Models (CMs) (2303.01469), a new class of generative models designed to address the slow iterative sampling process inherent in diffusion models while retaining many of their benefits. The core idea is to learn a function that directly maps points from any time step on a Probability Flow (PF) ODE trajectory back to the trajectory's origin (the data sample).

Core Concept: Consistency Function

PF ODE: Diffusion models rely on a PF ODE (Eq. 2) that transforms data $x_0$ into noise $x_T$ . The reverse process generates data from noise $x_T$ by solving the ODE backward.
Consistency Function: Defined as $f: (x_t, t) \mapsto x_\epsilon$, where $x_t$ is a point on the ODE trajectory at time $t$ and $x_\epsilon$ is the start of that trajectory, i.e. the (approximate) data point, taken at a small time $\epsilon > 0$ rather than exactly at 0.
Self-Consistency Property: The defining characteristic is that for any two points $(x_t, t)$ and $(x_{t'}, t')$ on the same ODE trajectory, the consistency function output is identical: $f(x_t, t) = f(x_{t'}, t') = x_\epsilon$ .
Consistency Model: A parameterized function $F_\theta(x, t)$ is trained to approximate the true consistency function $f$ by enforcing this self-consistency property.
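
To make the self-consistency property concrete, here is a minimal sketch on a toy problem that is not from the paper: one-dimensional data drawn from $N(\mu, s^2)$. It assumes the EDM-style PF ODE $dx/dt = -t \, \nabla \log p_t(x)$ that the paper adopts, for which this toy case has a closed-form score and a closed-form consistency function. All names and constants (`MU`, `S`, `pf_ode_trajectory`, ...) are illustrative assumptions.

```python
import math

MU, S = 1.0, 0.5        # toy data distribution N(MU, S^2) (assumption, not from the paper)
EPS, T = 0.002, 80.0    # smallest and largest time on the trajectory

def score(x, t):
    """Score of the noised marginal p_t = N(MU, S^2 + t^2)."""
    return -(x - MU) / (S**2 + t**2)

def consistency_fn(x, t):
    """Closed-form f(x, t) for the toy case: follow the PF ODE from t back to EPS."""
    return MU + math.sqrt((S**2 + EPS**2) / (S**2 + t**2)) * (x - MU)

def pf_ode_trajectory(x_T, n_steps=20000):
    """Integrate dx/dt = -t * score(x, t) from T down to EPS with Euler steps
    on a log-spaced time grid; returns the visited (x, t) pairs."""
    ratio = (EPS / T) ** (1.0 / n_steps)   # constant factor between successive times
    x, t = x_T, T
    points = [(x, t)]
    for _ in range(n_steps):
        t_next = t * ratio
        x = x + (t_next - t) * (-t * score(x, t))
        t = t_next
        points.append((x, t))
    return points

trajectory = pf_ode_trajectory(x_T=MU + 60.0)
x_eps = trajectory[-1][0]
# Evaluate f at a handful of points along the same trajectory.
outputs = [consistency_fn(x, t) for x, t in trajectory[::2000]]
max_gap = max(abs(out - x_eps) for out in outputs)
print(f"x_eps = {x_eps:.4f}, largest deviation of f along the trajectory = {max_gap:.2e}")
```

Every point on the numerically integrated trajectory maps to (nearly) the same $x_\epsilon$; the small residual comes from the Euler discretization, not from the consistency function.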

Implementation: Parameterization

A crucial aspect is enforcing the boundary condition $F_\theta(x, \epsilon) = x$. The paper adopts a practical parameterization based on skip connections:

$F_\theta(x, t) = c_\text{skip}(t) \, x + c_\text{out}(t) \, \text{NN}_\theta(x, t)$

where $\text{NN}_\theta(x, t)$ is a neural network (e.g., based on diffusion model architectures like U-Net), and $c_\text{skip}(t)$ and $c_\text{out}(t)$ are differentiable functions satisfying $c_\text{skip}(\epsilon) = 1$ and $c_\text{out}(\epsilon) = 0$.

This structure ensures the boundary condition is met and allows leveraging existing diffusion model architectures. The paper uses modified versions of the scaling factors from EDM (2206.00364) to satisfy this for $\epsilon > 0$ .
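
As a rough illustration of how the boundary condition falls out of this parameterization, the sketch below uses the modified EDM scalings as recalled from the paper's appendix: $c_\text{skip}(t) = \sigma_\text{data}^2 / ((t-\epsilon)^2 + \sigma_\text{data}^2)$ and $c_\text{out}(t) = \sigma_\text{data}(t-\epsilon) / \sqrt{\sigma_\text{data}^2 + t^2}$, with $\sigma_\text{data} = 0.5$ and $\epsilon = 0.002$. Treat these formulas and constants as assumptions to check against the paper; `dummy_network` is a hypothetical stand-in for $\text{NN}_\theta$.

```python
import math

SIGMA_DATA = 0.5   # assumed data standard deviation (EDM convention)
EPS = 0.002        # assumed smallest time

def c_skip(t):
    """Weight on the input x; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)

def c_out(t):
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)

def consistency_model(network, x, t):
    """F_theta(x, t) = c_skip(t) * x + c_out(t) * NN_theta(x, t)."""
    return c_skip(t) * x + c_out(t) * network(x, t)

def dummy_network(x, t):
    """Hypothetical stand-in for NN_theta; any function of (x, t) would do."""
    return math.sin(3.0 * x) + t

x = 1.234
print(consistency_model(dummy_network, x, EPS) == x)  # boundary condition holds for any network
print(c_skip(80.0), c_out(80.0))                       # at large t the network output dominates
```

Because the boundary condition is built into the scalings, it holds regardless of the network's weights, so training never has to learn it.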

Implementation: Sampling

One-Step Generation: Sample noise $x_T \sim N(0, T^2I)$ and compute the data sample directly: $\hat{x}_\epsilon = F_\theta(x_T, T)$ . This is very fast, requiring only one network evaluation.
Multi-Step Sampling (Algorithm 1): Improves sample quality at the cost of extra network evaluations by alternating noise injection with CM denoising. Starting from the one-step sample $\hat{x}^{(0)} = F_\theta(x_T, T)$, repeat for $n = 1, ..., N-1$:
- Select the next time $\tau_n$ from a predefined sequence $T > \tau_1 > ... > \tau_{N-1} > \epsilon$.
- Add noise: Sample $z \sim N(0, I)$ and compute $x_{\tau_n} = \hat{x}^{(n-1)} + \sqrt{\tau_n^2 - \epsilon^2} z$.
- Denoise: Compute $\hat{x}^{(n)} = F_\theta(x_{\tau_n}, \tau_n)$.
- After the final iteration, output $\hat{x}^{(N-1)}$.
- The sequence $\{\tau_n\}$ can be chosen with an optimization method such as greedy ternary search to minimize FID.
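
The sampling loop can be sketched in a few lines. The version below is a minimal pure-Python illustration, not the paper's code: `toy_model` is a hypothetical stand-in for a trained $F_\theta$ (the exact consistency function for toy data $N(1, 0.5^2)$), and the time points `[20.0, 2.0]` are arbitrary, not tuned values from the paper.

```python
import math
import random

EPS, T = 0.002, 80.0

def multistep_sample(model, taus, dim, rng):
    """Multistep consistency sampling.

    model(x, t) maps a list of floats and a time to a list of floats.
    taus is a decreasing sequence with T > taus[0] > ... > taus[-1] > EPS.
    An empty taus reduces to one-step generation.
    """
    x_T = [rng.gauss(0.0, T) for _ in range(dim)]   # x_T ~ N(0, T^2 I)
    x = model(x_T, T)                               # one-step sample
    for tau in taus:
        noise_scale = math.sqrt(tau**2 - EPS**2)
        x_tau = [xi + noise_scale * rng.gauss(0.0, 1.0) for xi in x]  # add noise
        x = model(x_tau, tau)                                         # denoise
    return x

def toy_model(x, t, mu=1.0, s=0.5):
    """Hypothetical stand-in for F_theta: exact consistency function for N(mu, s^2) data."""
    scale = math.sqrt((s**2 + EPS**2) / (s**2 + t**2))
    return [mu + scale * (xi - mu) for xi in x]

rng = random.Random(0)
samples = [multistep_sample(toy_model, [20.0, 2.0], dim=1, rng=rng)[0] for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((v - mean) ** 2 for v in samples) / len(samples))
print(f"sample mean = {mean:.3f}, sample std = {std:.3f}")
```

Each extra entry in `taus` costs one more network evaluation, which is the compute-for-quality trade described above.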

Training Method 1: Consistency Distillation (CD)

This method trains a CM $F_\theta$ by distilling knowledge from a pre-trained diffusion (score) model $s_\phi$ .

Goal: Enforce $F_\theta(x_{t_{n+1}}, t_{n+1}) \approx F_{\theta^-}(\hat{x}_{t_n}^\phi, t_n)$ for adjacent points on the empirical PF ODE trajectory defined by $s_\phi$ .
Process (Algorithm 2):
- Sample data $x_0$ .
- Sample time index $n \sim U\{1, ..., N-1\}$ .
- Generate noisy sample $x_{t_{n+1}} \sim N(x_0, t_{n+1}^2 I)$ .
- Use one step of a numerical ODE solver (e.g., Heun) with the score model $s_\phi$ to estimate the previous point: $\hat{x}_{t_n}^\phi = x_{t_{n+1}} + (t_n - t_{n+1})\Phi(x_{t_{n+1}}, t_{n+1}; s_\phi)$ .
- Minimize the consistency distillation loss (Eq. 7):
  
  $L_\text{CD}^N = \mathbb{E}[\lambda(t_n) d(F_\theta(x_{t_{n+1}}, t_{n+1}), F_{\theta^-}(\hat{x}_{t_n}^\phi, t_n))]$
Implementation Details:
- $F_{\theta^-}$ is a target network, updated via Exponential Moving Average (EMA) of $F_\theta$ (Eq. 8). Using stop_gradient on the target network output is crucial for stability.
- $d(\cdot, \cdot)$ is a distance metric. LPIPS (1801.03924) works best for images, outperforming L1 and L2.
- $\lambda(t_n)$ is a weighting function (often set to 1).
- Higher-order ODE solvers (like Heun) generally perform better than lower-order ones (like Euler) for computing $\hat{x}_{t_n}^\phi$ .
- The number of discretization intervals $N$ needs tuning (e.g., $N=18$ for CIFAR-10 with Heun).
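
The pieces of one CD training step can be sketched as follows. This is a toy, scalar illustration under several assumptions that are not in the text above: the time grid uses the Karras et al. spacing with $\rho = 7$ (as recalled from the paper), the PF ODE is $dx/dt = -t \, s_\phi(x, t)$, the "pre-trained" score is the exact score of toy data $N(1, 0.5^2)$, the distance is squared L2 rather than LPIPS, and `model` is a hypothetical two-parameter consistency model. No gradients are computed; the target is simply treated as a constant.

```python
import math
import random

EPS, T, RHO = 0.002, 80.0, 7.0
SIGMA_DATA = 0.5
MU, S = 1.0, 0.5   # toy 1-D data distribution N(MU, S^2) (assumption)

def time_grid(N):
    """Boundaries t_1 = EPS < ... < t_N = T; the returned list is 0-indexed."""
    lo, hi = EPS ** (1 / RHO), T ** (1 / RHO)
    return [(lo + i / (N - 1) * (hi - lo)) ** RHO for i in range(N)]

def score(x, t):
    """Stand-in for the pre-trained score model s_phi."""
    return -(x - MU) / (S**2 + t**2)

def heun_step(x, t_from, t_to):
    """One Heun (second-order) step of dx/dt = -t * score(x, t)."""
    d1 = -t_from * score(x, t_from)
    x_euler = x + (t_to - t_from) * d1
    d2 = -t_to * score(x_euler, t_to)
    return x + (t_to - t_from) * 0.5 * (d1 + d2)

def model(theta, x, t):
    """Toy F_theta: skip parameterization around a linear 'network' a * x + b."""
    a, b = theta
    c_skip = SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)
    return c_skip * x + c_out * (a * x + b)

def cd_loss_sample(theta, theta_ema, x0, n, z, grid):
    """Single-sample CD loss with lambda = 1 and squared L2 distance."""
    t_n, t_np1 = grid[n - 1], grid[n]        # t_n and t_{n+1} for 1-indexed n
    x_np1 = x0 + t_np1 * z                   # x_{t_{n+1}} ~ N(x0, t_{n+1}^2)
    x_n_hat = heun_step(x_np1, t_np1, t_n)   # solver estimate of x_{t_n}
    online = model(theta, x_np1, t_np1)
    target = model(theta_ema, x_n_hat, t_n)  # constant target (stop-gradient)
    return (online - target) ** 2

def ema_update(theta_ema, theta, mu):
    """theta_ema <- mu * theta_ema + (1 - mu) * theta."""
    return tuple(mu * e + (1 - mu) * p for e, p in zip(theta_ema, theta))

rng = random.Random(0)
grid = time_grid(18)
theta, theta_ema = (0.1, 0.2), (0.1, 0.2)
x0 = rng.gauss(MU, S)
n = rng.randint(1, len(grid) - 1)            # n ~ U{1, ..., N-1}
loss = cd_loss_sample(theta, theta_ema, x0, n, rng.gauss(0.0, 1.0), grid)
theta_ema = ema_update(theta_ema, theta, mu=0.999)
print(f"n = {n}, loss = {loss:.6f}")
```

In a real implementation the loss would be averaged over a batch and differentiated with respect to `theta` only, with the EMA update applied after each optimizer step.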

Training Method 2: Consistency Training (CT)

This method trains a CM $F_\theta$ from scratch, without requiring a pre-trained diffusion model. It makes CMs an independent class of generative models.

Goal: Enforce $F_\theta(x_0 + t_{n+1}z, t_{n+1}) \approx F_{\theta^-}(x_0 + t_n z, t_n)$ , where $z \sim N(0, I)$ .
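Process (Algorithm 3): Justified by Theorem 2, which shows that when the CD loss is computed with an Euler solver and the true score function, it matches the CT loss (Eq. 9) up to an error that vanishes as the step size shrinks. The pre-trained score model can therefore be dropped.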
- Sample data $x_0$ .
- Sample time index $n \sim U\{1, ..., N(k)-1\}$ (where $N(k)$ increases during training).
- Sample noise $z \sim N(0, I)$ .
- Minimize the consistency training loss:
  
  $L_\text{CT}^N = \mathbb{E}[\lambda(t_n) d(F_\theta(x_0 + t_{n+1}z, t_{n+1}), F_{\theta^-}(x_0 + t_n z, t_n))]$
Implementation Details:
- Uses the same EMA target network $F_{\theta^-}$ as CD.
- Crucially uses adaptive schedules for the number of time steps $N(k)$ and the EMA decay rate $\mu(k)$ (where $k$ is the training step). $N(k)$ typically starts small and increases, while $\mu(k)$ starts at a value such as 0.9 and rises toward 1. This balances convergence speed and final quality. Appendix C provides specific schedule formulas.
- LPIPS is also effective here.
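
The adaptive schedules can be sketched as below. The formulas are written as recalled from the paper's Appendix C, $N(k) = \lceil \sqrt{(k/K)((s_1+1)^2 - s_0^2) + s_0^2} - 1 \rceil + 1$ and $\mu(k) = \exp(s_0 \log \mu_0 / N(k))$, with $K$ the total number of training steps; the defaults $s_0 = 2$, $s_1 = 150$, $\mu_0 = 0.9$ are the CIFAR-10 values as recalled. Treat all of these as assumptions to verify against the paper. `ct_pair` is a hypothetical helper showing that both network inputs share the same noise $z$.

```python
import math

def n_schedule(k, K, s0=2, s1=150):
    """Number of discretization steps N(k) at training step k out of K."""
    return math.ceil(math.sqrt(k / K * ((s1 + 1) ** 2 - s0 ** 2) + s0 ** 2) - 1) + 1

def mu_schedule(k, K, s0=2, s1=150, mu0=0.9):
    """EMA decay rate mu(k); moves toward 1 as N(k) grows."""
    return math.exp(s0 * math.log(mu0) / n_schedule(k, K, s0, s1))

def ct_pair(x0, z, t_n, t_np1):
    """Inputs for the online network (at t_{n+1}) and the target network (at t_n).
    Both reuse the same data point x0 and the same noise z."""
    return x0 + t_np1 * z, x0 + t_n * z

K = 1000
for k in (0, 250, 500, 1000):
    print(k, n_schedule(k, K), round(mu_schedule(k, K), 5))
```

Early in training the grid is coarse and the target network tracks the online network closely; late in training the grid is fine and the target changes slowly.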

Practical Applications & Results

Fast Generation: CMs achieve state-of-the-art FID scores for one-step and two-step generation on CIFAR-10 (3.55/2.93 FID) and ImageNet 64x64 (6.20/4.70 FID) when trained via CD, significantly outperforming Progressive Distillation (PD).
Standalone Performance: When trained via CT, CMs outperform other one-step non-adversarial methods (VAEs, Flows) and achieve results comparable to PD without needing distillation.
Zero-Shot Data Editing: CMs inherit the editing capabilities of diffusion models. Using variations of the multi-step sampling algorithm (Algorithm 4 in Appendix), they can perform:
- Inpainting: Mask unknown regions and iteratively refine using the CM.
- Colorization: Treat color channels as missing information in a transformed space (e.g., YUV or using an orthogonal basis).
- Super-Resolution: Treat high-frequency details as missing information in a transformed space (e.g., using patch averaging and orthogonal basis).
- Stroke-guided Editing (SDEdit): Use a stroke image as the starting point $x_{\tau_1}$ in multi-step sampling.
- Denoising: Apply $F_\theta(x_\sigma, \sigma)$ directly to an image $x_\sigma$ with noise level $\sigma$ .
- Interpolation: Interpolate between the initial noise vectors $z_1, z_2$ (e.g., using spherical linear interpolation) and then apply $F_\theta(\cdot, T)$ .
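
Of the applications above, interpolation is the simplest to sketch. The helper below is a minimal pure-Python spherical linear interpolation between two noise vectors; `slerp` is an illustrative name, and the fallback to linear interpolation for (anti)parallel vectors is an added safeguard, not something the text specifies. The interpolated vector would then be passed to $F_\theta(\cdot, T)$.

```python
import math

def slerp(z1, z2, alpha):
    """Spherical linear interpolation between vectors z1 and z2 for alpha in [0, 1]."""
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    cos_psi = max(-1.0, min(1.0, dot / (norm1 * norm2)))  # clamp against rounding
    psi = math.acos(cos_psi)                               # angle between the vectors
    if math.sin(psi) < 1e-8:
        # Nearly parallel or antiparallel: fall back to linear interpolation.
        return [(1 - alpha) * a + alpha * b for a, b in zip(z1, z2)]
    w1 = math.sin((1 - alpha) * psi) / math.sin(psi)
    w2 = math.sin(alpha * psi) / math.sin(psi)
    return [w1 * a + w2 * b for a, b in zip(z1, z2)]

z1, z2 = [80.0, 0.0], [0.0, 80.0]
midpoint = slerp(z1, z2, 0.5)
print(midpoint, math.hypot(*midpoint))
```

Unlike straight linear interpolation, this keeps the norm of the interpolated vector at the scale the model expects when the two endpoints have similar norms.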

Implementation Considerations

Architecture: Can reuse U-Net architectures from diffusion models (e.g., NCSN++, ADM).
Target Network: Using an EMA target network with stop_gradient is vital for both CD and CT.
Metric: LPIPS is highly recommended for image data.
Schedules (CT): Carefully designed adaptive schedules for $N$ and $\mu$ are important for CT performance.
Computational Cost: Training cost is comparable to training diffusion models. Inference is much faster (1 network evaluation for one-step, N evaluations for N-step).

Continuous-Time Extensions

The paper also derives continuous-time versions of the CD and CT losses (Appendix B), eliminating the need for discrete time steps $t_n$ . These objectives require calculating Jacobian-vector products, often necessitating forward-mode automatic differentiation, which might not be standard in all frameworks. Experimental results show they can work well, especially continuous-time CT, but may require careful initialization or variance reduction techniques.
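
For readers unfamiliar with Jacobian-vector products, the following is a minimal sketch of forward-mode differentiation using dual numbers. It is not the paper's implementation: `Dual`, `jvp`, and `toy_model` are hypothetical names, only addition and multiplication are supported, and a real system would rely on a framework's forward-mode facility instead.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """A value together with its directional derivative (tangent)."""
    value: float
    tangent: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__

def _lift(v):
    """Promote plain numbers to constants with zero tangent."""
    return v if isinstance(v, Dual) else Dual(float(v))

def jvp(fn, primals, tangents):
    """Return fn(*primals) and its derivative along the direction `tangents`."""
    out = fn(*[Dual(p, t) for p, t in zip(primals, tangents)])
    return out.value, out.tangent

def toy_model(x, t):
    """Hypothetical scalar stand-in for F_theta(x, t)."""
    return 0.5 * x * t + t * t + 3.0 * x

# Derivative of toy_model at (x, t) = (2, 3) along the direction (dx, dt) = (1, -1).
value, directional = jvp(toy_model, (2.0, 3.0), (1.0, -1.0))
print(value, directional)
```

The derivative is obtained in a single forward pass, without building the full Jacobian, which is why forward-mode differentiation suits the continuous-time objectives.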
