
Consistency Models Made Easy

(2406.14548)
Published Jun 20, 2024 in cs.LG and cs.CV

Abstract

Consistency models (CMs) are an emerging class of generative models that offer faster sampling than traditional diffusion models. CMs enforce that all points along a sampling trajectory are mapped to the same initial point. But this target leads to resource-intensive training: for example, as of 2024, training a SoTA CM on CIFAR-10 takes one week on 8 GPUs. In this work, we propose an alternative scheme for training CMs, vastly improving the efficiency of building such models. Specifically, by expressing CM trajectories via a particular differential equation, we argue that diffusion models can be viewed as a special case of CMs with a specific discretization. We can thus fine-tune a consistency model starting from a pre-trained diffusion model and progressively approximate the full consistency condition to stronger degrees over the training process. Our resulting method, which we term Easy Consistency Tuning (ECT), achieves vastly improved training times while indeed improving upon the quality of previous methods: for example, ECT achieves a 2-step FID of 2.73 on CIFAR-10 within 1 hour on a single A100 GPU, matching Consistency Distillation trained over hundreds of GPU hours. Owing to this computational efficiency, we investigate the scaling law of CMs under ECT, showing that they seem to obey classic power law scaling, hinting at their ability to improve efficiency and performance at larger scales. Code (https://github.com/locuslab/ect) is available.

ECT outperforms Consistency Distillation (CD) on CIFAR-10 without distillation or extra adversarial supervision.

Overview

  • The paper introduces Easy Consistency Tuning (ECT), a novel training strategy for consistency models (CMs) that enhances computational efficiency and generative performance.

  • ECT leverages a continuous-time training schedule that transitions from diffusion pretraining toward a progressively tighter consistency condition, significantly reducing training costs and improving sample quality.

  • Combined with dropout kept consistent across noise levels and adaptive loss weighting, the method achieves robust training dynamics and state-of-the-art results, exemplified by a 2-step FID of 2.73 on CIFAR-10 using a single A100 GPU.

An In-depth Analysis of "Consistency Models Made Easy"

The paper "Consistency Models Made Easy" by Zhengyang Geng et al. builds on the foundational concepts of diffusion models (DMs) and introduces a more computationally efficient approach to training consistency models (CMs). The authors propose a new training strategy termed Easy Consistency Tuning (ECT), which promises to significantly accelerate the training of CMs while achieving state-of-the-art generative performance.

Consistency models, like diffusion models, generate high-quality data samples. Traditional diffusion models operate by gradually transforming the data distribution into a prior distribution (e.g., Gaussian noise) via a stochastic differential equation (SDE), and sampling from them demands numerous model evaluations, often leading to high computational costs. Consistency models sidestep this by learning a function that maps any point on a sampling trajectory directly to the trajectory's initial (clean) point, so samples can be drawn in just one or two network evaluations. To mitigate the cost of diffusion sampling, researchers have also explored various fast samplers and distillation methods, albeit with trade-offs in sample quality.
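To make the sampling contrast concrete, the following PyTorch sketch shows one-step and two-step generation with a consistency function. It is a minimal illustration, not the paper's exact sampler: the function name `consistency_sample`, the noise levels `sigma_max` and `sigma_mid`, and the re-noising midpoint are all assumptions for demonstration.

```python
import torch

@torch.no_grad()
def consistency_sample(f, shape, sigma_max=80.0, sigma_mid=0.8, steps=2, device="cpu"):
    """Few-step sampling with a consistency function f(x, sigma) -> clean estimate.

    One step: map pure noise at sigma_max straight to data.
    Two steps: re-noise the one-step estimate to an intermediate level and
    apply f once more, trading one extra evaluation for higher quality.
    """
    x = sigma_max * torch.randn(shape, device=device)   # start from the prior
    x0 = f(x, sigma_max)                                 # 1-step estimate
    if steps == 2:
        x = x0 + sigma_mid * torch.randn_like(x0)        # partial re-noising
        x0 = f(x, sigma_mid)                             # refine with one more call
    return x0

if __name__ == "__main__":
    # Stand-in for a trained consistency network, only to make the sketch runnable.
    f = lambda x, sigma: x / (1.0 + sigma)
    samples = consistency_sample(f, (4, 3, 32, 32), steps=2)
    print(samples.shape)  # torch.Size([4, 3, 32, 32])
```

Compared with a diffusion sampler that may need dozens or hundreds of such network calls, the entire loop above uses at most two.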

Key Contributions

The paper makes several key contributions:

  1. Reformulating CM Training: The authors show how a differential consistency condition characterizes CM training. By expressing CM trajectories via a particular differential equation, they argue that diffusion models can be viewed as a special case of CMs under a specific, coarse discretization of this condition.
  2. Easy Consistency Tuning (ECT): ECT is proposed as a simplified, more efficient training scheme. The method uses a continuous-time schedule that transitions progressively from diffusion pretraining to a tighter consistency condition; this interpolation allows ECT to start from a pre-trained diffusion model, reducing the initial training cost (a minimal training-step sketch follows this list).
  3. Dropout and Adaptive Weighting: The paper emphasizes that keeping dropout consistent across noise levels helps balance gradient flows and significantly improves CM training dynamics. Adaptive weighting functions are further shown to reduce gradient variance and accelerate convergence.
  4. Scaling Laws and Practical Efficiency: Through extensive experiments, ECT is shown to follow classic power-law scaling with training compute, indicating robustness and adaptability to larger datasets and model scales. The authors provide a computationally efficient path to state-of-the-art performance, exemplified by a 2-step FID of 2.73 on CIFAR-10 within one hour on a single A100 GPU.
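The training-step sketch below illustrates the spirit of contribution 2: sample a noise level t, pick a second, smaller level r whose gap to t shrinks as training progresses (so early training resembles diffusion training and late training approaches the full consistency condition), and penalize the difference between the model's outputs at the two levels, with a stop-gradient on the lower-noise "teacher" side. This is a hedged PyTorch sketch, not the paper's exact recipe: the noise distribution, the gap schedule, and the weighting `w` are illustrative placeholders.

```python
import torch

def ect_style_loss(model, x0, train_progress):
    """Hedged sketch of a consistency-tuning loss.

    model(x, sigma) is assumed to predict the clean sample.
    train_progress in [0, 1] controls how tight the consistency condition is:
    at 0 the second point collapses to the data itself (diffusion-like
    training); as it grows, the two noise levels move closer together,
    approximating the differential consistency condition.
    """
    b = x0.shape[0]
    # Log-normal noise-level sampling, a common choice in diffusion-style training.
    t = torch.exp(torch.randn(b, 1, 1, 1, device=x0.device) * 1.2 - 1.1)
    # Shrink the gap between the two noise levels over training (placeholder schedule).
    gap = 1.0 - 0.99 * train_progress
    r = t * (1.0 - gap)                  # r = 0 early on, r -> t late in training

    noise = torch.randn_like(x0)
    x_t = x0 + t * noise                 # shared noise direction for both points
    x_r = x0 + r * noise

    pred_t = model(x_t, t)
    with torch.no_grad():                # stop-gradient "teacher" at the lower noise level
        pred_r = model(x_r, r)

    # Simple gap-dependent weighting (illustrative, not the paper's exact choice).
    w = 1.0 / (t - r).clamp(min=1e-4)
    return (w * (pred_t - pred_r) ** 2).mean()
```

The stop-gradient on the lower-noise prediction is the design choice that lets a single network serve as both student and teacher, which is what removes the need for a separate distillation target.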

Detailed Insights

Efficiency and Scalability

ECT's efficiency is evident in the sharp reduction of training and sampling costs without compromising sample quality. The paper reports a considerable decrease in training FLOPs, with ECT achieving notable results on the CIFAR-10 and ImageNet 64×64 benchmarks. Because the method fine-tunes a pre-trained diffusion model, it improves sample quality while using significantly fewer computational resources than iCT and other previous methods.
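Since the consistency model shares its architecture with the diffusion model it is tuned from, the "fine-tuning" amounts to loading pretrained denoiser weights and continuing training with the consistency loss. The sketch below reuses the `ect_style_loss` function from the previous sketch; `TinyDenoiser`, the commented-out checkpoint name, and the random placeholder batches are hypothetical stand-ins for the real pretrained U-Net and CIFAR-10 data.

```python
import torch
import torch.nn as nn

# Tiny stand-in backbone; in practice this would be the pretrained diffusion
# U-Net whose weights are loaded before consistency tuning.
class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, x, sigma):
        return self.net(x)  # ignores sigma for brevity; a real model conditions on it

model = TinyDenoiser()
# Hypothetical checkpoint name; warm-starting from diffusion pretraining is
# what makes the tuning cheap.
# model.load_state_dict(torch.load("pretrained_diffusion.pt"))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
total_steps = 100
for step in range(total_steps):
    x0 = torch.randn(8, 3, 32, 32)   # placeholder batch; use real CIFAR-10 images in practice
    loss = ect_style_loss(model, x0, train_progress=step / total_steps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```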

Theoretical Implications

The differential consistency condition provides a robust theoretical framework that redefines how consistency models are trained. The method not only simplifies the understanding of CM training but also offers a practical path to leveraging preexisting diffusion models. This reconceptualization aligns with dynamical systems theory and bridges the gap between diffusion modeling and consistency training.
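Concretely, the differential view can be written as follows: if f_theta(x_t, t) denotes the consistency function and x_t evolves along the probability-flow ODE of the diffusion process, the consistency condition requires f_theta to be constant along each trajectory. The notation below follows the standard consistency-model setup and is a reconstruction rather than a quotation from the paper.

```latex
% Consistency condition: every point on a probability-flow ODE trajectory
% maps to the same endpoint.
f_\theta(x_t, t) = f_\theta(x_r, r)
  \quad \text{for all } t, r \text{ on the same trajectory}.

% Differential form: the total derivative along the trajectory vanishes.
\frac{\mathrm{d}}{\mathrm{d}t} f_\theta(x_t, t)
  = \frac{\partial f_\theta}{\partial t}
  + \frac{\partial f_\theta}{\partial x_t} \cdot \frac{\mathrm{d}x_t}{\mathrm{d}t}
  = 0.

% Discretizing d/dt with a finite gap \Delta t = t - r yields a family of
% training objectives: taking r = 0 (comparing against the data endpoint)
% corresponds to diffusion-style training, while \Delta t \to 0 enforces
% the full consistency condition.
```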

Practical Relevance

By streamlining the training of generative models, the paper lays the groundwork for practical applications where computational resources are constrained. For instance, creative industries, where generating high-quality visual content quickly is paramount, can benefit immensely from the efficiencies introduced by ECT. Furthermore, the scaling behavior reported by the authors suggests potential for wider adoption in domains requiring large-scale data generation.

Future Directions

Potential areas of exploration suggested by the findings include:

  • Parameter Efficient Fine-Tuning (PEFT): Given ECT's efficiency, implementing PEFT techniques could further reduce computational demands while maintaining generative quality.
  • Consistency Tuning on Different Data: Investigating how tuning consistency models on data distinct from the pretraining data affects generalization merits further research.
  • Cross-domain Applications: Examining the adaptability of ECT in domains beyond image generation, such as video or 3D object synthesis, could expand its applicability.

Conclusion

Consistency Models Made Easy presents a pivotal advancement in the efficient training of consistency models. The introduction of ECT provides a streamlined approach that reduces computational overheads and harnesses the strengths of both diffusion models and consistency training. Theoretical rigor combined with empirical validation positions this work as a significant stride towards more practical and scalable generative modeling solutions.
