MLCM: Multistep Consistency Distillation of Latent Diffusion Model

(arXiv:2406.05768)
Published Jun 9, 2024 in cs.CV and cs.AI

Abstract

Distilling large latent diffusion models (LDMs) into ones that are fast to sample from is attracting growing research interest. However, the majority of existing methods face a dilemma where they either (i) depend on multiple individual distilled models for different sampling budgets, or (ii) sacrifice generation quality with limited (e.g., 2-4) and/or moderate (e.g., 5-8) sampling steps. To address these, we extend the recent multistep consistency distillation (MCD) strategy to representative LDMs, establishing the Multistep Latent Consistency Models (MLCMs) approach for low-cost high-quality image synthesis. MLCM serves as a unified model for various sampling steps due to the promise of MCD. We further augment MCD with a progressive training strategy to strengthen inter-segment consistency to boost the quality of few-step generations. We take the states from the sampling trajectories of the teacher model as training data for MLCMs to lift the requirements for high-quality training datasets and to bridge the gap between the training and inference of the distilled model. MLCM is compatible with preference learning strategies for further improvement of visual quality and aesthetic appeal. Empirically, MLCM can generate high-quality, delightful images with only 2-8 sampling steps. On the MSCOCO-2017 5K benchmark, MLCM distilled from SDXL gets a CLIP Score of 33.30, Aesthetic Score of 6.19, and Image Reward of 1.20 with only 4 steps, substantially surpassing 4-step LCM [23], 8-step SDXL-Lightning [17], and 8-step HyperSD [33]. We also demonstrate the versatility of MLCMs in applications including controllable generation, image style transfer, and Chinese-to-image generation.

Figure: MLCM applied to image style transfer, with reference styles shown at the top; two-step sampling produces highly stylized images.

Overview

  • The paper introduces the Multistep Latent Consistency Model (MLCM) to distill large latent diffusion models into more efficient versions while preserving high-quality image synthesis.

  • It presents a progressive training strategy and leverages teacher model states to enhance model performance and reduce dependency on high-quality training datasets.

  • Empirical evaluations show that the MLCM outperforms several baselines in terms of CLIP score, aesthetic score, and image reward, illustrating its practical effectiveness.

An Expert Review of "MLCM: Multistep Consistency Distillation of Latent Diffusion Model"

The paper "MLCM: Multistep Consistency Distillation of Latent Diffusion Model" introduces a novel approach to distilling large latent diffusion models (LDMs) into more efficient models while maintaining high-quality image synthesis. In essence, the authors propose the Multistep Latent Consistency Model (MLCM) approach, underpinned by Multistep Consistency Distillation (MCD). This method addresses significant challenges faced by existing methods, such as dependency on multiple models for different sampling budgets and quality degradation with limited sampling steps.

Core Contributions

  1. Multistep Latent Consistency Distillation (MLCD): The paper extends MCD to representative LDMs, creating a unified model (MLCM) for various sampling steps by enforcing consistency within segmental partitions of the latent-space ODE trajectory (a training-step sketch follows this list).
  2. Progressive Training Strategy: To enhance inter-segment consistency, the paper introduces a progressive training strategy, significantly boosting the quality of few-step generations.
  3. Leveraging Teacher Model States: The authors leverage states from the teacher model's sampling trajectory, reducing the need for high-quality training datasets and aligning the training and inference phases.
  4. Human Preference Compatibility: The proposed method seamlessly integrates preference learning strategies to improve visual quality and aesthetic appeal.
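
To ground contribution 1, below is a minimal, self-contained PyTorch sketch of one multistep consistency distillation training step: a timestep is drawn inside a random trajectory segment, the frozen teacher takes a single ODE (DDIM) step, and the online student is trained to match an EMA target at the segment's lower boundary. The tiny MLP denoisers, toy noise schedule, and x0-prediction interface are stand-ins for illustration, not the paper's SDXL-scale setup.

```python
import torch
import torch.nn.functional as F

class TinyDenoiser(torch.nn.Module):
    """Stand-in latent denoiser: maps (x_t, t) to a clean-latent estimate."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 64), torch.nn.SiLU(), torch.nn.Linear(64, dim))

    def forward(self, x_t, t):
        t_emb = t.float().view(-1, 1) / 1000.0            # crude timestep embedding
        return self.net(torch.cat([x_t, t_emb], dim=-1))

def ddim_step(x_t, x0_pred, t_from, t_to, alphas):
    """Deterministic DDIM update from t_from to t_to given an x0 prediction."""
    a_f = alphas[t_from].view(-1, 1)
    a_t = alphas[t_to].view(-1, 1)
    eps = (x_t - a_f.sqrt() * x0_pred) / (1 - a_f).clamp(min=1e-8).sqrt()
    return a_t.sqrt() * x0_pred + (1 - a_t).sqrt() * eps

T, K, dim = 1000, 4, 16                                   # K = number of segments
bounds = torch.linspace(0, T - 1, K + 1).long()           # segment boundaries
alphas = torch.cos(torch.linspace(0, 1, T) * torch.pi / 2) ** 2   # toy schedule

student, teacher, ema = TinyDenoiser(dim), TinyDenoiser(dim), TinyDenoiser(dim)
ema.load_state_dict(student.state_dict())
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

for _ in range(10):                                       # toy training loop
    # Stand-in clean latents; the paper instead uses states harvested from
    # the teacher's sampling trajectories (contribution 3).
    x0 = torch.randn(8, dim)
    seg = torch.randint(0, K, (8,))
    s = bounds[seg]                                       # lower segment boundary
    t = s + 1 + torch.randint(0, 200, (8,))               # timestep inside segment
    a = alphas[t].view(-1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)   # diffused latent

    with torch.no_grad():
        # One teacher ODE step t -> t-1, then the EMA student jumps the rest
        # of the way to the boundary s: this is the consistency target.
        x_prev = ddim_step(x_t, teacher(x_t, t), t, t - 1, alphas)
        target = ddim_step(x_prev, ema(x_prev, t - 1), t - 1, s, alphas)

    # The online student jumps from t directly to s; enforce consistency.
    pred = ddim_step(x_t, student(x_t, t), t, s, alphas)
    loss = F.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                                 # EMA weight update
        for p, q in zip(student.parameters(), ema.parameters()):
            q.mul_(0.95).add_(p, alpha=0.05)
```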

Empirical Evaluation

The authors conducted a comprehensive evaluation on the MSCOCO-2017 5K benchmark, showcasing substantial improvements over existing methods. The key results for the 4-step MLCM distilled from SDXL are:

  • CLIP Score: 33.30
  • Aesthetic Score: 6.19
  • Image Reward: 1.20

These improvements are significant: the 4-step MLCM surpasses strong baselines such as 4-step LCM, 8-step SDXL-Lightning, and 8-step HyperSD, while a single distilled network serves every step budget (see the sampling sketch below).
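
Because one MLCM covers every sampling budget, inference reduces to the standard multistep consistency sampling loop: predict the clean latent in a single jump, then re-noise to the next boundary until the final step. The sketch below illustrates this loop; the `model(x_t, t)` x0-prediction interface and the uniform boundary schedule are assumptions for illustration, not the released API.

```python
import torch

@torch.no_grad()
def mlcm_sample(model, alphas, shape, num_steps=4, T=1000):
    """Multistep consistency sampling: the same unified model serves any
    step budget (e.g., 2-8); only the boundary schedule changes."""
    ts = torch.linspace(T - 1, 0, num_steps + 1).long()
    x = torch.randn(shape)                         # start from pure noise
    for i in range(num_steps):
        t = ts[i].expand(shape[0])
        x0 = model(x, t)                           # one-shot jump to a clean latent
        if i < num_steps - 1:
            a = alphas[ts[i + 1]]                  # re-noise to the next boundary
            x = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)
        else:
            x = x0                                 # last step returns the sample
    return x

# The same network can be queried at any budget:
# fast = mlcm_sample(student, alphas, (4, 16), num_steps=2)
# best = mlcm_sample(student, alphas, (4, 16), num_steps=8)
```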

Strong Numerical Results and Bold Claims

The paper makes several quantitative claims that challenge established baselines:

  • A single distilled MLCM generates high-quality images in 2-4 steps, surpassing methods that require 8 steps.
  • Progressive MLCD reduces inconsistency across trajectory segments, further enhancing generation quality (a schedule sketch follows this list).
  • Employing a better teacher model (e.g., PVXL) for trajectory estimation markedly improves MLCM's performance.
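
One plausible reading of the progressive strategy, sketched under the assumption of a fine-to-coarse stage schedule (the paper's exact schedule may differ): train with many short segments first, then re-segment more coarsely and continue from the same weights, so consistency learned within fine segments is stitched across the merged ones.

```python
import torch

def train_stage(bounds, iters):
    """Hypothetical stand-in for the distillation loop sketched earlier,
    run to convergence with a fixed set of segment boundaries."""
    print(f"training with boundaries {bounds.tolist()} for {iters} iterations")

T = 1000
# Assumed fine-to-coarse progression; each stage continues from the previous
# stage's weights rather than training from scratch.
for num_segments in (8, 4, 2):
    bounds = torch.linspace(0, T - 1, num_segments + 1).long()
    train_stage(bounds, iters=10_000)
```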

Methodological Advancements

The research presented combines theoretical and practical innovations, reinforcing its findings through methodical experimentation.

  1. Segmentation of ODE Trajectory: By dividing the latent-space ODE trajectory into multiple segments, MLCM maintains high fidelity over fewer steps, mitigating error accumulation.
  2. Transition from Teacher to Student: Training the student on intermediate states from the teacher's denoising trajectory aligns training with inference-time inputs and removes the need for a curated image dataset (see the sketch after this list).
  3. Human Preference Integration: The inclusion of reward consistency and feedback learning ensures that MLCM outputs are not only technically proficient but also align with human aesthetic preferences.
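
Points 1 and 2 above amount to harvesting training pairs from the teacher's own sampling trajectories rather than from a curated image dataset. A minimal sketch of that harvesting step follows; the `teacher(x, t)` x0-prediction interface, latent dimensionality, and solver step count are illustrative assumptions.

```python
import torch

@torch.no_grad()
def collect_teacher_states(teacher, alphas, batch=8, dim=16, T=1000, solver_steps=50):
    """Roll out the teacher's deterministic (DDIM) trajectory from pure noise
    and record intermediate (latent, timestep) pairs for student training."""
    ts = torch.linspace(T - 1, 0, solver_steps + 1).long()
    x = torch.randn(batch, dim)
    states = [(x.clone(), int(ts[0]))]
    for i in range(solver_steps):
        t_f, t_t = ts[i], ts[i + 1]
        x0 = teacher(x, t_f.expand(batch))         # assumed x0-prediction API
        a_f, a_t = alphas[t_f], alphas[t_t]
        eps = (x - a_f.sqrt() * x0) / (1 - a_f).clamp(min=1e-8).sqrt()
        x = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps
        states.append((x.clone(), int(t_t)))
    # These pairs match what the student encounters at inference, closing the
    # train/inference gap noted in the abstract.
    return states
```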

Practical Implications and Future Developments

From a practical standpoint, MLCM holds the potential to enhance various applications, including controllable generation, image stylization, and Chinese-to-image generation. Given the illustrated versatility, future research could explore extending MLCM for video generation and other high-dimensional applications. Moreover, refining the one-step generation capabilities while preserving quality remains a promising avenue for subsequent investigations.

Conclusion

The paper provides a robust framework for accelerating LDMs via multistep consistency distillation, successfully addressing existing limitations. The empirical results, combined with methodological rigor, position MLCM as a notable contribution to the diffusion model landscape. Future work will likely benefit from the principles established in this study, advancing both theoretical understanding and practical implementations in AI-driven image synthesis.
