Phased Consistency Model

(arXiv 2405.18407)
Published May 28, 2024 in cs.LG and cs.CV

Abstract

The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (a.k.a. LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1–16 step generation settings. While PCM is specifically designed for multi-step refinement, it achieves 1-step generation results superior or comparable to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train a state-of-the-art few-step text-to-video generator. More details are available at https://g-u-n.github.io/projects/pcm/.

PCM vs. AnimateLCM video generation under 1 to 4 inference steps.

Overview

  • Phased Consistency Model (PCM) introduces innovations to address limitations in latent consistency models (LCMs) for image and video generation, focusing on consistency, controllability, and efficiency.

  • PCM phases the ODE trajectory into multiple sub-trajectories, employs an adaptable adversarial loss function, and enhances training and sampling efficiency for improved image and video synthesis.

  • Experimental validation shows PCM's superiority in both one-step and multi-step generation settings for images and videos, positioning it as a state-of-the-art method in generative models.

Phased Consistency Model for Efficient Image and Video Generation

Introduction

Diffusion models have recently become the standard for generative image synthesis, producing high-quality and diverse images. Despite their efficacy, their iterative nature makes the generative process resource-intensive and time-consuming. Consistency models (CMs) have been proposed to mitigate this by reducing the number of iterative steps required for sample generation. However, their application to high-resolution, text-conditioned image generation in the latent space, via latent consistency models (LCMs), still has notable drawbacks.

This paper addresses the limitations of LCMs and introduces the Phased Consistency Model (PCM) as a solution. PCM generalizes the design space of consistency models and rectifies the identified shortcomings, demonstrating superior performance across 1–16 step generation settings. Although PCM is designed for multi-step refinement, it also achieves 1-step generation results superior or comparable to those of state-of-the-art methods designed explicitly for 1-step generation. PCM's methodology further extends to video generation, establishing a state-of-the-art few-step text-to-video generator.

Key Findings and Contributions

Limitations of LCMs

The paper identifies and elaborates on three primary flaws of LCMs:

  1. Consistency: Because LCMs rely on a purely stochastic multi-step sampling algorithm, the same seed produces visibly different results at different numbers of inference steps (see the sampler sketch after this list). The discrepancy is most pronounced when comparing few-step and many-step settings.
  2. Controllability: LCMs offer reduced controllability in image generation. They accept only a narrow range of classifier-free guidance (CFG) scales in low-step settings and respond poorly to negative prompts, for example still generating a black dog when the negative prompt explicitly excludes one.
  3. Efficiency: LCMs produce markedly inferior results when constrained to fewer than four inference steps. The conventional L2 or Huber loss used in LCM training provides supervision that is too coarse for the large per-step denoising required in low-step regimes, limiting sampling efficiency.
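
For intuition, the sketch below (hypothetical names, not the paper's code) illustrates the purely stochastic multi-step consistency sampler that flaw 1 refers to, in an EDM-style parameterization where a sample at noise level sigma is x_0 + sigma * noise. Because fresh noise is injected after every prediction, the trajectory changes with the number of inference steps even under a fixed seed.

```python
import torch

def cm_multistep_sample(f, x_T, sigmas):
    """Purely stochastic multi-step consistency sampling (illustrative).

    f(x, sigma) maps a noisy sample at noise level sigma to a predicted
    clean sample. After each prediction, fresh noise is injected to jump
    to the next (lower) noise level, so the output depends on the number
    of steps taken even when the random seed is fixed.
    """
    x = x_T
    for i, sigma in enumerate(sigmas):
        x0 = f(x, sigma)                     # predict the clean sample
        if i < len(sigmas) - 1:
            noise = torch.randn_like(x0)     # fresh noise at every step
            x = x0 + sigmas[i + 1] * noise   # re-noise to the next level
        else:
            x = x0
    return x
```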

The Phased Consistency Model (PCM)

PCM addresses these limitations through the following innovations:

  1. Sub-Trajectory Consistency: Instead of mapping all points along the entire ODE trajectory to a single solution, PCM phases the trajectory into multiple sub-trajectories and enforces self-consistency within each one independently, reducing error accumulation and enabling deterministic sampling (a training and sampling sketch follows this list).
  2. Adversarial Loss in Latent Space: PCM introduces an adversarial loss in the latent space that provides finer-grained supervision, improving sample quality in few-step settings.
  3. Efficient Training and Sampling: Because PCM's consistency distillation stage does not bake classifier-free guidance into the teacher's ODE solver, the distilled model supports larger CFG values and responds better to negative prompts, while training remains simple and efficient.
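
Below is a minimal sketch of innovations 1 and 3, under simplifying assumptions (equal-width phases, a single solver step, hypothetical function names such as f_student and ode_step; not the authors' code): training maps each noisy sample only to the low-noise edge of its own sub-trajectory, and sampling walks deterministically from edge to edge without re-injecting noise.

```python
import torch
import torch.nn.functional as F

def phase_edges(T, num_phases):
    """Illustrative: split [0, T] into equal-width phases. edges[m] is the
    low-noise edge s_m of phase m; PCM's actual edge placement may differ."""
    return torch.linspace(0.0, T, num_phases + 1)

def phased_consistency_loss(f_student, f_ema, ode_step, x_t, t, edges):
    """One sub-trajectory consistency distillation step (simplified sketch).

    f(x, t, s) predicts the ODE solution at time s from the sample x at
    time t. Each point is mapped only to the low-noise edge s_m of its
    own phase, never all the way to t = 0.
    """
    # Locate the phase containing each t and its low-noise edge s_m.
    m = (torch.searchsorted(edges, t, right=True) - 1).clamp(0, len(edges) - 2)
    s_m = edges[m]
    # One step of the frozen teacher's ODE solver toward s_m.
    t_prev = torch.maximum(t - (t - s_m) / 8.0, s_m)  # illustrative step size
    x_prev = ode_step(x_t, t, t_prev)
    # Self-consistency within the phase: both points map to the same x_{s_m}.
    pred = f_student(x_t, t, s_m)
    with torch.no_grad():
        target = f_ema(x_prev, t_prev, s_m)           # EMA target network
    # PCM additionally adds an adversarial loss in latent space (item 2);
    # only the plain distance term is shown here.
    return F.mse_loss(pred, target)

def phased_sample(f, x_T, edges):
    """Deterministic multi-step sampling: walk down the phase edges,
    mapping each phase's input to its low-noise edge without re-noising,
    so the result is identical for a given x_T at any step budget."""
    x = x_T
    for m in reversed(range(len(edges) - 1)):
        x = f(x, edges[m + 1], edges[m])  # jump from edge s_{m+1} to s_m
    return x
```

Because no noise is re-injected between phases, increasing the number of phases refines the same deterministic trajectory rather than resampling it, which is what restores consistency across step counts.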

Experimental Validation

PCM's effectiveness was validated on widely used image generation benchmarks (COCO, CC12M) using Stable Diffusion v1.5 and SDXL backbones. Its applicability to video generation was additionally demonstrated against AnimateLCM.

Image Generation

  • One-Step Generation: PCM significantly outperformed LCM and CTM (the consistency trajectory model) in one-step generation, achieving results comparable or superior to GAN-based and other state-of-the-art one-step methods.
  • Multi-Step Generation: PCM showed strong performance across multiple steps, exhibiting robust consistency and control. Notably, the improvement in results was more pronounced with an increased number of steps, highlighting PCM's superior multi-step refinement capability.

Video Generation

PCM was also tested for text-to-video generation, demonstrating consistently superior performance across metrics such as CLIP score, flow magnitude, and CLIP consistency. PCM effectively supported few-step video generation, establishing its versatility beyond static image generation.

Implications and Future Work

The results indicate that PCM not only extends the capabilities of consistency models but also opens new avenues for efficient high-resolution image and video generation. The phased consistency approach can be further explored and optimized for various generative tasks, potentially expanding to applications such as real-time video synthesis and interactive AI-driven media creation.

Future research could explore more sophisticated implementations of adversarial consistency losses and further refinements in sub-trajectory handling. The versatility shown by PCM in extending to video generation suggests that this methodology could be adapted and tested in even broader domains of conditional generative models.

Conclusion

This paper introduces an effective and efficient model for high-resolution, text-conditioned image and video generation. By addressing and rectifying the key limitations of latent consistency models through PCM, the research sets a new benchmark in the field, with strong performance across multiple settings and tasks. PCM's methodological advances yield significant improvements in generative-model efficiency and output quality, and point to promising directions for future exploration and application.
