Phased Consistency Model

(arXiv 2405.18407)
Published May 28, 2024 in cs.LG and cs.CV

Abstract

The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (a.k.a. LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1–16 step generation settings. While PCM is specifically designed for multi-step refinement, it achieves 1-step generation results superior or comparable to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train a state-of-the-art few-step text-to-video generator. More details are available at https://g-u-n.github.io/projects/pcm/.

PCM vs. AnimateLCM video generation under 1 to 4 inference steps.

Overview

  • Phased Consistency Model (PCM) introduces innovations to address limitations in latent consistency models (LCMs) for image and video generation, focusing on consistency, controllability, and efficiency.

  • PCM phases the ODE trajectory into multiple sub-trajectories, employs an adaptable adversarial loss function, and enhances training and sampling efficiency for improved image and video synthesis.

  • Experimental validation shows PCM's superiority in both one-step and multi-step generation settings for images and videos, positioning it as a state-of-the-art method in generative models.

Phased Consistency Model for Efficient Image and Video Generation

Introduction

Diffusion models have recently become the standard for generative image synthesis, producing high-quality and diverse images. Despite their efficacy, their iterative nature makes the generative process resource-intensive and time-consuming. Consistency models (CMs) have been proposed to mitigate this by reducing the number of iterative steps required for sample generation. However, their application to high-resolution, text-conditioned image generation in the latent space, via latent consistency models (LCMs), still has notable drawbacks.

This paper addresses the limitations of LCMs and introduces the Phased Consistency Model (PCM) as a solution. PCM generalizes the design space of consistency models and rectifies the identified shortcomings, demonstrating superior performance across 1–16 step generation settings. Although PCM is designed for multi-step refinement, it also achieves 1-step generation results superior or comparable to those of state-of-the-art methods designed explicitly for 1-step generation. PCM's methodology further extends to video generation, establishing a state-of-the-art few-step text-to-video generator.

Key Findings and Contributions

Limitations of LCMs

The paper identifies and elaborates on three primary flaws of LCMs:

  1. Consistency: Because LCMs rely on a purely stochastic multi-step sampling algorithm, the same seed produces visibly different results at different numbers of inference steps (see the sampler sketch after this list). The discrepancy is most pronounced when comparing few-step and many-step settings.
  2. Controllability: LCMs offer reduced controllability in image generation. They accept only a narrow range of classifier-free guidance (CFG) scales in low-step settings and respond poorly to negative prompts, for example still generating a black dog when the negative prompt explicitly excludes one.
  3. Efficiency: LCMs produce markedly inferior results when constrained to fewer than four inference steps. The conventional L2 or Huber loss used in LCM training provides supervision that is too coarse for the large per-step denoising required in low-step regimes, limiting sampling efficiency.
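
For intuition, the sketch below (hypothetical names, not the paper's code) illustrates the purely stochastic multi-step consistency sampler that flaw 1 refers to, in an EDM-style parameterization where a sample at noise level sigma is x_0 + sigma * noise. Because fresh noise is injected after every prediction, the trajectory changes with the number of inference steps even under a fixed seed.

```python
import torch

def cm_multistep_sample(f, x_T, sigmas):
    """Purely stochastic multi-step consistency sampling (illustrative).

    f(x, sigma) maps a noisy sample at noise level sigma to a predicted
    clean sample. After each prediction, fresh noise is injected to jump
    to the next (lower) noise level, so the output depends on the number
    of steps taken even when the random seed is fixed.
    """
    x = x_T
    for i, sigma in enumerate(sigmas):
        x0 = f(x, sigma)                     # predict the clean sample
        if i < len(sigmas) - 1:
            noise = torch.randn_like(x0)     # fresh noise at every step
            x = x0 + sigmas[i + 1] * noise   # re-noise to the next level
        else:
            x = x0
    return x
```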

The Phased Consistency Model (PCM)

PCM addresses these limitations through the following innovations:

  1. Sub-Trajectory Consistency: Instead of mapping all points along the entire ODE trajectory to a single solution, PCM phases the trajectory into multiple sub-trajectories and enforces self-consistency within each one independently, reducing error accumulation and enabling deterministic sampling (a training and sampling sketch follows this list).
  2. Adversarial Loss in Latent Space: PCM introduces an adversarial loss in the latent space that provides finer-grained supervision, improving sample quality in few-step settings.
  3. Efficient Training and Sampling: Because PCM's consistency distillation stage does not bake classifier-free guidance into the teacher's ODE solver, the distilled model supports larger CFG values and responds better to negative prompts, while training remains simple and efficient.
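
Below is a minimal sketch of innovations 1 and 3, under simplifying assumptions (equal-width phases, a single solver step, hypothetical function names such as f_student and ode_step; not the authors' code): training maps each noisy sample only to the low-noise edge of its own sub-trajectory, and sampling walks deterministically from edge to edge without re-injecting noise.

```python
import torch
import torch.nn.functional as F

def phase_edges(T, num_phases):
    """Illustrative: split [0, T] into equal-width phases. edges[m] is the
    low-noise edge s_m of phase m; PCM's actual edge placement may differ."""
    return torch.linspace(0.0, T, num_phases + 1)

def phased_consistency_loss(f_student, f_ema, ode_step, x_t, t, edges):
    """One sub-trajectory consistency distillation step (simplified sketch).

    f(x, t, s) predicts the ODE solution at time s from the sample x at
    time t. Each point is mapped only to the low-noise edge s_m of its
    own phase, never all the way to t = 0.
    """
    # Locate the phase containing each t and its low-noise edge s_m.
    m = (torch.searchsorted(edges, t, right=True) - 1).clamp(0, len(edges) - 2)
    s_m = edges[m]
    # One step of the frozen teacher's ODE solver toward s_m.
    t_prev = torch.maximum(t - (t - s_m) / 8.0, s_m)  # illustrative step size
    x_prev = ode_step(x_t, t, t_prev)
    # Self-consistency within the phase: both points map to the same x_{s_m}.
    pred = f_student(x_t, t, s_m)
    with torch.no_grad():
        target = f_ema(x_prev, t_prev, s_m)           # EMA target network
    # PCM additionally adds an adversarial loss in latent space (item 2);
    # only the plain distance term is shown here.
    return F.mse_loss(pred, target)

def phased_sample(f, x_T, edges):
    """Deterministic multi-step sampling: walk down the phase edges,
    mapping each phase's input to its low-noise edge without re-noising,
    so the result is identical for a given x_T at any step budget."""
    x = x_T
    for m in reversed(range(len(edges) - 1)):
        x = f(x, edges[m + 1], edges[m])  # jump from edge s_{m+1} to s_m
    return x
```

Because no noise is re-injected between phases, increasing the number of phases refines the same deterministic trajectory rather than resampling it, which is what restores consistency across step counts.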

Experimental Validation

PCM's effectiveness was validated on widely used image generation benchmarks (COCO, CC12M) using Stable Diffusion v1.5 and SDXL backbones. Its applicability to video generation was additionally demonstrated against AnimateLCM.

Image Generation

  • One-Step Generation: PCM significantly outperformed LCM and CTM (the consistency trajectory model) in one-step generation, achieving results comparable or superior to GAN-based and other state-of-the-art one-step methods.
  • Multi-Step Generation: PCM showed strong performance across multiple steps, exhibiting robust consistency and control. Notably, the improvement in results was more pronounced with an increased number of steps, highlighting PCM's superior multi-step refinement capability.

Video Generation

PCM was also tested for text-to-video generation, demonstrating consistently superior performance across metrics such as CLIP score, flow magnitude, and CLIP consistency. PCM effectively supported few-step video generation, establishing its versatility beyond static image generation.

Implications and Future Work

The results indicate that PCM not only extends the capabilities of consistency models but also opens new avenues for efficient high-resolution image and video generation. The phased consistency approach can be further explored and optimized for various generative tasks, potentially expanding to applications such as real-time video synthesis and interactive AI-driven media creation.

Future research could explore more sophisticated implementations of adversarial consistency losses and further refinements in sub-trajectory handling. The versatility shown by PCM in extending to video generation suggests that this methodology could be adapted and tested in even broader domains of conditional generative models.

Conclusion

This paper introduces an effective and efficient model for high-resolution, text-conditioned image and video generation. By addressing and rectifying the key limitations of latent consistency models through PCM, the research sets a new benchmark in the field, with strong performance across multiple settings and tasks. PCM's methodological advances yield significant improvements in generative-model efficiency and output quality, and point to promising directions for future exploration and application.
