Abstract

Consistency Models (CMs) have shown promise in creating visual content efficiently and with high quality. However, how to add new conditional controls to pretrained CMs has not been explored. In this technical report, we consider alternative strategies for adding ControlNet-like conditional control to CMs and present three significant findings. 1) A ControlNet trained for diffusion models (DMs) can be directly applied to CMs for high-level semantic control, but it struggles with low-level detail and realism control. 2) CMs serve as an independent class of generative models, on which a ControlNet can be trained from scratch using the Consistency Training proposed by Song et al. 3) A lightweight adapter can be jointly optimized under multiple conditions through Consistency Training, allowing for the swift transfer of DM-based ControlNets to CMs. We study these three solutions across various conditional controls, including edge, depth, human pose, low-resolution image, and masked image, with text-to-image latent consistency models.

Overview

  • Conditional controls are explored as a way to enhance Consistency Models (CMs) for generating high-quality images.

  • Three strategies for integrating ControlNet into CMs are introduced.

  • Evaluation of these strategies under different visual conditions reveals varied levels of success.

  • Adapting DM-based ControlNets to CMs with a specialized adapter improves conditional image generation.

  • Tailored training strategies for CMs indicate a promising direction for advanced image generation.

Introduction

Consistency Models (CMs) are gaining attention for their ability to generate high-quality images efficiently. Despite these advances, integrating new conditional controls into pretrained CMs remains an uncharted area. Aiming to enhance CMs with ControlNet, a system originally developed for diffusion models (DMs), this technical report explores three strategies and evaluates their effectiveness across different visual conditions.

Methodology

The approach begins by establishing a baseline text-to-image CM, obtained either through consistency distillation from a DM or by training directly on data. In the first strategy, an existing ControlNet optimized for DMs is applied to the CM as-is. The second strategy trains a ControlNet from scratch specifically for the CM using consistency training. The third strategy introduces a lightweight, unified adapter that transplants multiple DM-based ControlNets into the CM.
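As an illustration of the second strategy, the sketch below shows a simplified consistency-training step with a ControlNet-like control branch. The modules, noise schedule, and loss are toy stand-ins, not the report's actual architecture, and the control residual is injected in the simplest possible way.

    # Minimal sketch of consistency training with a control branch (illustrative only).
    import copy
    import torch
    import torch.nn as nn

    class ControlBranch(nn.Module):
        """Toy ControlNet-like branch: encodes the condition image into a residual."""
        def __init__(self, channels=3, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(channels, hidden, 3, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, channels, 3, padding=1),
            )
        def forward(self, cond):
            return self.net(cond)

    class ConsistencyModel(nn.Module):
        """Toy consistency function f(x_t, t): maps a noisy sample back toward x_0."""
        def __init__(self, channels=3, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(channels + 1, hidden, 3, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, channels, 3, padding=1),
            )
        def forward(self, x_t, t, control=None):
            if control is not None:          # inject the control residual
                x_t = x_t + control
            t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
            return self.net(torch.cat([x_t, t_map], dim=1))

    def consistency_training_step(model, ema_model, control, x0, cond, sigmas):
        """One Consistency Training step: outputs at two adjacent noise levels
        (sharing the same Gaussian noise) are pulled together; only the online
        model and the control branch receive gradients."""
        n = torch.randint(0, len(sigmas) - 1, (x0.shape[0],))
        t_hi, t_lo = sigmas[n + 1], sigmas[n]
        noise = torch.randn_like(x0)
        x_hi = x0 + t_hi.view(-1, 1, 1, 1) * noise
        x_lo = x0 + t_lo.view(-1, 1, 1, 1) * noise
        ctrl = control(cond)
        pred = model(x_hi, t_hi, ctrl)
        with torch.no_grad():                # EMA target, no gradient
            target = ema_model(x_lo, t_lo, ctrl)
        return torch.mean((pred - target) ** 2)

    model = ConsistencyModel()
    ema_model = copy.deepcopy(model).requires_grad_(False)
    control = ControlBranch()
    sigmas = torch.linspace(0.002, 80.0, 32)          # toy noise schedule
    x0, cond = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
    loss = consistency_training_step(model, ema_model, control, x0, cond, sigmas)
    loss.backward()

In practice the consistency model operates in the latent space of a text-to-image model and the target network is an exponential moving average of the online network; the fixed copy above is only a placeholder for that mechanism.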

Experimental Setup

The strategies were assessed on a variety of visual conditions: edge, depth, human pose, low-resolution image, and masked image, each obtained with its respective extraction or detection technique. The experiments required substantial GPU computation to train the foundational CM, the ControlNets, and the unified adapter.
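For concreteness, the snippet below shows one way the condition images listed above could be prepared. The Canny thresholds, downsampling factor, and masking scheme are illustrative assumptions rather than the report's exact pipeline, and the depth and pose conditions would rely on dedicated estimators (e.g., a monocular depth network and a pose detector) that are omitted here.

    # Illustrative preparation of edge, low-resolution, and masked-image conditions.
    import numpy as np
    import cv2

    def make_conditions(image_bgr: np.ndarray) -> dict:
        """image_bgr: an 8-bit BGR image array (H, W, 3)."""
        h, w = image_bgr.shape[:2]
        conds = {}
        # Edge condition: Canny edge map on the grayscale image (thresholds are illustrative).
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        conds["edge"] = cv2.Canny(gray, 100, 200)
        # Low-resolution condition: 4x downsample, then upsample back to the original size.
        small = cv2.resize(image_bgr, (w // 4, h // 4), interpolation=cv2.INTER_AREA)
        conds["low_res"] = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
        # Masked-image condition: zero out a random rectangle (inpainting-style input).
        masked = image_bgr.copy()
        y0, x0 = np.random.randint(0, h // 2), np.random.randint(0, w // 2)
        masked[y0:y0 + h // 2, x0:x0 + w // 2] = 0
        conds["masked"] = masked
        return conds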

Findings and Conclusion

The experiments revealed that while DM-based ControlNets can endow CMs with high-level semantic control, they often fail to manage low-level details and realism. Conversely, ControlNets trained for CMs via consistency training showed superior conditional image generation. Transferring DM-based ControlNets to CMs was notably improved by the unified adapter, which achieved better visual outcomes. These results illustrate the potential of tailored training strategies for integrating conditional controls into CMs and highlight a methodical path forward for image generation.
