
One-Step Image Translation with Text-to-Image Models

(2403.12036)
Published Mar 18, 2024 in cs.CV, cs.GR, and cs.LG

Abstract

In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like ControlNet for Sketch2Photo and Edge2Image, but with a single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at https://github.com/GaParmar/img2img-turbo.

Figure: The one-step model maps noise to image; SD2.1-Turbo features create a coherent layout.

Overview

  • The paper introduces a novel one-step image-to-image translation method using text-conditional diffusion models adapted with adversarial learning objectives, aiming to overcome slow inference speeds and dataset dependency of existing conditional diffusion models.

  • The proposed architecture consolidates separate modules into a single end-to-end generator network, employing skip connections, LoRA weights, and direct conditioning input to preserve input image details and reduce overfitting while enhancing efficiency.

  • Extensive experimental results demonstrate superior quality and efficiency of the method across various tasks, including paired and unpaired image translation, significantly improving over existing GAN-based and diffusion-based methods.

One-Step Image Translation with Text-to-Image Models

The paper "One-Step Image Translation with Text-to-Image Models" addresses crucial challenges faced by existing conditional diffusion models. These models are often burdened with slow inference speeds owing to iterative denoising processes and dependency on paired datasets, which are not always available. This work introduces a methodology for adapting a single-step diffusion model to new tasks and domains using adversarial learning objectives, focusing primarily on image-to-image translation tasks.

Background and Motivation

Conditional diffusion models have significantly advanced the state-of-the-art in image generation, enabling users to generate images based on spatial conditioning and text prompts. However, the iterative nature of these models poses a significant limitation for real-time applications due to slow inference speeds. Additionally, the requirement for extensive paired datasets for model training further limits the practical applicability of these models. The introduction of methods to overcome these challenges is both timely and necessary.

Methodology

The authors propose a novel one-step image-to-image translation method applicable to both paired and unpaired settings. Their approach leverages pre-trained text-conditional diffusion models like SD-Turbo, adapting them to new domains and tasks through adversarial learning objectives. The architecture consolidates the separate modules of the vanilla latent diffusion model into a single end-to-end generator network. This integration maintains the structure of the input image and reduces overfitting by using a smaller set of trainable weights.
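
To make the consolidated generator concrete, here is a rough, illustrative sketch of a single-step translation forward pass assembled from an off-the-shelf SD-Turbo checkpoint with the diffusers library. The checkpoint id, the fixed timestep, and the epsilon-prediction assumption are illustrative choices rather than the authors' released code, and the skip connections and LoRA adapters described under Key Contributions below are omitted.

    import torch
    from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
    from transformers import CLIPTextModel, CLIPTokenizer

    MODEL_ID = "stabilityai/sd-turbo"  # one-step distilled backbone

    vae = AutoencoderKL.from_pretrained(MODEL_ID, subfolder="vae")
    unet = UNet2DConditionModel.from_pretrained(MODEL_ID, subfolder="unet")
    tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID, subfolder="tokenizer")
    text_encoder = CLIPTextModel.from_pretrained(MODEL_ID, subfolder="text_encoder")
    scheduler = DDPMScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")

    @torch.no_grad()
    def one_step_translate(image, prompt, t=999):
        """image: (B, 3, H, W) tensor in [-1, 1]; returns a decoded image tensor."""
        # Encode the conditioning image directly into latent space
        # (fed in place of a random noise map).
        lat = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

        # Text conditioning, e.g. "driving in the night".
        ids = tokenizer(prompt, padding="max_length",
                        max_length=tokenizer.model_max_length,
                        return_tensors="pt").input_ids
        txt = text_encoder(ids)[0]

        # A single UNet evaluation replaces the iterative denoising loop.
        eps = unet(lat, t, encoder_hidden_states=txt).sample

        # One-step conversion to a clean latent, assuming epsilon-prediction
        # (check the checkpoint's scheduler config before relying on this).
        a_bar = scheduler.alphas_cumprod[t]
        lat0 = (lat - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()

        # Decode back to pixel space with the VAE decoder.
        return vae.decode(lat0 / vae.config.scaling_factor).sample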

Key Contributions:

  1. Direct Conditioning Input: Unlike standard diffusion adapters such as ControlNet, the proposed model feeds the conditioning image directly to the network. In one-step models the noise map largely dictates the output layout, so keeping a separate noise input conflicts with the structure of the conditioning image; replacing the noise with the input image removes this conflict.
  2. Skip Connections and Adaptation: The architecture adds skip connections between the encoder and decoder through zero-initialized convolutions (zero-convs), helping preserve high-frequency details of the input image. Adaptation to the new domain is handled by lightweight LoRA (Low-Rank Adaptation) weights rather than fine-tuning the full original network; a minimal adapter sketch follows this list.
  3. Adversarial Objectives in Unpaired Settings: For unpaired translation tasks, such as day-to-night conversion, the CycleGAN-Turbo variant significantly outperforms existing methods in both distribution matching and preservation of input structure. The approach combines cycle-consistency and adversarial losses to robustly map images from a source domain to a target domain; a loss sketch also follows this list.
  4. Efficiency and Quality: The method reduces the inference process to a single step, drastically improving efficiency without compromising the visual quality of the generated images. For paired settings, the pix2pix-Turbo variant achieves results on par with state-of-the-art methods while maintaining single-step inference capability.
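
As referenced in item 2 above, the sketch below illustrates the two adaptation mechanisms: zero-initialized skip convolutions between encoder and decoder stages, and LoRA adapters on the UNet attention layers. It assumes the diffusers and peft libraries; the rank, target modules, and channel widths are placeholder choices, not the paper's exact configuration.

    import torch.nn as nn
    from diffusers import UNet2DConditionModel
    from peft import LoraConfig

    unet = UNet2DConditionModel.from_pretrained("stabilityai/sd-turbo", subfolder="unet")
    unet.requires_grad_(False)  # the pre-trained backbone stays frozen

    def zero_conv(channels):
        # 1x1 convolution initialised to zero: the skip branch contributes
        # nothing at the start of training and is learned gradually.
        conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(conv.weight)
        nn.init.zeros_(conv.bias)
        return conv

    # One zero-conv per encoder stage, feeding the matching decoder stage;
    # channel widths follow the standard SD VAE and are only illustrative.
    skip_convs = nn.ModuleList(zero_conv(c) for c in (128, 256, 512, 512))

    # Low-rank adapters on the UNet attention projections keep the set of
    # trainable weights small; rank and target modules are placeholders.
    unet.add_adapter(LoraConfig(r=8, lora_alpha=8,
                                target_modules=["to_q", "to_k", "to_v", "to_out.0"]))

    # Only the freshly injected LoRA weights and the zero-convs are trained.
    trainable = [p for p in unet.parameters() if p.requires_grad]
    trainable += list(skip_convs.parameters())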
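
And as referenced in item 3, the unpaired objective combines adversarial and cycle-consistency terms. The following is a generic CycleGAN-style sketch of the generator-side losses; it simplifies the paper's objective, which among other refinements uses vision-aided (CLIP-feature) discriminators rather than the plain BCE and L1 losses shown here.

    import torch
    import torch.nn.functional as F

    def unpaired_generator_losses(G_ab, G_ba, D_a, D_b, real_a, real_b,
                                  lam_cyc=1.0, lam_idt=1.0):
        # Translate in both directions, e.g. day -> night and night -> day.
        fake_b = G_ab(real_a)
        fake_a = G_ba(real_b)

        # Cycle consistency: translating there and back should reconstruct
        # the original image.
        loss_cyc = (F.l1_loss(G_ba(fake_b), real_a) +
                    F.l1_loss(G_ab(fake_a), real_b))

        # Identity regularization: a target-domain image passed through its
        # own generator should stay (approximately) unchanged.
        loss_idt = (F.l1_loss(G_ab(real_b), real_b) +
                    F.l1_loss(G_ba(real_a), real_a))

        # Adversarial term: the generators try to fool the discriminators of
        # their target domains (BCE-with-logits form).
        logits_b = D_b(fake_b)
        logits_a = D_a(fake_a)
        loss_adv = (F.binary_cross_entropy_with_logits(logits_b, torch.ones_like(logits_b)) +
                    F.binary_cross_entropy_with_logits(logits_a, torch.ones_like(logits_a)))

        return loss_adv + lam_cyc * loss_cyc + lam_idt * loss_idt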

Experimental Results

The paper presents extensive experimental results across standard datasets (e.g., Horse → Zebra, Yosemite Summer → Winter) and complex, high-resolution driving scenes (e.g., day → night, clear → foggy conversions). The results show the method's advantage in both quality (measured by FID and DINO-Struct-Dist) and processing speed compared to multiple strong GAN-based and diffusion-based baselines.

Key Observations:

  • FID and DINO-Struct-Dist Metrics: The proposed method consistently achieves lower FID scores, indicating better alignment with the target-domain distribution, and lower DINO-Struct-Dist values, suggesting superior preservation of the input structure; a minimal FID evaluation sketch follows this list.
  • Efficiency: The single-step inference time (0.13-0.29 seconds per image) is substantially lower than multi-step baselines, demonstrating practical applicability for real-time systems.
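
For the distribution-matching side of the evaluation, FID can be reproduced by comparing a folder of translated outputs against a folder of real target-domain images, for example with the clean-fid package. The directory names below are placeholders; DINO-Struct-Dist, which compares DINO ViT feature self-similarity between input and output, is not shown here.

    # pip install clean-fid
    from cleanfid import fid

    # Placeholder directories: generated day-to-night translations vs. real night images.
    score = fid.compute_fid("outputs/day2night", "data/night_real")
    print(f"FID: {score:.2f}")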

Implications and Future Work

The implications of this research are profound for both theoretical advancements and practical applications in AI. The study demonstrates that single-step diffusion models, when appropriately adapted, can serve as robust and versatile backbones for various image synthesis tasks. This approach opens the door to more efficient real-time applications in areas such as interactive image editing and autonomous driving.

Future Directions:

  • Guidance Control and Negative Prompts: Future research could explore enhanced guidance control mechanisms and support for negative prompts, potentially through guided distillation methods.
  • Memory Optimization: Addressing the memory-intensive nature of cycle-consistency loss and high-capacity generators will be crucial for scaling to higher-resolution images.
  • Real World Applications: Further exploration in dynamic environments, such as video frames for real-time video translation, could amplify this method's impact.

The proposed method’s strong numerical results, combined with its practical efficiency, pave the way for broader applicability of diffusion-based models in real-world tasks, enriching both academic research and industrial applications.

Overall, this paper makes a compelling case for adapting pre-trained text-to-image models for single-step image translation, providing a significant leap forward in addressing the limitations of existing diffusion-based methods.
