- The paper introduces a CLIP-based directional loss for text-guided non-adversarial domain adaptation of StyleGAN generators.
- It employs a dual-generator framework with an adaptive layer-freezing mechanism to maintain latent space alignment and training stability.
- Experimental results on FFHQ, LSUN Church, and AFHQ-Dog demonstrate extensive style and shape transformations without target domain images.
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
The paper "StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators" introduces a sophisticated framework designed for non-adversarial domain adaptation in generative models, particularly focusing on the adaptation of StyleGAN generators. The research outlines a novel methodology that leverages semantic knowledge encapsulated in CLIP (Contrastive Language-Image Pre-training) models to enable the conversion of a StyleGAN generator from one domain to another, guided solely by text prompts, thereby eliminating the traditional necessity for extensive datasets containing images of the target domain.
The approach builds on two components: StyleGAN2 and CLIP. By introducing a directional loss in CLIP's embedding space, the authors guide the domain shift of the image generator without accessing any images from the target domain. The architecture uses two copies of the generator, one frozen and one trainable, to keep their latent spaces aligned and to ensure that images generated from the same latent code differ between domains only along the prescribed CLIP-space direction.
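To make the directional loss concrete, the sketch below shows one plausible PyTorch formulation. It assumes the OpenAI `clip` package and that both generators' outputs have already been resized and normalized for CLIP's input resolution; the helper names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def text_feature(prompt: str) -> torch.Tensor:
    tokens = clip.tokenize([prompt]).to(device)
    return F.normalize(clip_model.encode_text(tokens), dim=-1)

def image_feature(images: torch.Tensor) -> torch.Tensor:
    # images: (N, 3, 224, 224), already resized/normalized for CLIP
    return F.normalize(clip_model.encode_image(images), dim=-1)

def directional_clip_loss(imgs_frozen, imgs_trainable, source_text, target_text):
    """Align the CLIP-space image direction with the CLIP-space text direction."""
    # Direction between the source- and target-domain text descriptions.
    delta_t = F.normalize(text_feature(target_text) - text_feature(source_text), dim=-1)
    # Direction between images produced by the frozen and trainable
    # generators from the same latent codes.
    delta_i = F.normalize(image_feature(imgs_trainable) - image_feature(imgs_frozen), dim=-1)
    # Penalize misalignment between the two directions.
    return (1.0 - F.cosine_similarity(delta_i, delta_t)).mean()
```

Because the loss constrains only a direction rather than a single target embedding, different latent codes can map to different target images, which is what keeps the adapted generator from collapsing to one CLIP-optimal output.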
The directional CLIP loss addresses the adversarial solutions and mode collapse that arise when a global CLIP similarity loss is minimized naively. By aligning the CLIP-space direction between source and target images with the direction between source and target text prompts, the approach preserves diversity and improves training stability. In addition, an adaptive layer-freezing mechanism restricts optimization to the most relevant parts of the network at each training iteration, yielding better results for drastic domain shifts.
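The layer-selection idea can be sketched as follows: rank the generator's layers by how much a brief latent-code optimization toward the target text moves their per-layer W+ styles, then unfreeze only the top-k layers for the current iterations. The attribute names and W+ shapes below are placeholders assumed for illustration, not the actual StyleGAN2 interface.

```python
import torch

def select_layers_to_train(w_before: torch.Tensor, w_after: torch.Tensor, k: int):
    """Pick the k layers whose W+ codes moved most during a short
    CLIP-guided latent optimization step.

    w_before, w_after: (num_layers, w_dim) style codes for one sample
    (averaged over a batch in practice).
    """
    movement = (w_after - w_before).norm(dim=-1)   # per-layer change magnitude
    return torch.topk(movement, k).indices.tolist()

def freeze_all_but(synthesis_layers, trainable_indices):
    """Enable gradients only for the selected layers; freeze the rest."""
    for idx, layer in enumerate(synthesis_layers):
        flag = idx in trainable_indices
        for param in layer.parameters():
            param.requires_grad_(flag)
```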
Compared to existing text-guided synthesis frameworks such as StyleCLIP, the proposed method extends the scope of possible modifications well beyond the pre-trained generator's original domain. StyleGAN-NADA enables substantial changes in style and shape, such as converting photos to paintings or transforming animal species, areas where prior methods remain constrained by the generator's training data.
The experimental section tests the training scheme across diverse scenarios, demonstrating successful adaptation of generators trained on FFHQ, LSUN Church, AFHQ-Dog, and other datasets. The paper verifies that adapted generators retain the structure of the original latent space, so latent-space edits, image-to-image translation, and GAN inversion carry over to the new domain, enabling cross-domain translations that preserve identity.
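As an illustration of how the aligned latent spaces are used, the snippet below samples one batch of latent codes and renders it through both the frozen source generator and the adapted target generator. The `mapping` and `synthesis` attributes are assumed StyleGAN2-style components; this is a usage sketch, not the paper's evaluation pipeline.

```python
import torch

@torch.no_grad()
def paired_domain_samples(G_frozen, G_adapted, n=4, z_dim=512, device="cuda"):
    """Render the same latent codes in the source and target domains."""
    z = torch.randn(n, z_dim, device=device)
    w = G_frozen.mapping(z)               # shared W codes keep the domains aligned
    imgs_source = G_frozen.synthesis(w)   # e.g. photo-realistic faces
    imgs_target = G_adapted.synthesis(w)  # same identities, rendered in the new domain
    return imgs_source, imgs_target
```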
These findings have both practical and theoretical implications for image synthesis research. Practically, StyleGAN-NADA enables creative content generation without large target-domain datasets, letting artists and researchers explore domains limited only by what can be described in a text prompt. Theoretically, the success of purely text-based guidance raises questions about the role of language-supervised models in generative tasks and strengthens the case for leveraging pretrained multi-modal embeddings in visual domains.
The research points to future work on training setups that use no image data at all, relying instead on rich multi-modal embeddings. Equipping generative models with linguistic interfaces could broaden access to content creation, though the textual biases inherited from CLIP warrant continued vigilance. Applying cross-domain synthesis to data-scarce fields, or mitigating CLIP's biases through controlled few-shot fine-tuning, are further practical directions opened by StyleGAN-NADA's framework.