
JoJoGAN: One Shot Face Stylization (2112.11641v4)

Published 22 Dec 2021 in cs.CV

Abstract: A style mapper applies some fixed style to its input images (so, for example, taking faces to cartoons). This paper describes a simple procedure -- JoJoGAN -- to learn a style mapper from a single example of the style. JoJoGAN uses a GAN inversion procedure and StyleGAN's style-mixing property to produce a substantial paired dataset from a single example style. The paired dataset is then used to fine-tune a StyleGAN. An image can then be style mapped by GAN-inversion followed by the fine-tuned StyleGAN. JoJoGAN needs just one reference and as little as 30 seconds of training time. JoJoGAN can use extreme style references (say, animal faces) successfully. Furthermore, one can control what aspects of the style are used and how much of the style is applied. Qualitative and quantitative evaluation show that JoJoGAN produces high quality high resolution images that vastly outperform the current state-of-the-art.

Citations (63)

Summary

  • The paper introduces a novel one-shot face stylization method using GAN inversion and style mixing to rapidly generate high-quality stylized images.
  • It employs a unique perceptual loss based on StyleGAN discriminator activations to preserve critical facial details and overcome traditional loss limitations.
  • User studies indicate a strong preference for JoJoGAN over state-of-the-art methods, highlighting its potential in personalized avatar creation and digital art applications.

JoJoGAN: One-Shot Face Stylization

The paper "JoJoGAN: One Shot Face Stylization" introduces a method for learning a face style mapper from a single example of the desired style. The approach leverages the generative capabilities of a pretrained StyleGAN, reducing stylization to a short fine-tuning procedure. The work is notable both for its methodological simplicity and for its potential applications in artistic rendering and personalized content creation.

Methodology and Contributions

JoJoGAN's methodology centers on a sequence of GAN inversion and style-mixing procedures that yields a substantial paired dataset from a single style reference. This circumvents the traditional requirement for large paired datasets, which are often unattainable or impractical for unique styles such as specific portraits or extreme stylizations. With just one reference image and as little as 30 seconds of training time, JoJoGAN is remarkably efficient at stylization tasks.

The process consists of four key steps (a minimal code sketch follows the list):

  1. GAN Inversion: Mapping the reference style image to a style code whose StyleGAN reconstruction is a realistic, unstylized face.
  2. Training Set Creation: Using StyleGAN's style-mixing property to generate a set of style codes close to the reference, each paired with the reference image.
  3. Fine-tuning StyleGAN: Fine-tuning the generator so the paired codes reproduce the reference, under a novel perceptual loss based on StyleGAN's discriminator activations.
  4. Inference: Stylizing an input image by GAN inversion followed by a forward pass through the fine-tuned StyleGAN.
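
These steps map naturally onto a short fine-tuning loop. The following is a minimal PyTorch-style sketch under stated assumptions: `invert`, `generator.sample_w`, and the mixed layer indices are hypothetical stand-ins for the paper's actual components (a pretrained StyleGAN and an off-the-shelf GAN inversion method), not a released API.

```python
import torch

def jojogan_finetune(generator, invert, perceptual_loss, style_ref,
                     n_pairs=16, mix_layers=range(7, 18), steps=300, lr=2e-3):
    # Step 1: GAN inversion -- obtain a style code w_ref (in W+) whose
    # reconstruction under the unmodified generator is a realistic face.
    w_ref = invert(style_ref)            # assumed shape: (1, n_layers, 512)

    # Step 2: training-set creation -- style-mix random codes with w_ref so
    # each mixed code is "close" to the reference; every mixed code is
    # paired with the single reference image.
    mixed = []
    for _ in range(n_pairs):
        w = generator.sample_w()         # random latent, same shape as w_ref
        w[:, list(mix_layers)] = w_ref[:, list(mix_layers)]
        mixed.append(w)
    ws = torch.cat(mixed, dim=0)

    # Step 3: fine-tune the generator so every mixed code reproduces the
    # reference under a perceptual loss (see the discriminator-feature loss
    # sketched in the next paragraph).
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    target = style_ref.expand(len(ws), *style_ref.shape[1:])
    for _ in range(steps):
        loss = perceptual_loss(generator(ws), target)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Step 4: inference -- stylize any face by inversion followed by a
    # forward pass through the fine-tuned generator:
    #     stylized = generator(invert(input_face))
    return generator
```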

A pivotal component of JoJoGAN is its perceptual loss, which replaces the conventional LPIPS loss. LPIPS requires downsampling StyleGAN's high-resolution outputs before comparison, which discards fine detail; JoJoGAN instead exploits the rich feature space of a pretrained StyleGAN discriminator, which operates natively at the resolution StyleGAN synthesizes. A sketch of such a loss appears below.
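
Concretely, such a loss compares intermediate activations of the pretrained discriminator rather than LPIPS's low-resolution features. The sketch below assumes a hypothetical `disc.features` method exposing those activations; the layers compared and the use of an L1 distance are illustrative choices, not the paper's exact configuration.

```python
import torch.nn.functional as F

def discriminator_perceptual_loss(disc, generated, reference, layers=(1, 3, 5)):
    # `disc.features(x)` is an assumed interface returning the discriminator's
    # intermediate activation maps; a real implementation would hook into a
    # pretrained StyleGAN discriminator. Matching activations at full
    # resolution avoids the lossy downsampling that LPIPS requires.
    feats_gen = disc.features(generated)
    feats_ref = disc.features(reference)
    return sum(F.l1_loss(feats_gen[i], feats_ref[i].detach()) for i in layers)
```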

Numerical Results and Claims

JoJoGAN's qualitative and quantitative evaluations support the superiority of the method over existing state-of-the-art techniques. User studies show a strong preference for JoJoGAN, establishing its effectiveness at transferring delicate style features while preserving the input's identity. Although JoJoGAN does not lead on FID, the authors attribute this to FID's limitations as a measure of stylization fidelity, and the qualitative results show clear improvements in capturing stylization detail.

Implications and Future Work

The implications of JoJoGAN are profound, promising advancements in fields requiring personalized and detailed stylization without necessitating extensive datasets. Practically, it enables applications in personal avatar creation, art restoration, or any creative domain demanding precise stylization. The theoretical underpinnings propose a flexible framework to explore style transfer across various domains, potentially extending beyond human faces to other object classes, as briefly demonstrated with the LSUN-Churches dataset.

JoJoGAN’s reliance on GAN paradigms and StyleGAN architecture sets the stage for further research into generative models that can efficiently generate quality stylizations from minimal data points. Future developments may delve into automating style reference selection and enhancing control over specific style aspects, accommodating even more diverse user requirements. Additionally, exploration into the integration with multimodal inputs, such as text-based style descriptions, could yield richer, more intuitive stylization techniques.

In conclusion, JoJoGAN stands as a significant contribution to the field of generative models, showcasing a pragmatic route towards scalable, efficient, and high-quality style transfer. This paper may serve as a reference point for subsequent innovations in rapid stylization mechanisms, indicating a promising horizon for AI-driven creativity.
