
Abstract

Current face reenactment and swapping methods mainly rely on GAN frameworks, but recent focus has shifted to pre-trained diffusion models for their superior generation capabilities. However, training these models is resource-intensive, and the results have not yet achieved satisfactory performance levels. To address this issue, we introduce Face-Adapter, an efficient and effective adapter designed for high-precision and high-fidelity face editing with pre-trained diffusion models. We observe that face reenactment and swapping both essentially involve combinations of target structure, ID, and attribute, and we aim to sufficiently decouple the control of these factors to achieve both tasks in one model. Specifically, our method contains: 1) a Spatial Condition Generator that provides precise landmarks and background; 2) a plug-and-play Identity Encoder that transfers face embeddings to the text space via a transformer decoder; and 3) an Attribute Controller that integrates spatial conditions and detailed attributes. Face-Adapter achieves comparable or even superior performance in terms of motion control precision, ID retention capability, and generation quality compared to fully fine-tuned face reenactment/swapping models. Additionally, Face-Adapter integrates seamlessly with various StableDiffusion models.

Overview

  • Face-Adapter introduces a novel method for face reenactment and swapping by utilizing pre-trained diffusion models, addressing challenges posed by extreme face poses and subtle facial variations.

  • The approach integrates three key modules: Spatial Condition Generator (SCG) for accurate spatial guidance, Identity Encoder (IE) for precise identity consistency, and Attribute Controller (AC) for preserving essential attributes.

  • Face-Adapter demonstrates significant improvements in performance metrics over state-of-the-art methods in both face reenactment and swapping tasks, offering practical applications in various fields like video production and AR/VR experiences.

Fine-Grained Control of Faces in Diffusion Models

Introduction

The field of face reenactment and swapping has traditionally relied on GAN (Generative Adversarial Network) frameworks. These models deliver reasonable performance but hit significant roadblocks with extreme face poses and subtle facial attribute variations. The paper introduces Face-Adapter, a novel approach that leverages pre-trained diffusion models to achieve higher precision and fidelity in face reenactment and swapping without extensive re-training.

Core Components of Face-Adapter

To tackle the challenges of face reenactment and swapping, Face-Adapter integrates three primary modules:

  1. Spatial Condition Generator (SCG)
  2. Identity Encoder (IE)
  3. Attribute Controller (AC)

Let's break down what each component does.

Spatial Condition Generator (SCG)

The SCG is responsible for providing accurate and adaptive spatial guidance. It achieves this through two sub-modules, sketched in code after the list:

  • 3D Landmark Projector: This extracts and projects 3D facial landmarks by combining identity coefficients from the source image with expression and pose coefficients from the target image.
  • Adapting Area Predictor: This predicts the regions needing regeneration (e.g., the face for swapping or background for reenactment) and assists in maintaining environmental consistency such as lighting and spatial references.
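To make the landmark recombination concrete, here is a minimal sketch assuming a generic 3DMM-style face model; `fit`, `decode`, and `project` are placeholder names for illustration, not the paper's actual API:

```python
def recombine_landmarks(face_model, source_img, target_img):
    """Sketch of the 3D Landmark Projector: mix source identity with
    target expression/pose, then project landmarks to image space.
    `face_model` is a hypothetical 3DMM wrapper, not the paper's code."""
    src = face_model.fit(source_img)   # placeholder: returns id/exp/pose coeffs
    tgt = face_model.fit(target_img)

    # Keep the source's face shape; borrow the target's motion.
    mixed = {"id": src["id"], "exp": tgt["exp"], "pose": tgt["pose"]}

    verts = face_model.decode(mixed)               # (N, 3) mesh vertices
    return face_model.project(verts, tgt["pose"])  # (68, 2) 2D landmarks
```

The Adapting Area Predictor can then be viewed as marking which pixels these landmarks should drive versus which should be copied unchanged from the source frame.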

Identity Encoder (IE)

The IE transfers high-quality face embeddings into the text space through a lightweight transformer decoder. This mapping is crucial for maintaining identity consistency in the generated images. Notably, the design avoids heavy texture encoders or additional identity networks, simplifying the overall architecture.
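A minimal PyTorch sketch of this idea, assuming a 512-d face-recognition embedding and a 768-d text-token space (the dimensions and layer counts here are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """Map one face embedding to a few pseudo text tokens via learnable
    queries cross-attending through a small transformer decoder."""
    def __init__(self, id_dim=512, token_dim=768, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(id_dim, token_dim)           # lift ID embedding
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        layer = nn.TransformerDecoderLayer(d_model=token_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, id_embed):                           # (B, id_dim)
        memory = self.proj(id_embed).unsqueeze(1)          # (B, 1, token_dim)
        queries = self.queries.unsqueeze(0).expand(id_embed.size(0), -1, -1)
        return self.decoder(queries, memory)               # (B, num_tokens, token_dim)
```

The resulting tokens can be concatenated with the diffusion model's text context, so identity is injected through the existing cross-attention layers rather than through a separate identity network.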

Attribute Controller (AC)

This module integrates spatial landmarks and preserves essential attributes such as lighting and hair; a short code sketch follows the list:

  • Spatial Control: Combines static background regions with dynamic target motion landmarks.
  • Attribute Template: Fills in missing attributes using embeddings extracted from a pre-trained CLIP model, scaled down for efficiency.
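The interplay of the two inputs above can be sketched as follows; the function and argument names are assumptions for illustration, not the paper's code:

```python
import torch

def build_conditioning(background, adapt_mask, landmark_map,
                       clip_encoder, attr_proj, attr_image):
    """Compose the Attribute Controller's inputs: a spatial condition
    (static background + target landmarks) and a compact attribute
    embedding from a frozen CLIP image encoder."""
    # Spatial control: copy pixels outside the adapting area, draw the
    # target motion landmarks inside it. All maps are (B, C, H, W).
    spatial = background * (1 - adapt_mask) + landmark_map * adapt_mask

    # Attribute template: frozen CLIP features stand in for attributes
    # (e.g., lighting, hair) that must be filled into the regenerated area.
    with torch.no_grad():
        feats = clip_encoder(attr_image)   # e.g., (B, tokens, clip_dim)
    attr_tokens = attr_proj(feats)         # small projection: "scaled down"

    return spatial, attr_tokens
```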

Numerical Results and Performance

The study compares Face-Adapter to several state-of-the-art (SoTA) methods across tasks like face reenactment and swapping. Let's take a look at the quantitative results:

Face Reenactment

Evaluated on the VoxCeleb1 test set, Face-Adapter outperforms several GAN-based and diffusion-based methods in overall quality and identity preservation, as measured by metrics like PSNR, LPIPS, and FID.

  • PSNR (higher is better): Face-Adapter scored 22.36, closely matching other advanced methods while excelling in background consistency.
  • LPIPS (lower is better): With a score of 0.1281, it shows a significant improvement in perceptual quality over many GAN-based competitors.
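Both metrics are standard, so they are straightforward to reproduce. A minimal sketch using PyTorch and the `lpips` package (the paper's exact crops and normalization may differ):

```python
import torch
import lpips  # pip install lpips

def psnr(pred, gt, max_val=1.0):
    # PSNR = 10 * log10(MAX^2 / MSE); higher means closer to ground truth.
    mse = torch.mean((pred - gt) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

# LPIPS measures distance between deep network features; lower is better.
lpips_fn = lpips.LPIPS(net='alex')
# pred, gt: (B, 3, H, W) tensors scaled to [-1, 1]
# perceptual_distance = lpips_fn(pred, gt)
```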

Face Swapping

On the FaceForensics++ dataset, Face-Adapter demonstrated superior performance, especially in handling substantial facial shape changes and maintaining ID similarity while managing background inpainting effectively.

  • ID Similarity: 96.47, comparable to SoTA GAN models.
  • Gaze Error (lower is better): a score of 0.0607 indicates precise control over fine-grained attributes such as eye direction.
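ID similarity is conventionally the cosine similarity between face-recognition embeddings of the swapped result and the source identity. A minimal sketch, where `face_net` is a placeholder for a pretrained recognizer such as ArcFace:

```python
import torch
import torch.nn.functional as F

def id_similarity(face_net, swapped, source):
    # Cosine similarity of L2-normalized identity embeddings, in [-1, 1];
    # results are often reported rescaled as a percentage.
    e_swap = F.normalize(face_net(swapped), dim=-1)
    e_src = F.normalize(face_net(source), dim=-1)
    return (e_swap * e_src).sum(dim=-1)
```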

Real-World Implications

The practical implications of this research are substantial. Face-Adapter offers a robust, plug-and-play solution for facial reenactment and swapping tasks, potentially reducing the computational overhead and training complexities associated with existing methods. This can be particularly useful in applications like animated movie production, real-time video conferencing enhancements, and AR/VR experiences.

Limitations and Future Directions

While Face-Adapter excels in many areas, it does face challenges in maintaining temporal stability in video sequences, an area ripe for future research. Improving temporal consistency could make this approach even more beneficial for video editing and real-time applications.

Conclusion

Face-Adapter represents a simple yet carefully designed approach to facial reenactment and swapping. By leveraging pre-trained diffusion models and introducing lightweight components (the SCG, IE, and AC), the method delivers high-quality, precise, and computationally efficient results. As the technology evolves, further enhancements to Face-Adapter could extend its use to broader, more dynamic contexts, pushing the boundaries of facial editing in AI.
