A Latent Transformer for Disentangled Face Editing in Images and Videos

Published 22 Jun 2021 in cs.CV | (2106.11895v2)

Abstract: High quality facial image editing is a challenging problem in the movie post-production industry, requiring a high degree of control and identity preservation. Previous works that attempt to tackle this problem may suffer from the entanglement of facial attributes and the loss of the person's identity. Furthermore, many algorithms are limited to a certain task. To tackle these limitations, we propose to edit facial attributes via the latent space of a StyleGAN generator, by training a dedicated latent transformation network and incorporating explicit disentanglement and identity preservation terms in the loss function. We further introduce a pipeline to generalize our face editing to videos. Our model achieves a disentangled, controllable, and identity-preserving facial attribute editing, even in the challenging case of real (i.e., non-synthetic) images and videos. We conduct extensive experiments on image and video datasets and show that our model outperforms other state-of-the-art methods in visual quality and quantitative evaluation. Source codes are available at https://github.com/InterDigitalInc/latent-transformer.


Authors (4)

Citations (74)


Summary

The paper introduces a latent transformation network that enables selective and disentangled manipulation of facial attributes in StyleGAN's latent space.
The paper incorporates identity preservation and attribute regularization constraints, ensuring precise edits with minimal interference on non-target features.
The paper extends its methodology to video editing, offering a stable pipeline for consistent frame-by-frame face editing with high fidelity.


The paper "A Latent Transformer for Disentangled Face Editing in Images and Videos" introduces a method for manipulating facial attributes through the latent space of a generative adversarial network, specifically StyleGAN. The goal is to enable precise, identity-preserving edits to facial attributes in both images and videos, with movie post-production as the motivating application.

Key Contributions

Latent Transformation Network: The authors propose a dedicated latent transformation network that selectively manipulates facial attributes within the latent space of a StyleGAN generator. The network is trained so that edits are disentangled: changing one attribute should affect the others as little as possible.
Disentanglement and Identity Preservation: The paper integrates explicit disentanglement and identity preservation constraints into the loss function, which are crucial for maintaining the individual's identity post-manipulation. This is particularly important for applications that demand high fidelity, such as film editing.
Video Editing Pipeline: The paper also presents a pipeline that extends these editing capabilities to video sequences. By applying a stable and consistent edit to every frame, the pipeline addresses the main difficulty of video editing: preserving identity and appearance across consecutive frames.
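The frame-by-frame idea can be sketched as a short loop. This is only an illustration under stated assumptions: the summary does not spell out the pipeline's individual steps, so `invert`, `edit`, and `generate` are hypothetical caller-supplied callables standing in for the inversion technique, the latent transformation network, and the StyleGAN generator. Any cropping, alignment, or compositing a real pipeline would need is left out.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Latent = Sequence[float]


def edit_video(
    frames: Iterable[Frame],
    invert: Callable[[Frame], Latent],          # hypothetical: frame -> latent code
    edit: Callable[[Latent, float], Latent],    # hypothetical: latent edit with strength alpha
    generate: Callable[[Latent], Frame],        # hypothetical: latent code -> frame
    alpha: float,
) -> List[Frame]:
    """Apply the same attribute edit, with the same strength, to every frame."""
    edited = []
    for frame in frames:
        w = invert(frame)          # project the frame into the latent space
        w_edit = edit(w, alpha)    # identical edit strength for every frame
        edited.append(generate(w_edit))
    return edited


# Toy stand-ins so the sketch runs without any model: a "frame" is a tuple of floats.
toy_frames = [(0.0, 1.0), (2.0, 3.0)]
result = edit_video(
    toy_frames,
    invert=lambda frame: list(frame),
    edit=lambda w, a: [x + a for x in w],
    generate=lambda w: tuple(w),
    alpha=0.5,
)
print(result)
```

Sharing one edit function and one `alpha` across all frames is the simplest way to keep the edit consistent over time; whether the authors add further temporal smoothing is not stated in this summary.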

Methodology

The proposed method first projects real images into the latent space of StyleGAN using an inversion technique. A latent transformation network then applies linear transformations to the resulting latent codes to achieve specific attribute changes. The transformation model is trained with three main objectives:

Classification Loss: Ensures effective manipulation of the target attribute.
Attribute Regularization: Keeps non-target attributes unchanged.
Latent Code Regularization: Preserves identity by keeping the modified latent code close to its original state.

The combination of these objectives results in high-quality, controllable alterations with minimal identity distortion.
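The three objectives above can be illustrated with a small, dependency-free sketch. The specific forms used here are assumptions made for illustration, not details confirmed by this summary: the edit is written as w' = w + alpha * (A w + b), the classification term is a binary cross-entropy on the target attribute, the attribute term is the mean squared change in non-target predictions, the latent term is an L2 distance, and the weights `lam_attr` and `lam_rec` are hypothetical. The logistic `classify` function is a toy stand-in for whatever attribute classifier is actually used.

```python
import math


def linear(A, b, w):
    """Return A @ w + b for a square matrix A given as a list of rows."""
    return [sum(a * x for a, x in zip(row, w)) + b_i for row, b_i in zip(A, b)]


def transform(w, alpha, A, b):
    """Assumed edit form: w' = w + alpha * (A w + b)."""
    direction = linear(A, b, w)
    return [x + alpha * d for x, d in zip(w, direction)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(w, C):
    """Toy attribute classifier: one logistic unit per attribute (rows of C)."""
    return [sigmoid(sum(c * x for c, x in zip(row, w))) for row in C]


def total_loss(w, alpha, A, b, C, target, lam_attr=1.0, lam_rec=1.0):
    """Combine the three objectives; returns (total, (l_cls, l_attr, l_rec))."""
    w_edit = transform(w, alpha, A, b)
    p_before = classify(w, C)
    p_after = classify(w_edit, C)
    eps = 1e-12

    # Classification loss: push the target attribute toward the desired label
    # (label 1 when alpha > 0, label 0 otherwise; an assumption of this sketch).
    y = 1.0 if alpha > 0 else 0.0
    p_t = p_after[target]
    l_cls = -(y * math.log(p_t + eps) + (1.0 - y) * math.log(1.0 - p_t + eps))

    # Attribute regularization: non-target predictions should not move.
    others = [i for i in range(len(C)) if i != target]
    l_attr = sum((p_after[i] - p_before[i]) ** 2 for i in others) / max(len(others), 1)

    # Latent code regularization: edited code stays close to the original.
    l_rec = math.sqrt(sum((e - o) ** 2 for e, o in zip(w_edit, w)))

    return l_cls + lam_attr * l_attr + lam_rec * l_rec, (l_cls, l_attr, l_rec)


# Toy example: 3-dimensional latent code, 2 attributes, editing attribute 0.
w = [0.5, -1.0, 0.25]
A = [[0.0] * 3 for _ in range(3)]      # zero matrix: the edit is a constant shift
b = [1.0, 0.0, 0.0]                    # shift only the first latent coordinate
C = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss, (l_cls, l_attr, l_rec) = total_loss(w, 1.0, A, b, C, target=0)
print(loss, l_cls, l_attr, l_rec)
```

In this toy setup the shift moves only the coordinate that the target classifier reads, so the attribute term is zero while the latent term equals the size of the shift. That mirrors the trade-off the three objectives are meant to balance: a stronger edit lowers the classification term but raises the latent regularization term.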

Experimental Evaluation

Experiments show that the method outperforms existing state-of-the-art approaches such as InterFaceGAN and GANSpace. Those baselines often suffer from entanglement, where changing one attribute inadvertently alters others, whereas the proposed approach provides more accurate and isolated control over facial attributes.

The authors further conducted quantitative assessments using metrics for target attribute change rate, attribute preservation rate, and identity preservation score. Their method showed a superior balance between attribute change and identity preservation, affirming its effectiveness.
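To make the three metrics concrete, here is a minimal sketch of how such quantities could be computed. The definitions are assumptions, since this summary does not give the paper's exact formulas: attribute scores are taken to be classifier probabilities binarized at a hypothetical threshold of 0.5, and the identity score is taken to be the mean cosine similarity between face embeddings of the original and edited images.

```python
import math


def change_rate(target_before, target_after, threshold=0.5):
    """Fraction of samples whose binarized target attribute flips after editing."""
    flips = sum(
        (b >= threshold) != (a >= threshold)
        for b, a in zip(target_before, target_after)
    )
    return flips / len(target_before)


def preservation_rate(others_before, others_after, threshold=0.5):
    """Fraction of (sample, non-target attribute) pairs whose binarized label is unchanged."""
    kept = 0
    total = 0
    for row_b, row_a in zip(others_before, others_after):
        for b, a in zip(row_b, row_a):
            kept += (b >= threshold) == (a >= threshold)
            total += 1
    return kept / total


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def identity_score(emb_before, emb_after):
    """Mean cosine similarity between face embeddings before and after editing."""
    sims = [cosine_similarity(u, v) for u, v in zip(emb_before, emb_after)]
    return sum(sims) / len(sims)


# Toy scores for four edited samples (made-up numbers, for illustration only).
print(change_rate([0.2, 0.3, 0.8, 0.1], [0.9, 0.4, 0.7, 0.6]))
print(preservation_rate([[0.9, 0.1], [0.2, 0.8]], [[0.8, 0.6], [0.1, 0.9]]))
print(identity_score([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [2.0, 0.0]]))
```

Reporting the three numbers together matters because they pull against each other: an edit strong enough to flip the target attribute on every sample tends to lower both preservation and identity scores.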

Practical and Theoretical Implications

From a practical standpoint, this technique could significantly enhance post-production processes by providing artists with fine-grained control over facial edits, improving the efficiency and quality of media content refinement. Theoretically, it advances the understanding of disentangled representations in the latent spaces of generative models and their applications in real-world data manipulation.

Future Directions

The paper identifies limitations with extreme poses and expressions and suggests possible improvements. Future work could involve joint training of the encoder and generator, or refining the training dataset to better cover diverse facial orientations and attributes. Extending these techniques beyond facial attributes to other domains is another open direction.

This paper contributes valuable insights and tools for the multimedia and AI communities, providing a robust framework for disentangled facial attribute editing in both static and dynamic contexts.
