Arc2Face: A Foundation Model of Human Faces

(arXiv:2403.11641)
Published Mar 18, 2024 in cs.CV

Abstract

This paper presents Arc2Face, an identity-conditioned face foundation model which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a greater degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g., FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to models trained on existing synthetic datasets.

Arc2Face conditions Stable Diffusion on ArcFace ID embeddings, routed through the CLIP text encoder, to generate identity-consistent faces.

Overview

  • Arc2Face is a foundation model specialized in generating high-fidelity images of human faces, preserving facial identity by conditioning generation on identity embeddings.

  • The model is trained on the WebFace42M dataset, significantly larger than the datasets used in prior work, leading to higher quality and diversity in generated images.

  • By adapting the Stable Diffusion model to operate with ArcFace identity embeddings and optimizing it with high-resolution upscaled datasets, Arc2Face surpasses previous methods in generating identity-consistent images.

  • Arc2Face demonstrates potential for broad applications, including enhancing face recognition systems and controlled facial attribute manipulation, while raising questions on ethical usage and diversity representation.

Arc2Face: Constructing a Foundation Model for Human Faces

Introduction

Generative models for facial image synthesis have advanced significantly with the development of Generative Adversarial Networks (GANs), specifically StyleGAN and its successors. Despite these successes, however, challenges such as maintaining identity consistency in generated images persist. Recently, diffusion models have shown capabilities beyond modeling and sampling image distributions, particularly when guided by identity features such as those from ArcFace, opening a new direction in subject-specific image generation. Arc2Face addresses the challenge of generating high-fidelity images conditioned on facial identity embeddings, leveraging the largest public dataset for face recognition, WebFace42M, to train a robust model that advances the state of the art in identity-consistent image synthesis.

Related Work

Work related to Arc2Face spans generative models, facial image synthesis, and the use of identity embeddings. Notably, style-based GANs represented a significant leap in image generation quality, albeit with limited control over identity attributes. The advent of diffusion models marked a significant milestone due to their ability to sample high-quality images conditioned on textual descriptions, with extensions enabling subject-specific manipulation. However, existing methods that integrate CLIP features with identity embeddings struggle to generate identity-consistent faces, a gap Arc2Face aims to bridge.

Methodology

Arc2Face introduces a novel approach to generating photorealistic images of human faces from identity embeddings. The method builds on a pre-trained Stable Diffusion model, adapting it to conditionally generate images based solely on identity vectors from the ArcFace model. By meticulously upsampling and curating a significant portion of the WebFace42M dataset, Arc2Face leverages high-resolution facial images with a wide range of identity and intra-class variability to achieve a robust identity-to-face foundation model. Notably, the model abstains from combining identity vectors with textual embeddings, addressing the issues seen in text-augmented models where identity and text are entangled.
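
To make the conditioning pathway concrete, the following is a minimal sketch, not the authors' code: it extracts a 512-d ArcFace embedding with insightface's "buffalo_l" model and maps it to a sequence of conditioning tokens via a hypothetical linear projection. Arc2Face itself routes the ID vector through the fine-tuned CLIP text encoder of Stable Diffusion; the `IDProjector` module here is a simplified stand-in for that step, and the image path is a placeholder.

```python
import cv2
import torch
import torch.nn as nn
from insightface.app import FaceAnalysis

# 1. Extract a 512-d, L2-normalized ArcFace identity embedding.
app = FaceAnalysis(name="buffalo_l")            # detector + ArcFace recognizer
app.prepare(ctx_id=0, det_size=(640, 640))
faces = app.get(cv2.imread("face.jpg"))         # "face.jpg" is a placeholder path
id_embed = torch.from_numpy(faces[0].normed_embedding).float().unsqueeze(0)  # (1, 512)

# 2. Hypothetical projection of the ID vector into the 77x768 token sequence
#    consumed by the UNet's cross-attention layers (Arc2Face instead feeds the
#    ID vector through the CLIP text encoder; this linear map is illustrative).
class IDProjector(nn.Module):
    def __init__(self, id_dim=512, token_dim=768, num_tokens=77):
        super().__init__()
        self.num_tokens, self.token_dim = num_tokens, token_dim
        self.proj = nn.Linear(id_dim, token_dim * num_tokens)

    def forward(self, e):                       # e: (B, 512)
        return self.proj(e).view(-1, self.num_tokens, self.token_dim)

cond = IDProjector()(id_embed)                  # (1, 77, 768) conditioning tokens
```

The design choice this reflects is that the only person-specific input is the recognition embedding itself; no caption or prompt describing the face is required.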

Dataset and Training

The scarcity of high-quality, high-resolution datasets with sufficient identity diversity poses a hurdle in training effective ID-conditioned models. Arc2Face circumvents this by upscaling the WebFace42M dataset using a state-of-the-art face restoration network (GFPGAN), thus generating a high-resolution version fit for training the model. The subsequent fine-tuning process on additional high-quality datasets like FFHQ and CelebA-HQ refines the model’s ability to generate detailed and photorealistic facial images.
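
As a rough illustration of the restoration step, the sketch below runs GFPGAN over a directory of aligned face crops using its public Python API. The weight file, upscale factor, and directory names are assumptions for illustration; the paper's exact restoration settings may differ.

```python
import os
import cv2
from gfpgan import GFPGANer

# GFPGAN restorer; the weight file and upscale factor are illustrative choices.
restorer = GFPGANer(
    model_path="GFPGANv1.4.pth",
    upscale=2,
    arch="clean",
    channel_multiplier=2,
    bg_upsampler=None,          # face crops only, no background to upsample
)

src_dir, dst_dir = "webface_crops", "webface_restored"   # placeholder paths
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    img = cv2.imread(os.path.join(src_dir, name))
    # Aligned mode operates at 512x512, so upsize the low-resolution
    # recognition crops (e.g. 112x112) before restoration.
    img = cv2.resize(img, (512, 512), interpolation=cv2.INTER_LINEAR)
    _, restored_faces, _ = restorer.enhance(
        img, has_aligned=True, only_center_face=False, paste_back=True
    )
    cv2.imwrite(os.path.join(dst_dir, name), restored_faces[0])
```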

Results and Discussion

Arc2Face demonstrates superior performance in generating facial images that remain faithful to the input identity embeddings, significantly outperforming existing methods on identity preservation metrics. The model's efficacy is further illustrated by its ability to support diverse applications such as synthetic data generation for face recognition training, where it notably improves performance on benchmark datasets. Integration with ControlNet exemplifies the model's flexibility in generating images with controlled facial attributes, enhancing its utility in real-world applications.
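
For a sense of the ControlNet mechanism, the sketch below shows generic spatial conditioning with the diffusers library. It uses a stock Stable Diffusion 1.5 checkpoint and an off-the-shelf pose ControlNet purely to illustrate the plumbing, not Arc2Face's released pipeline; in Arc2Face, the text prompt is replaced by the projected ArcFace embedding and the control image encodes the target head pose.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Stock pose-conditioned ControlNet attached to a standard SD 1.5 pipeline;
# Arc2Face attaches an analogous ControlNet to its ID-conditioned backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

pose_image = load_image("pose_condition.png")   # placeholder control image
result = pipe(
    prompt="photo of a person",                 # Arc2Face uses ID tokens instead
    image=pose_image,
    num_inference_steps=30,
).images[0]
result.save("controlled_face.png")
```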

Future Directions and Impact

Arc2Face represents a significant advance in the generative model domain, specifically for applications requiring high fidelity in identity preservation. Its foundations open up numerous possibilities for future research and development, including improved diversity representation, applications in digital media and entertainment, and ethical considerations in synthetic content generation. Importantly, the public release of the model encourages broader engagement with its capabilities and its responsible use across fields.

Conclusion

Arc2Face sets a new benchmark for generating photorealistic, identity-consistent images of human faces. By leveraging a large-scale, upscaled dataset and using identity embeddings as the sole condition, it addresses key challenges in the field and opens new avenues for exploration in both academic research and industrial development.
