StableIdentity: Inserting Anybody into Anywhere at First Sight

(2401.15975)
Published Jan 29, 2024 in cs.CV

Abstract

Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity is still an intractable problem. Existing methods cannot ensure stable identity preservation and flexible editability, even with several images for each subject during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. More specifically, we employ a face encoder with an identity prior to encode the input face, and then land the face representation into a space with an editable prior, which is constructed from celeb names. By incorporating identity prior and editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate our method outperforms previous customization methods. In addition, the learned identity can be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject the identity learned from a single image into video/3D generation without finetuning. We believe that the proposed StableIdentity is an important step to unify image, video, and 3D customized generation models.

Overview

  • StableIdentity presents a novel method for transferring a person's identity from a single photo into various contexts, maintaining consistency and allowing customization.

  • It encodes the input face with a pretrained face encoder carrying an identity prior, then lands the representation in an editability prior space constructed from celebrity names, yielding a stable yet editable identity representation.

  • The method incorporates a masked two-phase diffusion loss to sharpen pixel-level facial detail while preserving the diversity of generation across contexts.

  • It demonstrates superior performance in maintaining identity over other methods and can be applied to video and 3D models without additional tuning.

  • This technology has wide implications, potentially revolutionizing digital identity creation in entertainment, virtual reality, and beyond.

Overview

StableIdentity introduces a novel approach for inserting a target subject's identity, taken from a single image, into diverse contexts guided by textual descriptions. The method not only preserves identity attributes with remarkable consistency but also offers flexible editability across applications such as personalized portraits, virtual try-ons, and art and design.

Methodology

At the core of StableIdentity lies a face encoder integrated with an identity prior. The encoder is pretrained for face recognition, and its features are used to encode the identity of the input face image. The encoded identity is then landed in an editable prior space constructed from celebrity names. These names, abundant in the training data of large text-to-image models, carry a rich editability prior, so an identity placed in their embedding space can be recontextualized consistently across prompts. By integrating the identity prior and the editability prior in a single model, the authors address the earlier trade-off between identity preservation and flexible customization.
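The snippet below is a minimal sketch of this idea, assuming a frozen ArcFace-style face encoder and a CLIP-style text encoder with 768-dimensional word embeddings; the class name, MLP shape, and AdaIN-style alignment are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class IdentityToPrompt(nn.Module):
    """Sketch: map a frozen face-encoder feature into the text-encoder
    word-embedding space, aligned with the statistics of celeb-name
    embeddings so the new identity tokens inherit that editable prior."""

    def __init__(self, celeb_embeddings, face_dim=512, token_dim=768, num_tokens=2):
        super().__init__()
        self.num_tokens, self.token_dim = num_tokens, token_dim
        self.mlp = nn.Sequential(
            nn.Linear(face_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, num_tokens * token_dim),
        )
        # celeb_embeddings: (num_names, token_dim) rows taken from the frozen
        # text encoder; their per-dimension statistics define the target space.
        self.register_buffer("celeb_mean", celeb_embeddings.mean(dim=0))
        self.register_buffer("celeb_std", celeb_embeddings.std(dim=0))

    def forward(self, face_feature):
        # face_feature: (batch, face_dim) from a pretrained face recognizer.
        tokens = self.mlp(face_feature).view(-1, self.num_tokens, self.token_dim)
        # AdaIN-style landing: normalize each predicted token, then rescale it
        # with the celeb-name statistics so it lies in the editable region.
        mu = tokens.mean(dim=-1, keepdim=True)
        sigma = tokens.std(dim=-1, keepdim=True) + 1e-6
        tokens = (tokens - mu) / sigma
        return tokens * self.celeb_std + self.celeb_mean
```

In this assumed setup, the output embeddings would be registered as placeholder tokens in the prompt (for example "<id0> <id1> person"), and only the mapper would be optimized against the diffusion loss described next.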

The approach is further augmented by a masked two-phase diffusion loss. This loss strengthens the model's pixel-level perception of the input face, so that facial details remain precise and the diversity of generation does not compromise the inherent identity features.
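A minimal sketch of how such a loss could be implemented is shown below, assuming a diffusers-style UNet and noise scheduler operating on latents; the timestep split, the face mask, and the equal weighting of the two terms are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_two_phase_loss(unet, latents, mask, cond, scheduler, phase_split=0.6):
    """Sketch of a masked two-phase diffusion loss (assumed form): all
    timesteps use the ordinary noise-prediction objective, and low-noise
    timesteps add a face-masked x0 reconstruction term to sharpen
    pixel-level identity. `mask` is 1 on the face region, at latent size."""
    b = latents.shape[0]
    T = scheduler.config.num_train_timesteps
    t = torch.randint(0, T, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    # Phase 1: standard denoising objective on the predicted noise.
    eps_pred = unet(noisy, t, encoder_hidden_states=cond).sample
    loss_noise = F.mse_loss(eps_pred, noise)

    # Phase 2: for timesteps below the split, recover x0 from the predicted
    # noise and penalize the error only inside the face mask.
    alpha_bar = scheduler.alphas_cumprod.to(latents.device)[t].view(b, 1, 1, 1)
    x0_pred = (noisy - (1 - alpha_bar).sqrt() * eps_pred) / alpha_bar.sqrt()
    in_phase2 = (t.float() < phase_split * T).float().view(b, 1, 1, 1)
    loss_masked = ((x0_pred - latents) ** 2 * mask * in_phase2).mean()

    return loss_noise + loss_masked
```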

Experimental Results

Extensive experiments showcase superior performance over previous customization methods, with a prominent ability to maintain identity consistency. The method combines readily with existing image-level modules such as ControlNet, and it generalizes to injecting the identity learned from a single image into video or 3D generation without further fine-tuning.
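As a rough illustration of how a learned identity could be reused with an off-the-shelf module, the sketch below registers two identity embeddings as placeholder tokens in a diffusers ControlNet pipeline; the model IDs, token names, and the random stand-in embeddings are assumptions for demonstration, not the authors' released code.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Stand-ins: in practice these come from the identity mapper and a pose detector.
id_tokens = torch.randn(2, 768)             # assumed learned identity embeddings
pose_image = Image.new("RGB", (512, 512))   # assumed control image (pose map)

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)

# Register the identity as two new placeholder tokens in the text encoder.
placeholders = ["<id0>", "<id1>"]
pipe.tokenizer.add_tokens(placeholders)
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_ids = pipe.tokenizer.convert_tokens_to_ids(placeholders)
with torch.no_grad():
    embedding_matrix = pipe.text_encoder.get_input_embeddings().weight
    for i, tid in enumerate(token_ids):
        embedding_matrix[tid] = id_tokens[i]

# The identity can now be used like any other word, here with pose control.
image = pipe(
    "a photo of <id0> <id1> person as an astronaut, best quality",
    image=pose_image,
).images[0]
```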

Implications and Future Directions

The significance of such a framework is manifold. The capability to combine identity priors and editability into a unified architecture is a remarkable stride in the field of human-centric generation. It is not just the preservation of identity or the fidelity of the output that is laudable but the efficiency with which these results are achieved. StableIdentity's ability to extend identity-driven customization to video and 3D models without the need for elaborate fine-tuning demonstrates a potential paradigm shift in how personalized content can be generated.

The implications of this technology extend to various domains, from entertainment and personal digital content creation to potential applications in virtual reality and AI-driven avatar creation. Moving forward, this approach could transform the nexus between personalized digital identity and a multitude of virtual platforms, making identity a flexible, yet stable construct, adaptable to contexts limited only by textual creativity.
