PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding (2312.04461v1)

Published 7 Dec 2023 in cs.CV, cs.AI, cs.LG, and cs.MM

Abstract: Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts. However, existing personalized generation methods cannot simultaneously satisfy the requirements of high efficiency, promising identity (ID) fidelity, and flexible text controllability. In this work, we introduce PhotoMaker, an efficient personalized text-to-image generation method, which mainly encodes an arbitrary number of input ID images into a stack ID embedding for preserving ID information. Such an embedding, serving as a unified ID representation, can not only encapsulate the characteristics of the same input ID comprehensively, but also accommodate the characteristics of different IDs for subsequent integration. This paves the way for more intriguing and practically valuable applications. Besides, to drive the training of our PhotoMaker, we propose an ID-oriented data construction pipeline to assemble the training data. Under the nourishment of the dataset constructed through the proposed pipeline, our PhotoMaker demonstrates better ID preservation ability than test-time fine-tuning based methods, yet provides significant speed improvements, high-quality generation results, strong generalization capabilities, and a wide range of applications. Our project page is available at https://photo-maker.github.io/


Summary

The paper introduces stacked ID embedding to fuse multiple input images, efficiently preserving unique identity characteristics in generated photos.
It proposes an ID-oriented data construction pipeline to assemble the training data that the model requires.
The method reports better ID preservation than test-time fine-tuning methods such as DreamBooth, runs significantly faster, and enables flexible identity mixing.

Overview of Personalized Text-to-Image Generation

Text-to-image generation has advanced significantly in recent years, to the point where models can synthesize realistic human photos that match a given text prompt. PhotoMaker builds on this progress to improve personalized generation: it preserves the identity (ID) of the person shown in the input images while following the text prompt. According to the authors, existing personalized methods cannot simultaneously deliver high efficiency, ID fidelity, and flexible text controllability; PhotoMaker is designed to provide all three.

Methodology and Approach

The PhotoMaker methodology centers on what the authors call a "stacked ID embedding." An arbitrary number of input ID images are encoded into a single, unified ID representation. This representation can capture the characteristics of one ID comprehensively, and it can also accommodate the characteristics of different IDs so that they can be integrated later. Because the embedding is computed directly from the input images, PhotoMaker stands in sharp contrast to test-time fine-tuning methods such as DreamBooth, which need substantial compute and time to customize for each new identity. The second key component is an ID-oriented data construction pipeline, which assembles the dataset used to train the model.
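The stacking idea can be illustrated with a minimal sketch. It is not the paper's implementation: `toy_encode` is a hypothetical stand-in for a real image encoder, the images are plain lists of numbers, and the assumption made here is that "stacking" means keeping one embedding row per input image rather than averaging them into a single vector.

```python
from typing import List

Vector = List[float]


def toy_encode(image: List[float], dim: int = 4) -> Vector:
    """Hypothetical stand-in for an image encoder.

    Folds the pixel values into `dim` buckets; a real system would use a
    learned encoder here.
    """
    out = [0.0] * dim
    for i, px in enumerate(image):
        out[i % dim] += px
    return out


def stack_id_embeddings(images: List[List[float]], dim: int = 4) -> List[Vector]:
    """Encode each ID image and stack the results, one row per image.

    The result has shape (number of images, dim), so any number of input
    images yields a single representation of the same ID.
    """
    if not images:
        raise ValueError("at least one ID image is required")
    return [toy_encode(img, dim) for img in images]


# Three toy "images" of the same identity.
images = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
stacked = stack_id_embeddings(images)
print(len(stacked), len(stacked[0]))  # 3 4
```

The point of the sketch is only the shape of the result: the number of rows grows with the number of input images, while each row keeps a fixed width.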

Capabilities and Applications

PhotoMaker supports several applications: changing a person's attributes, morphing characters from artworks, and merging multiple identities into one. In identity mixing, the generated photo realistically retains aspects of multiple input identities. Users can adjust the merge ratio of different IDs either by controlling the share of each ID's images in the input sample pool or by using prompt weighting.
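The first of these two controls, the share of images per ID in the input pool, can be sketched as follows. This is an illustrative helper under stated assumptions, not part of PhotoMaker: the function name `build_mixed_pool`, its parameters, and the file names are all hypothetical.

```python
from typing import List, Sequence


def build_mixed_pool(
    id_a: Sequence[str],
    id_b: Sequence[str],
    share_a: float,
    pool_size: int,
) -> List[str]:
    """Build an input pool in which `share_a` of the images come from ID A.

    Hypothetical helper: the remaining slots are filled with images of ID B.
    """
    if not 0.0 <= share_a <= 1.0:
        raise ValueError("share_a must be between 0 and 1")
    n_a = round(share_a * pool_size)
    n_b = pool_size - n_a
    if n_a > len(id_a) or n_b > len(id_b):
        raise ValueError("not enough images for the requested share")
    return list(id_a[:n_a]) + list(id_b[:n_b])


# Hypothetical file names for two identities.
person_a = ["a1.jpg", "a2.jpg", "a3.jpg"]
person_b = ["b1.jpg", "b2.jpg"]

pool = build_mixed_pool(person_a, person_b, share_a=0.75, pool_size=4)
print(pool)  # ['a1.jpg', 'a2.jpg', 'a3.jpg', 'b1.jpg']
```

With three of the four pool images taken from the first identity, the mixed result would be expected to lean toward that identity; changing `share_a` shifts the balance.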

Conclusion and Implications

In summary, PhotoMaker is an efficient method for generating personalized human images that are realistic and preserve ID fidelity. Its ability to quickly generate diverse images from text prompts makes it a notable step forward in personalized image generation, with potential applications ranging from entertainment to virtual reality. A technology this capable also raises ethical concerns: PhotoMaker and similar methods must be used responsibly, with potential misuse in mind.