Large text-to-image models achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts. In this work, we present a new approach for "personalization" of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views and lighting conditions that do not appear in the reference images. We apply our technique to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features. We also provide a new dataset and evaluation protocol for this new task of subject-driven generation. Project page: https://dreambooth.github.io/
The paper introduces a method to personalize text-to-image diffusion models for generating realistic images of a specific subject in various scenes.
The ability to generate realistic images from text prompts has seen remarkable progress with the advent of large text-to-image models. Despite their success, a significant limitation of these models is their inability to accurately preserve the appearance of specific subjects across different contexts. The work presented addresses this gap by introducing a novel approach to personalize text-to-image diffusion models, allowing for the generation of photorealistic images of a particular subject in a variety of scenes, poses, and lighting conditions.
At the core of the proposed method is the fine-tuning of a pre-trained text-to-image diffusion model with a small number of images of a specific subject. The process involves embedding the subject into the model's output domain, making it possible to generate novel images of the subject using a unique identifier. Key to this method is a novel loss function, termed autogenous class-specific prior preservation loss, which leverages the semantic prior embedded within the model. This approach ensures that the fine-tuned model can generate diverse renditions of the subject without deviating significantly from its original appearance or the characteristics of its class.
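The combined objective can be sketched in plain Python. This is a toy illustration, not the authors' implementation: `denoise`, the batch layout, and `lam` are stand-ins for the actual diffusion model, latents, and prior-preservation weight.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dreambooth_loss(denoise, subject_batch, prior_batch, lam=1.0):
    """Reconstruction loss on the subject images plus the class-specific
    prior preservation term, weighted by lam.

    Each batch item is (noisy_latent, timestep, prompt_cond, true_noise);
    `denoise(z, t, c)` predicts the noise that was added to the latent.
    """
    subject_term = sum(
        mse(denoise(z, t, c), eps) for z, t, c, eps in subject_batch
    ) / len(subject_batch)
    # Prior term: the same denoising loss, but on samples the *frozen*
    # pretrained model generated from the plain class prompt
    # (e.g. "a photo of a dog"), which anchors the class prior.
    prior_term = sum(
        mse(denoise(z, t, c), eps) for z, t, c, eps in prior_batch
    ) / len(prior_batch)
    return subject_term + lam * prior_term
```

Setting `lam` to zero recovers plain fine-tuning on the subject images alone, which is exactly the regime where overfitting and language drift appear.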
The implementation involves three key steps:
1. Caption the few subject images with a prompt that pairs a unique, rare-token identifier with the subject's class noun (e.g. "a [identifier] dog").
2. Fine-tune the pretrained text-to-image diffusion model on these image-prompt pairs so that the identifier binds to the subject's appearance.
3. Jointly apply the class-specific prior preservation loss, supervising the model with images that the frozen pretrained model generates from the plain class prompt, so that the fine-tuned model retains the class's diversity and avoids language drift.
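Constructing the paired training prompts might look like the sketch below. The token `sks` is a commonly used rare-token identifier in community implementations, an assumption here rather than something mandated by the paper.

```python
def build_prompts(identifier, class_noun, n_subject, n_prior):
    """Pair each subject image with an identifier prompt, and each
    prior-preservation image with the plain class prompt.

    identifier: a rare token bound to the subject (hypothetical example: "sks")
    class_noun: the subject's coarse class (e.g. "dog")
    """
    instance_prompt = f"a photo of {identifier} {class_noun}"
    class_prompt = f"a photo of a {class_noun}"
    return [instance_prompt] * n_subject, [class_prompt] * n_prior

# Example: four subject photos, two hundred frozen-model class samples.
instance, prior = build_prompts("sks", "dog", n_subject=4, n_prior=200)
```

The class prompt deliberately omits the identifier, so the prior-preservation samples supervise only the generic class, not the specific subject.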
The researchers conducted extensive experiments to showcase the versatility of the technique, demonstrating its applicability in recontextualizing subjects, modifying their properties, and generating artistic renditions. Notably, the method proved capable of preserving the unique features of subjects across all generated images. Evaluation metrics, including a newly proposed DINO-based metric for subject fidelity, highlighted the method's efficacy in maintaining both subject and prompt fidelity.
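Subject fidelity in this style of evaluation reduces to the average pairwise cosine similarity between feature embeddings of the real and generated images. A minimal sketch with plain Python vectors, where a real DINO ViT feature extractor would replace the raw vectors assumed here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dino_fidelity(real_feats, gen_feats):
    """Average pairwise cosine similarity between embeddings of real
    subject images and generated images; higher means more faithful."""
    scores = [cosine(r, g) for r in real_feats for g in gen_feats]
    return sum(scores) / len(scores)
```

DINO features suit this comparison because the self-supervised ViT is not trained to collapse different subjects of the same class, so the metric is sensitive to the identity details that class-level metrics miss.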
Comparisons with concurrent work reveal the method's superior capability both in preserving subject identity and in adhering to the prompt; in particular, it outperformed existing methods, including Textual Inversion, across several fidelity metrics.
The paper discusses several limitations, including challenges in specific contexts where model performance may degrade due to weak priors or difficulty in accurately generating the intended environment. Despite these limitations, the work represents a significant step forward in personalized image generation.
The research opens up exciting avenues for future work, including potential applications in generating personalized content and exploring new forms of artistic expression. Additionally, the methodology lays the groundwork for further exploration into the fine-tuning of generative models for personalized applications.
This work presents a groundbreaking approach to personalizing text-to-image diffusion models, enabling the generation of highly realistic and contextually varied images of specific subjects. Through careful fine-tuning and innovative loss functions, the method achieves remarkable success in preserving subject identity across a wide range of generated images, marking a significant advancement in the field of generative AI.