
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

(2208.12242)
Published Aug 25, 2022 in cs.CV, cs.GR, and cs.LG

Abstract

Large text-to-image models achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts. In this work, we present a new approach for "personalization" of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views and lighting conditions that do not appear in the reference images. We apply our technique to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features. We also provide a new dataset and evaluation protocol for this new task of subject-driven generation. Project page: https://dreambooth.github.io/

Overview

  • The paper introduces a method to personalize text-to-image diffusion models for generating realistic images of a specific subject in various scenes.

Fine-Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

Introduction

The ability to generate realistic images from text prompts has seen remarkable progress with the advent of large text-to-image models. Despite their success, a significant limitation of these models is their inability to accurately preserve the appearance of specific subjects across different contexts. The work presented addresses this gap by introducing a novel approach to personalize text-to-image diffusion models, allowing for the generation of photorealistic images of a particular subject in a variety of scenes, poses, and lighting conditions.

Methodology

At the core of the proposed method is the fine-tuning of a pre-trained text-to-image diffusion model with a small number of images of a specific subject. The process involves embedding the subject into the model's output domain, making it possible to generate novel images of the subject using a unique identifier. Key to this method is a novel loss function, termed autogenous class-specific prior preservation loss, which leverages the semantic prior embedded within the model. This approach ensures that the fine-tuned model can generate diverse renditions of the subject without deviating significantly from its original appearance or the characteristics of its class.
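
In the paper's notation, the objective adds a weighted prior-preservation term to the usual denoising reconstruction loss. Here x̂_θ is the denoising network, x is a subject image with conditioning c (a prompt containing the unique identifier), x_pr is an image sampled from the frozen pretrained model for the plain class prompt c_pr, and λ controls the strength of the prior term:

$$
\mathbb{E}_{x, c, \epsilon, \epsilon', t}\Big[
  w_t \,\big\| \hat{x}_\theta(\alpha_t x + \sigma_t \epsilon,\, c) - x \big\|_2^2
  \;+\; \lambda\, w_{t'} \,\big\| \hat{x}_\theta(\alpha_{t'} x_{\mathrm{pr}} + \sigma_{t'} \epsilon',\, c_{\mathrm{pr}}) - x_{\mathrm{pr}} \big\|_2^2
\Big]
$$

The first term fits the model to the few subject images; the second supervises it on its own class-conditioned samples so the class prior is not forgotten during fine-tuning.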

Implementation Details

The implementation involves three key steps:

  1. Subject Embedding: The model is fine-tuned on a handful of subject images paired with prompts of the form "a [V] dog", i.e., a unique identifier followed by the subject's class noun, which implants the subject into the model's output domain.
  2. Rare-token Identifiers: The identifier is built from rare vocabulary tokens, so it carries little pre-existing meaning and the model does not have to disentangle the new subject from a strong prior association.
  3. Prior Preservation Loss: A class-specific prior-preservation loss supervises the fine-tuned model on its own class-conditioned samples, counteracting language drift and the loss of output diversity that would otherwise make it hard to render the subject in varied contexts and viewpoints (a minimal training sketch follows this list).
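
The sketch below illustrates one training step with prior preservation. It is not the paper's implementation (the paper fine-tunes Imagen); it assumes Hugging Face diffusers-style Stable Diffusion components (a `unet`, `vae`, `text_encoder`, and DDPM-style noise `scheduler`), an ε-prediction objective, and equal-sized subject and prior batches.

```python
# Rough sketch of a DreamBooth-style training step with class-specific
# prior preservation, assuming diffusers-style latent-diffusion components.
import torch
import torch.nn.functional as F

def dreambooth_step(unet, text_encoder, vae, scheduler, optimizer,
                    subject_pixels, subject_ids,   # subject images + tokenized "a [V] dog"
                    prior_pixels, prior_ids,       # model-generated class images + tokenized "a dog"
                    prior_weight=1.0):             # lambda in the prior-preservation loss
    # Encode both batches into the latent space (0.18215 is the SD latent scale).
    pixels = torch.cat([subject_pixels, prior_pixels])           # assumes equal batch sizes
    latents = vae.encode(pixels).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)

    # Text conditioning: rare-token identifier prompt for the subject half,
    # plain class prompt for the prior half.
    cond = text_encoder(torch.cat([subject_ids, prior_ids]))[0]

    # Predict the added noise and split the batch back into the two halves.
    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    pred_subj, pred_prior = pred.chunk(2)
    noise_subj, noise_prior = noise.chunk(2)

    # Reconstruction term on the subject + weighted prior-preservation term.
    loss = F.mse_loss(pred_subj, noise_subj) + prior_weight * F.mse_loss(pred_prior, noise_prior)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The prior batch is generated once before fine-tuning by sampling the frozen pretrained model with the class prompt, so the second loss term pulls the model back toward its original class prior rather than toward external data.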

Experiments and Results

The researchers conducted extensive experiments showcasing the versatility of the technique, including recontextualizing subjects, modifying their properties, and generating artistic renditions, all while preserving each subject's key features across the generated images. Quantitative evaluation relies on a proposed DINO-based subject-fidelity metric, favored because self-supervised DINO features are not trained to ignore differences between subjects of the same class, alongside CLIP-based measures of subject and prompt fidelity.
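
Concretely, the DINO subject-fidelity score is the average pairwise cosine similarity between ViT-S/16 DINO embeddings of generated and real subject images. The sketch below approximates that metric using the public facebookresearch/dino torch.hub model; the exact preprocessing and aggregation in the paper's evaluation may differ.

```python
# Approximate DINO subject-fidelity score: mean pairwise cosine similarity
# between self-supervised ViT-S/16 embeddings of real and generated images.
# Preprocessing choices here are assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Self-supervised DINO ViT-S/16 from the public facebookresearch/dino repo.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return F.normalize(dino(batch), dim=-1)   # (N, 384) unit-norm CLS embeddings

def dino_subject_fidelity(real_paths, generated_paths):
    real, gen = embed(real_paths), embed(generated_paths)
    # Mean cosine similarity over all real/generated pairs.
    return (gen @ real.T).mean().item()
```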

Comparative Analysis

Comparisons with concurrent work show that the presented method is better at both preserving subject identity and adhering to the prompt, outperforming existing methods such as Textual Inversion across several fidelity metrics.

Discussion

The paper discusses several limitations: the model can fail to synthesize the prompted context (likely due to a weak prior for that context or the difficulty of generating subject and environment jointly), the subject's appearance can become entangled with the context, and the model can overfit to the input images when the prompt resembles the original setting. Despite these limitations, the work represents a significant step forward in personalized image generation.

Future Directions

The research opens up exciting avenues for future work, including potential applications in generating personalized content and exploring new forms of artistic expression. Additionally, the methodology lays the groundwork for further exploration into the fine-tuning of generative models for personalized applications.

Conclusion

This work presents a groundbreaking approach to personalizing text-to-image diffusion models, enabling the generation of highly realistic and contextually varied images of specific subjects. Through careful fine-tuning and innovative loss functions, the method achieves remarkable success in preserving subject identity across a wide range of generated images, marking a significant advancement in the field of generative AI.
