Abstract

Recent text-to-image generation models have demonstrated incredible success in generating images that faithfully follow input prompts. However, the requirement of using words to describe a desired concept provides limited control over the appearance of the generated concepts. In this work, we address this shortcoming by proposing an approach to enable personalization capabilities in existing text-to-image diffusion models. We propose a novel architecture (BootPIG) that allows a user to provide reference images of an object in order to guide the appearance of a concept in the generated images. The proposed BootPIG architecture makes minimal modifications to a pretrained text-to-image diffusion model and utilizes a separate UNet model to steer the generations toward the desired appearance. We introduce a training procedure that allows us to bootstrap personalization capabilities in the BootPIG architecture using data generated from pretrained text-to-image models, LLM chat agents, and image segmentation models. In contrast to existing methods that require several days of pretraining, the BootPIG architecture can be trained in approximately 1 hour. Experiments on the DreamBooth dataset demonstrate that BootPIG outperforms existing zero-shot methods while being comparable with test-time finetuning approaches. Through a user study, we validate the preference for BootPIG generations over existing methods both in maintaining fidelity to the reference object's appearance and aligning with textual prompts.

Novel architecture, BootPIG, for personalized image generation using a modified latent diffusion model with Reference Self-Attention (RSA) layers.

Overview

  • Salesforce AI Research introduces BootPIG, a novel architecture for zero-shot personalized image generation within pretrained diffusion models.

  • BootPIG employs a dual UNet structure featuring a Base UNet and Reference UNet to incorporate visual features from reference images into the generative process.

  • The architecture introduces a Reference Self-Attention mechanism to facilitate the personalization of generated images without subject-specific finetuning.

  • BootPIG's training is efficient, using synthetic data and completing within about an hour on 16 A100 GPUs, offering a competitive edge over other methods.

  • Evaluations demonstrate that BootPIG outperforms zero-shot methods and is on par with or exceeds finetuned methods in terms of subject fidelity and prompt compliance.

Introduction to BootPIG

Recent advances in generative models have been remarkable, particularly in text-to-image synthesis. While generative diffusion models can reliably create images from textual prompts, one area that requires further research is personalized image generation: the ability to generate images of specific objects within varied, user-defined contexts.

Salesforce AI Research introduces a novel architecture, termed BootPIG, aiming to achieve zero-shot subject-driven generation within pretrained diffusion models. The key innovation of BootPIG is its capacity to inject features from reference images into the generative process, enabling personalized image synthesis without the need for subject-specific finetuning.

Architecture Overview

BootPIG makes minimal modifications to a pretrained text-to-image diffusion model, pairing the original network (the Base UNet) with a second copy, the Reference UNet. The Reference UNet extracts visual features from reference images of the subject, and these features are fed into the attention layers of the Base UNet. Concretely, a novel Reference Self-Attention (RSA) mechanism replaces the standard self-attention layers within the diffusion model: each RSA layer computes queries from the Base UNet's own features while attending over the concatenation of those features with the reference features, allowing personalization cues to be incorporated into the generation process.
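To make the mechanism concrete, below is a minimal PyTorch sketch of an RSA-style layer written from the description above. The class name, projection layout, and dimensions are illustrative assumptions, not BootPIG's actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ReferenceSelfAttention(nn.Module):
    """Sketch of a Reference Self-Attention (RSA) layer (names are assumptions).

    Queries come from the Base UNet's hidden states; keys and values are
    computed over the concatenation of base and reference features, so the
    generation can attend to the reference subject's appearance.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (batch, n_tokens, dim)     features inside the Base UNet
        # ref: (batch, n_ref_tokens, dim) features from the Reference UNet
        kv = torch.cat([x, ref], dim=1)  # attend over base + reference tokens
        b, n, d = x.shape
        h, dh = self.num_heads, d // self.num_heads
        q = self.to_q(x).view(b, n, h, dh).transpose(1, 2)
        k = self.to_k(kv).view(b, -1, h, dh).transpose(1, 2)
        v = self.to_v(kv).view(b, -1, h, dh).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.to_out(out.transpose(1, 2).reshape(b, n, d))
```

Note that with no reference tokens concatenated, this layer reduces to ordinary self-attention, which is what makes swapping an RSA-style layer into a pretrained UNet a minimal change.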

The practicality of BootPIG lies in its bootstrapped training procedure. Instead of relying on a large curated dataset, BootPIG learns personalization from synthetic data: captions written by LLM chat agents, images synthesized by pretrained text-to-image models, and subject masks produced by image segmentation models. Notably, training completes in about an hour on 16 A100 GPUs, a substantial efficiency gain over existing zero-shot methods, which can require days of pretraining, as well as over test-time finetuning methods.
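A rough sketch of how one such synthetic training example might be assembled is shown below. The callables chat_agent, text_to_image, and segment_subject are hypothetical placeholders for the three pretrained components, not actual BootPIG APIs.

```python
def make_training_example(chat_agent, text_to_image, segment_subject):
    """Hypothetical bootstrapped-data step; every name here is a placeholder.

    An LLM writes a caption, a text-to-image model renders it, and a
    segmentation model isolates the subject so the masked crop can serve
    as the reference image for that target.
    """
    caption = chat_agent("Write a caption describing one object in a scene.")
    target = text_to_image(caption)           # synthetic target image (C, H, W)
    mask = segment_subject(target, caption)   # binary foreground mask (1, H, W)
    reference = target * mask                 # masked subject = reference input
    return reference, caption, target
```

Training then proceeds as in a standard diffusion setup: the Base UNet learns to denoise the target conditioned on the caption, while the reference image supplies the features injected through the RSA layers.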

Evaluation and Findings

BootPIG has been evaluated on the standard DreamBooth dataset. The experiments demonstrate that it significantly outperforms existing zero-shot methods in both subject fidelity and prompt fidelity, and that it is competitive with or surpasses test-time finetuning methods. These results are supported by an inference procedure that can leverage multiple reference images, adding detail and nuance to the final synthesis; a sketch follows.
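As a hedged illustration of how several references could feed the same RSA layers, one simple strategy is to extract Reference UNet features per image and average them per layer. This is a plausible sketch under that assumption, not a claim about BootPIG's exact aggregation rule; extract_features is a hypothetical helper returning a list of per-layer feature tensors.

```python
import torch

def pooled_reference_features(reference_unet, ref_images):
    # ref_images: preprocessed tensors showing the same subject.
    # Average each layer's reference features across all images before
    # injecting them into the Base UNet's RSA layers.
    per_image = [reference_unet.extract_features(img) for img in ref_images]
    return [torch.stack(layer).mean(dim=0) for layer in zip(*per_image)]
```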

Moreover, a user study confirms that BootPIG is consistently preferred over established methods in terms of subject fidelity and prompt compliance. These findings underscore BootPIG's potential for wide-ranging applications, including personalized storytelling and design, hinting at a new direction in subject-driven generation research.

Concluding Thoughts on BootPIG

The BootPIG architecture from Salesforce AI Research grants new capabilities to pretrained diffusion models, pushing forward the boundary of personalized image generation. By bootstrapping its training on synthetic data rather than relying on curated datasets, BootPIG stands out for both its efficiency and the quality of its output.

Given the rapid developments in AI image generation, BootPIG represents a significant step in the precision and customization available to users. It allows creators to imagine and visualize their own objects within any number of scenarios, all within a framework that preserves the original diffusion architecture and operates with notable compute efficiency.
