Abstract

Recent text-to-image generation models have demonstrated incredible success in generating images that faithfully follow input prompts. However, the requirement of using words to describe a desired concept provides limited control over the appearance of the generated concepts. In this work, we address this shortcoming by proposing an approach to enable personalization capabilities in existing text-to-image diffusion models. We propose a novel architecture (BootPIG) that allows a user to provide reference images of an object in order to guide the appearance of a concept in the generated images. The proposed BootPIG architecture makes minimal modifications to a pretrained text-to-image diffusion model and utilizes a separate UNet model to steer the generations toward the desired appearance. We introduce a training procedure that allows us to bootstrap personalization capabilities in the BootPIG architecture using data generated from pretrained text-to-image models, LLM chat agents, and image segmentation models. In contrast to existing methods that require several days of pretraining, the BootPIG architecture can be trained in approximately 1 hour. Experiments on the DreamBooth dataset demonstrate that BootPIG outperforms existing zero-shot methods while being comparable with test-time finetuning approaches. Through a user study, we validate the preference for BootPIG generations over existing methods both in maintaining fidelity to the reference object's appearance and aligning with textual prompts.

Novel architecture, BootPIG, for personalized image generation using a modified latent diffusion model with Reference Self-Attention (RSA) layers.

Overview

  • Salesforce AI Research introduces BootPIG, a novel architecture for zero-shot personalized image generation within pretrained diffusion models.

  • BootPIG employs a dual UNet structure featuring a Base UNet and Reference UNet to incorporate visual features from reference images into the generative process.

  • The architecture introduces a Reference Self-Attention mechanism to facilitate the personalization of generated images without subject-specific finetuning.

  • BootPIG's training is efficient, using synthetic data and completing within about an hour on 16 A100 GPUs, offering a competitive edge over other methods.

  • Evaluations demonstrate that BootPIG outperforms zero-shot methods and is on par with or exceeds finetuned methods in terms of subject fidelity and prompt compliance.

Introduction to BootPIG

Recent advances in generative models have been remarkable, particularly in text-to-image synthesis. While generative diffusion models can reliably create images from textual prompts, one area that requires further research is personalized image generation: the ability to generate images of specific objects within varied, user-defined contexts.

Salesforce AI Research introduces a novel architecture, termed BootPIG, aiming to achieve zero-shot subject-driven generation within pretrained diffusion models. The key innovation of BootPIG is its capacity to inject features from reference images into the generative process, enabling personalized image synthesis without the need for subject-specific finetuning.

Architecture Overview

BootPIG makes minimal modifications to a pretrained text-to-image diffusion model, pairing the original network (the Base UNet) with a second copy, the Reference UNet. The Reference UNet extracts visual features from reference images of the subject, and these features are fed into the attention layers of the Base UNet. Concretely, a novel Reference Self-Attention (RSA) mechanism replaces the standard self-attention layers within the diffusion model: each RSA layer computes queries from the Base UNet's own features while attending over the concatenation of those features with the reference features, allowing personalization cues to be incorporated into the generation process.
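To make the mechanism concrete, below is a minimal PyTorch sketch of an RSA-style layer written from the description above. The class name, projection layout, and dimensions are illustrative assumptions, not BootPIG's actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ReferenceSelfAttention(nn.Module):
    """Sketch of a Reference Self-Attention (RSA) layer (names are assumptions).

    Queries come from the Base UNet's hidden states; keys and values are
    computed over the concatenation of base and reference features, so the
    generation can attend to the reference subject's appearance.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (batch, n_tokens, dim)     features inside the Base UNet
        # ref: (batch, n_ref_tokens, dim) features from the Reference UNet
        kv = torch.cat([x, ref], dim=1)  # attend over base + reference tokens
        b, n, d = x.shape
        h, dh = self.num_heads, d // self.num_heads
        q = self.to_q(x).view(b, n, h, dh).transpose(1, 2)
        k = self.to_k(kv).view(b, -1, h, dh).transpose(1, 2)
        v = self.to_v(kv).view(b, -1, h, dh).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.to_out(out.transpose(1, 2).reshape(b, n, d))
```

Note that with no reference tokens concatenated, this layer reduces to ordinary self-attention, which is what makes swapping an RSA-style layer into a pretrained UNet a minimal change.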

The practicality of BootPIG lies in its bootstrapped training procedure. Instead of relying on a large curated dataset, BootPIG learns personalization from synthetic data: captions written by LLM chat agents, images synthesized by pretrained text-to-image models, and subject masks produced by image segmentation models. Notably, training completes in about an hour on 16 A100 GPUs, a substantial efficiency gain over existing zero-shot methods, which can require days of pretraining, as well as over test-time finetuning methods.
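A rough sketch of how one such synthetic training example might be assembled is shown below. The callables chat_agent, text_to_image, and segment_subject are hypothetical placeholders for the three pretrained components, not actual BootPIG APIs.

```python
def make_training_example(chat_agent, text_to_image, segment_subject):
    """Hypothetical bootstrapped-data step; every name here is a placeholder.

    An LLM writes a caption, a text-to-image model renders it, and a
    segmentation model isolates the subject so the masked crop can serve
    as the reference image for that target.
    """
    caption = chat_agent("Write a caption describing one object in a scene.")
    target = text_to_image(caption)           # synthetic target image (C, H, W)
    mask = segment_subject(target, caption)   # binary foreground mask (1, H, W)
    reference = target * mask                 # masked subject = reference input
    return reference, caption, target
```

Training then proceeds as in a standard diffusion setup: the Base UNet learns to denoise the target conditioned on the caption, while the reference image supplies the features injected through the RSA layers.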

Evaluation and Findings

BootPIG has been evaluated on the standard DreamBooth dataset. The experiments demonstrate that it significantly outperforms existing zero-shot methods in both subject fidelity and prompt fidelity, and that it is competitive with or surpasses test-time finetuning methods. These results are supported by an inference procedure that can leverage multiple reference images, adding detail and nuance to the final synthesis; a sketch follows.
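As a hedged illustration of how several references could feed the same RSA layers, one simple strategy is to extract Reference UNet features per image and average them per layer. This is a plausible sketch under that assumption, not a claim about BootPIG's exact aggregation rule; extract_features is a hypothetical helper returning a list of per-layer feature tensors.

```python
import torch

def pooled_reference_features(reference_unet, ref_images):
    # ref_images: preprocessed tensors showing the same subject.
    # Average each layer's reference features across all images before
    # injecting them into the Base UNet's RSA layers.
    per_image = [reference_unet.extract_features(img) for img in ref_images]
    return [torch.stack(layer).mean(dim=0) for layer in zip(*per_image)]
```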

Moreover, a user study confirms that BootPIG is consistently preferred over established methods in terms of subject fidelity and prompt compliance. These findings underscore BootPIG's potential for wide-ranging applications, including personalized storytelling and design, hinting at a new direction in subject-driven generation research.

Concluding Thoughts on BootPIG

The BootPIG architecture from Salesforce AI Research grants new capabilities to pretrained diffusion models, pushing forward the boundary of personalized image generation. By bootstrapping its training on synthetic data rather than relying on curated datasets, BootPIG stands out for both its efficiency and the quality of its output.

Given the rapid developments in AI image generation, BootPIG represents a significant step in the precision and customization available to users. It allows creators to imagine and visualize their own objects within any number of scenarios, all within a framework that preserves the original diffusion architecture and operates with notable compute efficiency.
