ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation (2306.00971v2)
Abstract: Personalized text-to-image generation using diffusion models has recently emerged and garnered significant interest. The task distills a novel concept (e.g., a unique toy), illustrated in a handful of images, into a generative model that captures fine visual details and generates photorealistic images conditioned on textual embeddings. In this paper, we present ViCo, a novel lightweight plug-and-play method that seamlessly integrates visual conditioning into personalized text-to-image generation. ViCo stands out in that it requires no fine-tuning of the original diffusion model parameters, thereby facilitating more flexible and scalable model deployment. This key advantage distinguishes ViCo from most existing models that necessitate partial or full diffusion fine-tuning. ViCo incorporates an image attention module that conditions the diffusion process on patch-wise visual semantics, and an attention-based object mask that comes at no extra cost from the attention module. Despite training only a small set of parameters (~6% of the diffusion U-Net), ViCo delivers performance that is on par with, or even surpasses, all state-of-the-art methods, both qualitatively and quantitatively. This underscores the efficacy of ViCo, making it a highly promising solution for personalized text-to-image generation without the need for diffusion model fine-tuning. Code: https://github.com/haoosz/ViCo
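To make the mechanism described above concrete, here is a minimal sketch of a trainable image cross-attention block that could be bolted onto a frozen diffusion U-Net, with a rough object mask read off the same attention weights. This is an illustration under stated assumptions, not the paper's implementation: the module name, tensor dimensions, choice of patch encoder, and the fixed 0.5 threshold for the mask are all hypothetical; the released code at the link above is authoritative.

```python
# Sketch: cross-attention from U-Net latent tokens to reference-image patch
# tokens, plus a crude attention-derived foreground mask over the patches.
# Dimensions and the thresholding scheme are illustrative assumptions.
import torch
import torch.nn as nn


class ImageCrossAttention(nn.Module):
    """Latent tokens (queries) attend to reference-image patch tokens (keys/values)."""

    def __init__(self, latent_dim: int, image_dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = latent_dim // num_heads
        self.to_q = nn.Linear(latent_dim, latent_dim, bias=False)
        self.to_k = nn.Linear(image_dim, latent_dim, bias=False)
        self.to_v = nn.Linear(image_dim, latent_dim, bias=False)
        self.to_out = nn.Linear(latent_dim, latent_dim)

    def forward(self, latent_tokens, image_tokens):
        # latent_tokens: (B, N, latent_dim) flattened U-Net feature map
        # image_tokens:  (B, M, image_dim)  patch embeddings of the reference image
        B, N, _ = latent_tokens.shape
        q, k, v = self.to_q(latent_tokens), self.to_k(image_tokens), self.to_v(image_tokens)

        def split(x):  # (B, T, D) -> (B, heads, T, head_dim)
            return x.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)

        # A rough object mask "for free": how much attention each reference patch
        # receives, averaged over heads and latent queries, then thresholded.
        # (The paper's actual masking scheme may differ; 0.5 is an assumption.)
        saliency = attn.mean(dim=(1, 2))                                    # (B, M)
        saliency = saliency / saliency.amax(dim=-1, keepdim=True).clamp(min=1e-8)
        mask = (saliency > 0.5).float()                                     # (B, M)

        return self.to_out(out), mask


if __name__ == "__main__":
    block = ImageCrossAttention(latent_dim=320, image_dim=768)
    latents = torch.randn(2, 64 * 64, 320)   # e.g. a 64x64 U-Net feature map
    patches = torch.randn(2, 257, 768)       # e.g. ViT patch tokens of the reference image
    out, mask = block(latents, patches)
    print(out.shape, mask.shape)             # torch.Size([2, 4096, 320]) torch.Size([2, 257])
```

In this reading of the abstract, only blocks like the one above are trained while the diffusion U-Net stays frozen, which is what keeps the trainable parameter count small and the method plug-and-play.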
Authors: Shaozhe Hao, Kai Han, Shihao Zhao, Kwan-Yee K. Wong