Divide & Bind Your Attention for Improved Generative Semantic Nursing (2307.10864v3)
Abstract: Emerging large-scale text-to-image generative models, e.g., Stable Diffusion (SD), have exhibited impressive results with high fidelity. Despite this remarkable progress, current state-of-the-art models still struggle to generate images that fully adhere to the input prompt. Prior work, Attend & Excite, introduced the concept of Generative Semantic Nursing (GSN), which optimizes cross-attention at inference time to better incorporate the prompt semantics. It demonstrates promising results on simple prompts, e.g., "a cat and a dog". However, its efficacy declines on more complex prompts, and it does not explicitly address the problem of improper attribute binding. To handle complex prompts and scenarios involving multiple entities, and to achieve improved attribute binding, we propose Divide & Bind. We introduce two novel loss objectives for GSN: an attendance loss and a binding loss. Our approach faithfully synthesizes the desired objects with improved attribute alignment from complex prompts and exhibits superior performance across multiple evaluation benchmarks.
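To make the GSN idea concrete, the following is a minimal illustrative sketch of an attendance-style objective over per-token cross-attention maps, in the spirit of Attend & Excite: the loss is high when some subject token receives little attention anywhere, so gradient steps on the latent would push neglected tokens to be attended. The function name, the toy maps, and the max-based formulation are illustrative stand-ins; Divide & Bind's actual attendance and binding losses are defined differently in the paper.

```python
import numpy as np

def attendance_loss(attn_maps):
    """Attendance-style loss over per-token cross-attention maps.

    attn_maps: dict mapping a subject token to its (H, W) attention map,
    with values in [0, 1]. The loss focuses on the most-neglected token:
    it is large when some token is weakly attended everywhere.
    (Illustrative stand-in, not the Divide & Bind objective itself.)
    """
    return max(1.0 - float(m.max()) for m in attn_maps.values())

# Toy example for the prompt "a cat and a dog", where "dog" is neglected.
h = w = 16
cat_map = np.zeros((h, w)); cat_map[4, 4] = 0.9    # strongly attended
dog_map = np.zeros((h, w)); dog_map[10, 10] = 0.2  # neglected
loss = attendance_loss({"cat": cat_map, "dog": dog_map})
# loss = max(1 - 0.9, 1 - 0.2) = 0.8; during GSN, the latent would be
# updated by gradient descent on this loss at selected denoising steps.
```

In the actual GSN setting this scalar would be differentiated with respect to the diffusion latent, and the latent updated between denoising steps; this sketch only shows the shape of the objective.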
- eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
- Language models are few-shot learners. NeurIPS, 2020.
- Total variation in imaging. Handbook of mathematical methods in imaging, 1(2):3, 2015.
- Total variation image restoration: Overview and recent developments. Handbook of mathematical models in computer vision, 2006.
- Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
- Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models. In SIGGRAPH, 2023.
- Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Training-free structured diffusion guidance for compositional text-to-image synthesis. In ICLR, 2023.
- Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- Denoising diffusion probabilistic models. NeurIPS, 33, 2020.
- In-context learning for few-shot dialogue state tracking. In EMNLP, 2022a.
- In-context learning for few-shot dialogue state tracking. arXiv preprint arXiv:2203.08568, 2022b.
- TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering. arXiv preprint arXiv:2303.11897, 2023.
- Scaling up GANs for text-to-image synthesis. In CVPR, 2023.
- Diffusion models already have a semantic latent space. In ICLR, 2023.
- mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. In EMNLP, 2022a.
- LAVIS: A library for language-vision intelligence, 2022b.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022c.
- MagicMix: Semantic mixing with diffusion models. arXiv preprint arXiv:2210.16056, 2022.
- Compositional visual generation with composable diffusion models. In ECCV, 2022.
- LLMScore: Unveiling the power of large language models in text-to-image synthesis evaluation. arXiv preprint arXiv:2305.11116, 2023.
- A very preliminary analysis of DALL-E 2. arXiv preprint arXiv:2204.13807, 2022.
- Improved denoising diffusion probabilistic models. In ICML, 2021.
- Teaching CLIP to count to ten. arXiv preprint arXiv:2302.12066, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
- Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
- Denoising diffusion implicit models. In ICLR, 2021.
- Image segmentation via total variation and hypothesis testing methods. 2011.
- What the DAAM: Interpreting Stable Diffusion using cross attention. In ACL, 2023.
- DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. arXiv preprint arXiv:2210.14896, 2022.
- Scaling autoregressive models for content-rich text-to-image generation. TMLR, 2022.