
Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models (2402.17910v1)

Published 27 Feb 2024 in cs.CV

Abstract: While latent diffusion models (LDMs) excel at creating imaginative images, they often lack precision in semantic fidelity and spatial control over where objects are generated. To address these deficiencies, we introduce the Box-it-to-Bind-it (B2B) module, a novel training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models. B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance. The process encompasses two main steps: (i) object generation, which adjusts the latent encoding to guarantee that each object is generated and directed into its specified bounding box, and (ii) attribute binding, which ensures that generated objects adhere to the attributes specified for them in the prompt. B2B is designed as a plug-and-play module compatible with existing T2I models, markedly improving their performance on these key challenges. We evaluate our technique on the established CompBench and TIFA score benchmarks, demonstrating significant performance improvements over existing methods. The source code will be made publicly available at https://github.com/nextaistudio/BoxIt2BindIt.
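The abstract gives no implementation details, but the mechanism it describes (training-free updates to the latent that steer generation toward user-supplied bounding boxes) is in the same family as cross-attention guidance methods. Below is a minimal sketch of that general idea, not the authors' actual B2B implementation: the `unet` interface returning per-token attention maps, the `token_boxes` mapping, and the loss form are all illustrative assumptions.

```python
import torch

def box_mask(h: int, w: int, box: tuple) -> torch.Tensor:
    """Binary mask for a normalized (x0, y0, x1, y1) box on an h x w attention grid."""
    x0, y0, x1, y1 = box
    mask = torch.zeros(h, w)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def box_guidance_loss(attn: torch.Tensor, box: tuple) -> torch.Tensor:
    """Encourage one object token's attention mass to fall inside its box.

    attn: (h, w) cross-attention map for the object's token,
    assumed normalized so its entries sum to 1.
    """
    mask = box_mask(*attn.shape, box).to(attn.device)
    inside = (attn * mask).sum()  # attention mass inside the box
    # Low loss when most attention lands inside the box. This also counters
    # catastrophic neglect: if the object is not attended to at all,
    # `inside` stays near zero and the loss stays high.
    return (1.0 - inside) ** 2

def guided_denoise_step(latent, t, unet, text_emb, token_boxes, step_size=0.1):
    """One denoising step with training-free box guidance on the latent.

    `unet` is assumed (hypothetically) to return both the noise prediction and
    per-token cross-attention maps; real pipelines expose attention through
    hooks or attention processors instead.
    """
    latent = latent.detach().requires_grad_(True)
    noise_pred, attn_maps = unet(latent, t, text_emb)  # attn_maps: {token_idx: (h, w)}
    loss = sum(box_guidance_loss(attn_maps[tok], box)
               for tok, box in token_boxes.items())
    grad = torch.autograd.grad(loss, latent)[0]
    # Nudge the latent so the next prediction places objects inside their boxes.
    latent = (latent - step_size * grad).detach()
    return latent, noise_pred.detach()
```

Attribute binding could be handled in the same spirit, for example by adding a term that encourages an attribute token's attention map to overlap its object's box, though the paper's actual formulation may differ.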
