
Training-Free Layout Control with Cross-Attention Guidance (2304.03373v2)

Published 6 Apr 2023 in cs.CV

Abstract: Recent diffusion-based generators can produce high-quality images from textual prompts. However, they often disregard textual instructions that specify the spatial layout of the composition. We propose a simple approach that achieves robust layout control without the need for training or fine-tuning of the image generator. Our technique manipulates the cross-attention layers that the model uses to interface textual and visual information and steers the generation in the desired direction given, e.g., a user-specified layout. To determine how to best guide attention, we study the role of attention maps and explore two alternative strategies, forward and backward guidance. We thoroughly evaluate our approach on three benchmarks and provide several qualitative examples and a comparative analysis of the two strategies that demonstrate the superiority of backward guidance compared to forward guidance, as well as prior work. We further demonstrate the versatility of layout guidance by extending it to applications such as editing the layout and context of real images.


Summary

  • The paper introduces a training-free method for precise layout control in text-to-image diffusion models using cross-attention guidance.
  • It employs forward guidance by directly modulating attention maps and backward guidance by optimizing a loss function to adjust latent representations.
  • Experimental results indicate significantly enhanced spatial fidelity and layout adherence on benchmarks like VISOR, COCO 2014, and Flickr30K.

Training-Free Layout Control with Cross-Attention Guidance

Introduction

The paper "Training-Free Layout Control with Cross-Attention Guidance" introduces a method for controlling layout in text-to-image generators, specifically diffusion-based models such as Stable Diffusion, without any additional training or fine-tuning. The method manipulates the model's cross-attention layers so that the spatial layout of the generated image follows a user-specified layout, typically given as bounding boxes that say where each object should appear. Two strategies are studied: forward guidance, which directly edits the attention maps to bias the layout, and backward guidance, which defines a loss on the attention maps and uses backpropagation to update the latent.

Methodology

Stable Diffusion Overview

Stable Diffusion operates in latent space, converting text prompts into images through a sequence of denoising steps. The model's text encoder maps input prompts into token vectors that embed spatial and semantic information, influencing image generation through cross-attention layers. These layers modulate how visual and textual data interact, controlling how spatial features in the latent space map to components of the textual prompt.
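For reference, cross-attention in its standard scaled dot-product form can be written as below. The symbols are notation chosen for this summary rather than quoted from the paper: $Q$ holds queries projected from the spatial features of the latent, $K$ holds keys projected from the text token embeddings, $d$ is the key dimension, and $A_{ui}$ is the attention that spatial location $u$ pays to token $i$.

```latex
A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right),
\qquad
\sum_{i} A_{ui} = 1 \quad \text{for every spatial location } u .
```

Fixing a token $i$ and reading $A_{ui}$ over all locations $u$ gives the per-token attention map that layout guidance acts on.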

Forward and Backward Guidance

Forward guidance imposes predefined spatial biases on the cross-attention maps for specific text tokens, directly influencing subsequent denoising iteration outcomes. However, its simplistic mechanism may fail in the presence of complex inter-token semantic dependencies, such as those involving start ([SoT]) and padding ([EoT]) tokens which also carry layout-relevant information.
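The idea of biasing an attention map toward a box can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `box_mask` and `forward_guide`, the hard binary window, and the `strength` parameter are assumptions made for illustration, and the map is a plain nested list rather than a tensor inside the U-Net.

```python
def box_mask(height, width, box):
    """Binary mask that is 1.0 inside box = (x0, y0, x1, y1), given as fractions of the map size."""
    x0, y0, x1, y1 = box
    c0, c1 = int(x0 * width), int(x1 * width)
    r0, r1 = int(y0 * height), int(y1 * height)
    return [[1.0 if (r0 <= r < r1 and c0 <= c < c1) else 0.0
             for c in range(width)] for r in range(height)]


def forward_guide(attn, mask, strength):
    """Blend an attention map toward a box-shaped target that carries the same total mass.

    strength = 0 leaves the map unchanged; strength = 1 replaces it with the target.
    """
    total = sum(map(sum, attn))
    area = sum(map(sum, mask))
    target = [[m * total / area for m in row] for row in mask]
    return [[(1.0 - strength) * a + strength * t for a, t in zip(arow, trow)]
            for arow, trow in zip(attn, target)]


# Hypothetical 4x4 attention map for one token, initially uniform.
attn = [[1.0 / 16] * 4 for _ in range(4)]
mask = box_mask(4, 4, (0.0, 0.0, 0.5, 0.5))  # top-left quadrant
guided = forward_guide(attn, mask, strength=0.8)
```

Because the target carries the same total mass as the original map, the blend shifts attention into the box without changing how much attention the token receives overall. It only touches the tokens it is applied to, which is the limitation the next paragraph addresses.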

Backward guidance addresses these limitations by defining an energy function over the attention maps and minimizing it. The gradient of this energy is backpropagated to the latent, which is updated iteratively; because the latent feeds every cross-attention layer, each update shifts the attention maps of all tokens, not only the explicitly guided ones. This achieves layout control even under complex compositional requirements.

Figure 1: Overview of the two layout guidance strategies. The cross-attention map for a chosen word token is marked with a red border. In forward guidance, the cross-attention maps of the word, start and padding tokens are biased spatially. In backward guidance, we compute instead a loss function and perform backpropagation during the inference process to optimize the latent.

Implementation Details

Algorithmic Workflow

For backward guidance, the cross-attention layers of Stable Diffusion are selected strategically, often focusing on layers most crucial for semantic coherence in the upsampling branch. The backward approach applies a loss across these attention maps, guiding their evolution by iteratively updating latent variables at key steps of the denoising process, generally early in the generation phase.
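The iterative latent update can be sketched on a toy problem. Everything here is a stand-in chosen so the example runs without a diffusion model: the "latent" is a short list of numbers, the "attention map" is simply its softmax, the gradient is estimated by finite differences instead of backpropagation through the U-Net, and `step_size` and `steps` are arbitrary illustrative values.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def energy(latent, box):
    """Squared shortfall of attention mass inside the box (toy 1D version)."""
    attn = softmax(latent)  # toy stand-in for a U-Net cross-attention map
    inside = sum(a for a, in_box in zip(attn, box) if in_box)
    return (1.0 - inside) ** 2


def numeric_grad(latent, box, eps=1e-5):
    """Central finite-difference gradient of the energy w.r.t. the latent."""
    grad = []
    for i in range(len(latent)):
        up = latent[:i] + [latent[i] + eps] + latent[i + 1:]
        down = latent[:i] + [latent[i] - eps] + latent[i + 1:]
        grad.append((energy(up, box) - energy(down, box)) / (2 * eps))
    return grad


def backward_guide(latent, box, step_size=5.0, steps=20):
    """Gradient descent on the latent; returns the final latent and the energy history."""
    history = [energy(latent, box)]
    for _ in range(steps):
        grad = numeric_grad(latent, box)
        latent = [z - step_size * g for z, g in zip(latent, grad)]
        history.append(energy(latent, box))
    return latent, history


box = [True, True, False, False, False, False, False, False]
final_latent, history = backward_guide([0.0] * 8, box)
```

In the actual method the gradient step would be interleaved with the denoising steps and applied only during the early part of sampling, as described above; the toy loop only shows how descending the energy moves attention mass into the box.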

Loss Function Design

The loss function aims to align attention maps with the specified layout, using bounding box constraints that define the expected token spatial regions. This loss is computed over a predefined iteration range and backpropagated to adjust latent vector representations, thereby steering the model outputs closer to the specified layout intent.
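One way to express such a loss is to penalize, for each guided token, the fraction of its attention mass that falls outside its bounding box. The sketch below follows that idea under stated assumptions: the squared-shortfall form, the function names, the dictionary-based interface, and the pixel-index box convention are all illustrative choices, not the authors' code.

```python
def box_mass_fraction(attn, box):
    """Fraction of a 2D attention map's mass inside box = (row0, col0, row1, col1), end-exclusive."""
    r0, c0, r1, c1 = box
    inside = sum(attn[r][c] for r in range(r0, r1) for c in range(c0, c1))
    total = sum(map(sum, attn))
    return inside / total


def layout_loss(attn_maps, boxes):
    """Sum over guided tokens of (1 - fraction of attention inside the token's box)^2."""
    return sum((1.0 - box_mass_fraction(attn_maps[tok], boxes[tok])) ** 2
               for tok in boxes)


# Hypothetical 4x4 maps: "cat" is spread uniformly, "dog" already sits inside its box.
uniform = [[1.0 / 16] * 4 for _ in range(4)]
focused = [[0.0] * 4 for _ in range(4)]
focused[3][3] = 1.0

attn_maps = {"cat": uniform, "dog": focused}
boxes = {"cat": (0, 0, 2, 2), "dog": (2, 2, 4, 4)}
loss = layout_loss(attn_maps, boxes)
```

A token whose attention lies entirely inside its box contributes zero, so in this example only the "cat" token is penalized. In practice the same quantity would be accumulated over the selected cross-attention layers before backpropagating to the latent.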

Experimental Evaluation

The approach was evaluated on several benchmarks, including VISOR, which measures a model's spatial understanding by checking whether the spatial relations stated in the prompt are depicted correctly. Compared to existing models such as GLIDE and DALL-E, backward guidance adheres more closely to spatial instructions, improving layout fidelity metrics without compromising overall image quality. Additional evaluations on the COCO 2014 and Flickr30K datasets show improvements in both spatial control and generative quality, as indicated by better FID and mAP scores.

Figure 2: Cross-attention maps during forward and backward guidance. Spatial dependencies between different words negatively affect forward guidance, while backward guidance softly encourages all dependent tokens to match the desired layout.

Comparative Analysis

Backward guidance addresses the limitations of forward guidance by implicitly adjusting tokens that are not explicitly controlled, which compensates for the semantic overlap between tokens introduced by the text encoder. The benefit is most visible in prompts with complex inter-object relationships or atypical compositional syntax. Backward guidance costs more computation because of its iterative updates, but it is the more robust mechanism for layout adherence in the final images.

Real-World Applications and Extensions

Beyond text-to-image generation, the technique extends to real-image editing. Combined with specialized tokens such as those learned by Textual Inversion, which capture the identity of the subject in a real image, layout guidance lets users change the layout and context of that subject while preserving its identity.

Conclusion

The paper highlights the role of cross-attention in determining the layout of generated images. Backward guidance provides a practical solution to a key limitation of text-to-image generators, enabling spatial control without any training or fine-tuning. Future work could explore automatic bounding box generation or extend these principles to other generative domains with similar spatial constraints, such as 3D content synthesis or video generation.

Figure 3: Comparison between forward and backward guidance, including guidance of start and padding tokens.
