Getting it Right: Improving Spatial Consistency in Text-to-Image Models (2404.01197v2)
Abstract: One of the key shortcomings of current text-to-image (T2I) models is their inability to consistently generate images that faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that support algorithmic solutions to improve spatial reasoning in T2I models. We find that spatial relationships are under-represented in the image descriptions found in current vision-language datasets. To alleviate this data bottleneck, we create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets; through a 3-fold evaluation and analysis pipeline, we show that SPRIGHT substantially improves the proportion of spatial relationships over existing datasets. To demonstrate its efficacy, we show that fine-tuning on only $\sim$0.25% of SPRIGHT yields a 22% improvement in generating spatially accurate images, while also improving FID and CMMD scores. We further find that training on images containing a large number of objects leads to substantial improvements in spatial consistency: by fine-tuning on <500 such images, we achieve state-of-the-art results on T2I-CompBench, with a spatial score of 0.2133. Through a set of controlled experiments and ablations, we document additional findings that could support future work on the factors that affect spatial consistency in text-to-image models.
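The core of the SPRIGHT pipeline is re-captioning existing images with a vision-language model so that spatial relationships are stated explicitly. The snippet below is a minimal sketch of that step, not the authors' exact pipeline: the `llava-hf/llava-1.5-7b-hf` checkpoint and the prompt wording are illustrative assumptions.

```python
# Minimal sketch of spatially focused re-captioning with an open
# vision-language model. Assumptions (not from the paper): the
# llava-hf/llava-1.5-7b-hf checkpoint and this prompt wording.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # illustrative; the paper's captioner may differ

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Prompt the model to foreground spatial relationships between objects.
PROMPT = (
    "USER: <image>\nDescribe this image in detail, focusing on the spatial "
    "relationships (left/right, above/below, in front of/behind, near/far) "
    "between the objects you see. ASSISTANT:"
)

def spatial_caption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=PROMPT, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode the full sequence, then keep only the model's answer.
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()

print(spatial_caption("example.jpg"))
```

At scale, captions produced this way raise the proportion of explicit spatial terms (left, right, above, below, behind, and so on) relative to the datasets' original captions, which is the data bottleneck the abstract identifies.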
- Spatext: Spatio-textual representation for controllable image generation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2023.
- Introducing our multimodal models, 2023.
- Romain Beaumont. Clip retrieval: Easily compute clip embeddings and build a clip retrieval system with them. https://github.com/rom1504/clip-retrieval, 2022.
- Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021.
- Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023.
- Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023a.
- Training-free layout control with cross-attention guidance. arXiv preprint arXiv:2304.03373, 2023b.
- Pali: A jointly-scaled multilingual language-image model, 2023c.
- Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In ICCV, 2023.
- Effectively unbiased fid and inception score and where to find them, 2020.
- Dall·e mini, 2021.
- Cogview2: Faster and better text-to-image generation via hierarchical transformers, 2022.
- Investigating negation in pre-trained vision-and-language models. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 350–362, Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.
- Training-free structured diffusion guidance for compositional text-to-image synthesis, 2023a.
- Layoutgpt: Compositional visual planning and generation with large language models. arXiv preprint arXiv:2305.15393, 2023b.
- Can pre-trained text-to-image models generate visual goals for reinforcement learning?, 2023.
- Geneval: An object-focused framework for evaluating text-to-image alignment. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- Vqa-lol: Visual question answering under the lens of logic, 2020.
- Semantically distributed robust optimization for vision-and-language inference, 2022.
- Benchmarking spatial relationships in text-to-image generation, 2023.
- Prompt-to-prompt image editing with cross attention control, 2022.
- Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2018.
- T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation, 2023.
- Rethinking fid: Towards a better evaluation metric for image generation, 2024.
- Faithscore: Evaluating hallucinations in large vision-language models, 2023.
- Kandinsky community. Kandinsky, 2023.
- Bk-sdm: Architecturally compressed stable diffusion for efficient text-to-image generation. ICML Workshop on Efficient Systems for Foundation Models (ES-FoMo), 2023.
- Segment anything, 2023.
- Text-image alignment for diffusion-based perception, 2023.
- Similarity of neural network representations revisited, 2019.
- kuprel. min-dalle, 2022.
- Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023.
- Karlo-v1.0.alpha on coyo-100m and cc15m. https://github.com/kakaobrain/karlo, 2022.
- Gligen: Open-set grounded text-to-image generation. 2023a.
- Snapfusion: Text-to-image diffusion model on mobile devices within two seconds, 2023b.
- Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models, 2024.
- Microsoft coco: Common objects in context, 2015.
- Improved baselines with visual instruction tuning, 2023a.
- Compositional visual generation with composable diffusion models, 2023b.
- Decoupled weight decay regularization, 2019.
- Latent consistency models: Synthesizing high-resolution images with few-step inference, 2023a.
- Lcm-lora: A universal stable-diffusion acceleration module, 2023b.
- Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth, 2021.
- Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022.
- OpenAI. Dalle-3, 2023a.
- OpenAI. Gpt-4(v), 2023b.
- On aliased resizing and surprising subtleties in gan evaluation, 2022.
- Eclipse: A resource-efficient text-to-image prior for image generations, 2023.
- Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023.
- Learning transferable visual models from natural language supervision, 2021.
- Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
- Zero-shot text-to-image generation, 2021.
- Hierarchical text-conditional image generation with clip latents, 2022.
- High-resolution image synthesis with latent diffusion models, 2022.
- Photorealistic text-to-image diffusion models with deep language understanding, 2022.
- Laion-5b: An open large-scale dataset for training next generation image-text models, 2022.
- A picture is worth a thousand words: Principled recaptioning improves image generation, 2023.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
- Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation, 2023a.
- Paragraph-to-image generation with information-enriched diffusion model, 2023b.
- Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2945–2954, 2023.
- Reco: Region-controlled text-to-image generation, 2022.
- Coca: Contrastive captioners are image-text foundation models, 2022.
- Ifseg: Image-free semantic segmentation via vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2967–2977, 2023.
- Adding conditional control to text-to-image diffusion models, 2023a.
- Controllable text-to-image generation with gpt-4, 2023b.
- Recognize anything: A strong image tagging model, 2023c.
- Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153, 2023.
- Multi-lora composition for image generation, 2024.