Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models (2311.16117v2)
Abstract: Diffusion models have achieved remarkable results in generating high-quality, diverse, and creative images. In text-based image generation, however, they often fail to capture the meaning intended by the text: a specified object may not be generated, an unnecessary object may be generated, or an adjective may alter objects it was not intended to modify. Moreover, we found that relationships indicating possession between objects are often overlooked. Users' intentions in text are diverse, yet existing methods tend to address only some of these failure modes. In this paper, we propose Predicated Diffusion, a unified framework for expressing users' intentions. We consider the root of the above issues to lie in the text encoder, which often focuses only on individual words and neglects the logical relationships between them. The proposed method does not rely solely on the text encoder; instead, it represents the intended meaning of the text as propositions in predicate logic and treats the pixels in the attention maps as fuzzy predicates. This yields a differentiable loss function whose minimization makes the image fulfill the proposition. In comparisons with several existing methods, Predicated Diffusion generated images that were more faithful to various text prompts, as verified by human evaluators and pretrained image-text models.
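The idea of treating attention-map pixels as fuzzy predicates and turning a proposition into a loss can be illustrated with a minimal sketch. The abstract does not say which fuzzy logic or which propositions are used, so everything below is an assumption made for illustration: product fuzzy logic (negation as 1 - p, conjunction as a product), the two example propositions, and the hypothetical function names `existence_loss` and `implication_loss`. Attention maps are represented as flat lists of values in [0, 1].

```python
import math

EPS = 1e-12  # numerical floor so that log() stays finite


def existence_loss(attn):
    """Loss for the proposition "there exists a pixel x with P(x)".

    Assumption: product fuzzy logic, where
    exists x. P(x) = 1 - prod_x (1 - P(x)).
    The loss is the negative log of that truth value.
    """
    none_present = math.prod(1.0 - a for a in attn)
    return -math.log(max(1.0 - none_present, EPS))


def implication_loss(attn_p, attn_q):
    """Loss for the proposition "for all pixels x, P(x) implies Q(x)".

    Assumption: P -> Q is read as not(P and not Q) = 1 - P * (1 - Q),
    and "for all" is a product over pixels, so the negative log
    becomes a sum over pixels.
    """
    return -sum(
        math.log(max(1.0 - p * (1.0 - q), EPS))
        for p, q in zip(attn_p, attn_q)
    )


# Hypothetical flattened attention maps for the words "dog" and "black".
attn_dog = [0.05, 0.10, 0.80, 0.70]
attn_black = [0.02, 0.05, 0.60, 0.75]

# Low when at least one pixel attends strongly to "dog".
print(existence_loss(attn_dog))
# Low when every pixel attending to "black" also attends to "dog".
print(implication_loss(attn_black, attn_dog))
```

Both functions are built only from products, differences, and logarithms, so in an automatic-differentiation framework the same expressions would be differentiable with respect to the attention values, which is the property the abstract relies on for guidance.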