Semantic Image Synthesis via Class-Adaptive Cross-Attention (2308.16071v3)

Published 30 Aug 2023 in cs.CV and cs.AI

Abstract: In semantic image synthesis, the state of the art is dominated by methods that use customized variants of the SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. Also, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE for learning shape-style correlations and thus conditioning the image generation process. Our model inherits the versatility of SPADE while achieving state-of-the-art generation quality and improved global and local style transfer. Code and models are available at https://github.com/TFonta/CA2SIS.
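The core idea in the abstract is to condition generation via cross-attention between mask-derived spatial features (queries) and per-class style embeddings (keys/values), so each pixel can attend to the style of any semantic class. The sketch below illustrates that mechanism in plain numpy; all names, shapes, and the random projections are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(mask_feats, style_tokens, d_k=32, rng=None):
    """Condition spatial features on per-class style tokens.

    mask_feats:   (N, C_q) features for N spatial positions,
                  encoded from the semantic mask (queries).
    style_tokens: (S, C_s) one style embedding per semantic class
                  (keys and values).
    Returns (N, d_k) style-conditioned spatial features.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    C_q, C_s = mask_feats.shape[1], style_tokens.shape[1]
    # Random projections stand in for learned weight matrices.
    W_q = rng.standard_normal((C_q, d_k)) / np.sqrt(C_q)
    W_k = rng.standard_normal((C_s, d_k)) / np.sqrt(C_s)
    W_v = rng.standard_normal((C_s, d_k)) / np.sqrt(C_s)
    Q, K, V = mask_feats @ W_q, style_tokens @ W_k, style_tokens @ W_v
    # (N, S) attention map: the learned shape-style correlations.
    attn = softmax(Q @ K.T / np.sqrt(d_k))
    return attn @ V
```

Because the attention map couples every spatial position to every class style token, a position can draw on global style statistics rather than only the modulation parameters of its own class, which is the limitation of SPADE the abstract describes.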

