ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems (2312.06573v2)

Published 11 Dec 2023 in cs.CV

Abstract: The field of image synthesis has made tremendous strides forward in recent years. Besides defining the desired output image with text prompts, an intuitive approach is to additionally use spatial guidance in the form of an image, such as a depth map. In state-of-the-art approaches, this guidance is realized by a separate controlling model that controls a pre-trained image generation network, such as a latent diffusion model. Understanding this process from a control system perspective shows that it forms a feedback-control system, where the control module receives a feedback signal from the generation process and sends a corrective signal back. When analysing existing systems, we observe that the feedback signals are sparse in time and carry only a small number of bits. As a consequence, there can be long delays between newly generated features and the respective corrective signals for these features. It is known that this delay is the most unwanted aspect of any control system. In this work, we take an existing controlling network (ControlNet) and change the communication between the controlling network and the generation process to be of high frequency and with large bandwidth. By doing so, we are able to considerably improve the quality of the generated images, as well as the fidelity of the control. Also, the controlling network needs noticeably fewer parameters and hence is about twice as fast during inference and training. Another benefit of small-sized models is that they help to democratise our field and are likely easier to understand. We call our proposed network ControlNet-XS. When comparing with the state-of-the-art approaches, we outperform them for pixel-level guidance, such as depth, canny edges, and semantic segmentation, and are on a par for loose keypoint guidance of human poses. All code and pre-trained models will be made publicly available.

Authors (3)
  1. Denis Zavadski (2 papers)
  2. Johann-Friedrich Feiden (1 paper)
  3. Carsten Rother (74 papers)

Summary

  • The paper introduces ControlNet-XS, an efficient control architecture that requires markedly fewer parameters while improving image fidelity and the speed of inference and training.
  • The training methodology employs zero-convolutions to preserve the generative capabilities of the pre-trained model, and performance is assessed with metrics such as CLIP-Score, LPIPS, and MSE-depth.
  • The work addresses semantic biases by minimizing the size of the controlling model, thereby reducing unintended influences on the generated output and promoting more responsible AI applications.

Introduction

In the field of text-to-image generation, the integration of intuitive spatial guidance through controlling networks has become pivotal for steering the output towards a desired image. A controlling network allows users to influence the image generation process not just with text prompts but also with guidance images such as sketches or depth maps; a typical usage pattern is sketched below. This paper introduces ControlNet-XS, a more efficient and effective successor to the well-known ControlNet, designed for controlling text-to-image diffusion models.
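To make the idea of spatial guidance concrete, the snippet below conditions Stable Diffusion on a depth map using the original ControlNet integration in Hugging Face diffusers (not ControlNet-XS itself); the model identifiers, prompt, and file names are illustrative assumptions.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth-conditioned ControlNet paired with a Stable Diffusion 1.5 backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The guidance image fixes the spatial layout; the text prompt fixes content and style.
depth_map = Image.open("depth.png")
image = pipe(
    "a cozy wooden cabin in a snowy forest",
    image=depth_map,
    num_inference_steps=30,
).images[0]
image.save("cabin.png")
```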

Improved Architecture and Performance

The proposed ControlNet-XS architecture stands out for requiring significantly fewer parameters than its predecessor while improving both image quality and control fidelity. Moreover, ControlNet-XS operates approximately twice as fast during inference and training, showcasing its efficiency. The paper analyses the delayed information transfer in existing controlling networks, where sparse, low-bandwidth feedback means corrective signals arrive long after the features they are meant to correct, and describes how ControlNet-XS mitigates this by exchanging information between the controlling and generating networks at high frequency and with large bandwidth.
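Viewed as a feedback-control system, the generator sends its intermediate features to the controlling network (feedback) and receives a corrective signal in return (control); ControlNet-XS makes this exchange happen within each encoder stage instead of after long delays. The following PyTorch sketch is one schematic reading of that loop; the block structure, channel sizes, and additive fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class CoupledEncoderStage(nn.Module):
    """One stage of a frozen generator encoder coupled to a small, trainable control encoder.

    Feedback (generator -> controller) and control (controller -> generator)
    are exchanged inside the same stage, so corrective signals act on the
    features that were just produced instead of arriving several blocks later.
    """

    def __init__(self, gen_ch: int, ctrl_ch: int):
        super().__init__()
        self.gen_block = nn.Conv2d(gen_ch, gen_ch, 3, padding=1)              # stand-in for a frozen UNet stage
        self.ctrl_block = nn.Conv2d(ctrl_ch + gen_ch, ctrl_ch, 3, padding=1)  # much smaller, trainable
        self.ctrl_to_gen = nn.Conv2d(ctrl_ch, gen_ch, 1)                      # corrective projection (zero-initialised in practice, see the training section)

    def forward(self, gen_feat: torch.Tensor, ctrl_feat: torch.Tensor):
        gen_feat = self.gen_block(gen_feat)
        # Feedback: the controller sees the freshly generated features immediately.
        ctrl_feat = self.ctrl_block(torch.cat([ctrl_feat, gen_feat], dim=1))
        # Control: the corrective signal is applied within the same stage.
        gen_feat = gen_feat + self.ctrl_to_gen(ctrl_feat)
        return gen_feat, ctrl_feat
```

Because the control branch only needs to produce corrections rather than re-model the image, its channel widths (ctrl_ch) can stay small, which is where the parameter and speed savings come from.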

Training Methodology and Evaluation

ControlNet-XS is trained on one million images and uses zero-convolutions to prevent the generative capabilities of the pre-trained generation network from being diminished at the start of training. Performance is evaluated with metrics such as CLIP-Score, Learned Perceptual Image Patch Similarity (LPIPS), and Mean Squared Error on depth (MSE-depth). ControlNet-XS outperforms competing approaches and shows that the controlling model can be shrunk substantially without significant performance losses.
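The zero-convolution idea is simple to express in code: the connection from the controlling network into the frozen generator is a convolution whose weights and bias start at zero, so at the first training step the controlled model behaves exactly like the unmodified pre-trained model. A minimal PyTorch sketch, with hypothetical channel sizes and variable names:

```python
import torch
import torch.nn as nn


def zero_conv(in_channels: int, out_channels: int) -> nn.Conv2d:
    """1x1 convolution whose weights and bias are initialised to zero."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


# Usage sketch: inject a corrective signal into frozen generator features.
ctrl_to_gen = zero_conv(in_channels=64, out_channels=320)   # channel sizes are hypothetical
generator_features = torch.randn(1, 320, 32, 32)
control_features = torch.randn(1, 64, 32, 32)

controlled = generator_features + ctrl_to_gen(control_features)
# At initialisation the zero-convolution outputs zeros, so the controlled
# features equal the original ones and the pre-trained behaviour is preserved.
assert torch.allclose(controlled, generator_features)
```

As training progresses, the weights move away from zero and the corrective signal is introduced gradually.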

Addressing Biases and Limitations

The research highlights the problem of semantic bias: a large controlling network can influence the generative model and induce unintended content in the output. ControlNet-XS addresses this by reducing the size of the control model, minimizing such bias while maintaining strong control. This approach resonates with the broader need to understand and address biases within AI-driven generative models.

Conclusion and Societal Impact

In conclusion, ControlNet-XS marks a significant advance in controlled text-to-image generation through efficient, high-frequency communication between the generative and controlling processes. With the code and pre-trained models made available, the work invites further innovation in the area. As these generative models advance, the paper acknowledges the societal implications, in particular the concerns around creating deepfakes, and highlights the necessity for ongoing research into misuse prevention and detection.
