ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance (2405.17532v1)

Published 27 May 2024 in cs.CV

Abstract: Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, these methods tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (e.g., the headphone is missing when generating "a <sks> dog wearing a headphone"). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (e.g., "a dog wearing a headphone"), implying that the compositional ability disappears only after personalization tuning. Inspired by this observation, we present ClassDiffusion, a simple technique that leverages a semantic preservation loss to explicitly regulate the concept space when learning the new concept. Despite its simplicity, this helps avoid semantic drift when fine-tuning on the target concepts. Extensive qualitative and quantitative experiments demonstrate that the semantic preservation loss effectively improves the compositional abilities of fine-tuned models. In response to the ineffective evaluation of the CLIP-T metric, we introduce the BLIP2-T metric, a more equitable and effective evaluation metric for this domain. We also provide an in-depth empirical study and theoretical analysis to better understand the role of the proposed loss. Lastly, we extend ClassDiffusion to personalized video generation, demonstrating its flexibility.
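
The semantic preservation loss described above regularizes personalization tuning so that the learned concept stays close to its superclass in the text-encoder feature space. The snippet below is a minimal sketch of one plausible formulation, assuming the loss is a cosine-distance penalty between pooled CLIP text features of the personalized prompt (e.g. "a <sks> dog") and the plain class prompt (e.g. "a dog"); the model name, pooling choice, and loss weighting are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

def pooled_text_features(prompt: str) -> torch.Tensor:
    """Mean-pool the text encoder's token features into one sentence-level vector."""
    ids = tokenizer(prompt, padding="max_length",
                    max_length=tokenizer.model_max_length,
                    truncation=True, return_tensors="pt").input_ids.to(device)
    return text_encoder(ids)[0].mean(dim=1)  # shape [1, hidden_dim]

def semantic_preservation_loss(personalized_prompt: str, class_prompt: str) -> torch.Tensor:
    """Cosine-distance penalty keeping the personalized prompt's features close to
    the plain class prompt's features (hypothetical formulation for illustration)."""
    feats_new = pooled_text_features(personalized_prompt)    # e.g. "a photo of a <sks> dog"
    with torch.no_grad():                                    # treat the class anchor as fixed
        feats_class = pooled_text_features(class_prompt)     # e.g. "a photo of a dog"
    return 1.0 - F.cosine_similarity(feats_new, feats_class, dim=-1).mean()

# During personalization tuning this regularizer would be added to the usual
# denoising objective, e.g. total = denoise_loss + lambda_spl * semantic_preservation_loss(...)
```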

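The BLIP2-T metric mentioned in the abstract scores image-text alignment with BLIP-2 features in place of CLIP's. A rough sketch of how such a score could be computed with the LAVIS image-text-matching checkpoint follows; the checkpoint name and the use of the contrastive (ITC) head are assumptions for illustration, not the paper's exact evaluation code.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLIP-2 image-text matching model from LAVIS (checkpoint choice is an assumption).
model, vis_processors, text_processors = load_model_and_preprocess(
    name="blip2_image_text_matching", model_type="pretrain", is_eval=True, device=device)

def blip2_t_score(image_path: str, prompt: str) -> float:
    """Image-text similarity from BLIP-2's contrastive (ITC) head,
    used analogously to CLIP-T for text-alignment evaluation."""
    image = vis_processors["eval"](Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    text = text_processors["eval"](prompt)
    with torch.no_grad():
        score = model({"image": image, "text_input": text}, match_head="itc")
    return score.item()
```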