
SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing (2401.03433v1)

Published 7 Jan 2024 in cs.CV

Abstract: Text-conditional image editing based on large diffusion generative models has attracted attention from both industry and the research community. Most existing methods are non-reference editing: the user can provide only a source image and a text prompt, which restricts control over the characteristics of the editing outcome. To increase user freedom, we propose a new task called Specific Reference Condition Real Image Editing, which allows the user to provide a reference image to further control the outcome, such as replacing an object with a particular one. To accomplish this, we propose a fast baseline method named SpecRef. Specifically, we design a Specific Reference Attention Controller to incorporate features from the reference image, and adopt a mask mechanism to prevent interference between the editing and non-editing regions. We evaluate SpecRef on typical editing tasks and show that it achieves satisfactory performance. The source code is available at https://github.com/jingjiqinggong/specp2p.
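
The abstract only sketches the mechanism. Below is a minimal illustrative sketch, not the authors' implementation, of how a masked, reference-conditioned self-attention step could work, assuming a Prompt-to-Prompt-style hook into the attention layers of a diffusion U-Net; all function and variable names here are hypothetical.

# Hypothetical sketch of a masked reference-attention step (PyTorch).
# Inside the edit mask, queries attend to keys/values cached from the
# reference image; outside it, attention runs on the source features,
# so the two regions do not interfere.
import torch
import torch.nn.functional as F

def masked_reference_attention(q_edit, k_edit, v_edit,
                               k_ref, v_ref, edit_mask, scale):
    """q_edit, k_edit, v_edit: (B, N, d) projections of the edited image.
    k_ref, v_ref:              (B, N, d) projections cached from the reference.
    edit_mask:                 (B, N, 1) binary mask, 1 inside the edit region.
    scale:                     attention scaling factor, typically d ** -0.5.
    """
    # Attention against the source features (governs the non-edit region).
    attn_src = F.softmax(q_edit @ k_edit.transpose(-1, -2) * scale, dim=-1)
    out_src = attn_src @ v_edit

    # Attention against the reference features (governs the edit region).
    attn_ref = F.softmax(q_edit @ k_ref.transpose(-1, -2) * scale, dim=-1)
    out_ref = attn_ref @ v_ref

    # Blend the two outputs with the mask so editing stays localized.
    return edit_mask * out_ref + (1.0 - edit_mask) * out_src

In a full pipeline, k_ref and v_ref would be cached while denoising (or inverting) the reference image, and edit_mask would come from a user-supplied or segmented object mask; this sketch only shows the blending step the abstract alludes to.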

Authors (2)
  1. Songyan Chen (3 papers)
  2. Jiancheng Huang (22 papers)
Citations (4)
