SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing (2401.03433v1)
Abstract: Text-conditional image editing based on large diffusion generative models has attracted the attention of both industry and the research community. Most existing methods are non-reference editing: the user can provide only a source image and a text prompt. However, this restricts the user's control over the characteristics of the editing outcome. To increase user freedom, we propose a new task called Specific Reference Condition Real Image Editing, which allows the user to provide a reference image to further control the outcome, such as replacing an object with a particular one. To accomplish this, we propose a fast baseline method named SpecRef. Specifically, we design a Specific Reference Attention Controller to incorporate features from the reference image, and adopt a mask mechanism to prevent interference between the editing and non-editing regions. We evaluate SpecRef on typical editing tasks and show that it achieves satisfactory performance. The source code is available at https://github.com/jingjiqinggong/specp2p.
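To make the described mechanism concrete, below is a minimal sketch of how a reference-conditioned attention step with mask-based blending could look in PyTorch. It is an illustration only: the function name `specref_style_attention`, the tensor layout, and the linear blending rule are assumptions made for exposition, not the paper's actual Specific Reference Attention Controller.

```python
# Illustrative sketch (NOT the authors' implementation) of attention that
# blends features attended from the edited image with features attended
# from a reference image, gated by a spatial mask.
import torch


def attention(q, k, v, scale):
    # Standard scaled dot-product attention over flattened spatial tokens.
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v


def specref_style_attention(q_edit, k_edit, v_edit, k_ref, v_ref, mask, scale):
    """Blend edit-branch and reference-branch attention outputs.

    q_edit, k_edit, v_edit: (B, N, C) queries/keys/values of the edited image.
    k_ref, v_ref:           (B, N, C) keys/values cached from the reference image.
    mask:                   (B, N, 1), 1 inside the editing region, 0 outside.
    """
    out_edit = attention(q_edit, k_edit, v_edit, scale)  # keeps source content
    out_ref = attention(q_edit, k_ref, v_ref, scale)     # pulls reference appearance
    # Inside the mask, use reference-driven features; outside, keep the
    # original features so the non-editing region is left untouched.
    return mask * out_ref + (1.0 - mask) * out_edit


# Toy usage with random tensors, just to check shapes.
B, N, C = 1, 64, 32
scale = C ** -0.5
q, k, v = (torch.randn(B, N, C) for _ in range(3))
k_r, v_r = (torch.randn(B, N, C) for _ in range(2))
m = (torch.rand(B, N, 1) > 0.5).float()
out = specref_style_attention(q, k, v, k_r, v_r, m, scale)
print(out.shape)  # torch.Size([1, 64, 32])
```

The mask-gated blend mirrors the abstract's stated goal: reference features influence only the editing region, while the complementary region is reconstructed from the source image's own attention.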
- Authors: Songyan Chen, Jiancheng Huang