Tuning-Free Inversion-Enhanced Control for Consistent Image Editing (2312.14611v1)
Abstract: Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) on the main objects in the input image without changing their identity or attributes. To guarantee consistent attributes, some existing methods fine-tune the entire model or the textual embedding for structural consistency, but they are time-consuming and fail to perform non-rigid edits. Other works are tuning-free, but their performance is limited by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency of DDIM reconstruction. Specifically, our method obtains inversion features from the key and value features in the self-attention layers, and enhances the sampling process with these inversion features, thus achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines content from both the inversion and the naive DDIM editing processes. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces impressive results in various settings.
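The abstract describes the core mechanism as caching key/value features of the self-attention layers during DDIM inversion and reusing them to enhance the sampling (editing) pass. The sketch below illustrates that idea in a minimal, self-contained form; it is not the authors' implementation, and the module, cache, and mode names are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of inversion-enhanced self-attention:
# during DDIM inversion we cache each layer's key/value features, and during
# sampling we concatenate the cached features with the current ones so that
# queries can attend to the source-image content, which keeps the edit
# consistent with the input. All names here are illustrative assumptions.

import torch
import torch.nn.functional as F


class SelfAttentionWithInjection(torch.nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)
        self.to_out = torch.nn.Linear(dim, dim)

    def forward(self, x, cached_kv=None, mode="sample"):
        # x: (batch, tokens, dim) spatial features of a U-Net block
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        if mode == "invert":
            # Inversion pass: record K/V so they can be reused while sampling.
            cached_kv.append((k.detach(), v.detach()))
        elif cached_kv:
            # Sampling pass: concatenate cached inversion K/V with the current
            # K/V along the token axis (a simple form of attention concatenation).
            k_src, v_src = cached_kv.pop(0)
            k = torch.cat([k, k_src], dim=1)
            v = torch.cat([v, v_src], dim=1)

        def split(t):  # (b, n, d) -> (b, heads, n, d/heads)
            b, n, d = t.shape
            return t.view(b, n, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        b, h, n, dh = out.shape
        out = out.transpose(1, 2).reshape(b, n, h * dh)
        return self.to_out(out)


if __name__ == "__main__":
    attn = SelfAttentionWithInjection(dim=64)
    cache = []
    x_src = torch.randn(1, 16, 64)            # features seen during inversion
    _ = attn(x_src, cached_kv=cache, mode="invert")
    x_edit = torch.randn(1, 16, 64)           # features during editing/sampling
    y = attn(x_edit, cached_kv=cache)         # reuses the cached inversion K/V
    print(y.shape)                            # torch.Size([1, 16, 64])
```

The paper's mask-guided variant would additionally restrict where the injected source content applies (e.g., blending foreground from the inversion branch with background from the naive DDIM editing branch); the sketch above omits that masking step for brevity.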