
CoIE: Chain-of-Instruct Editing for Multi-Attribute Face Manipulation (2312.07879v2)

Published 13 Dec 2023 in cs.CV and cs.AI

Abstract: Current text-to-image editing models often struggle to manipulate multiple attributes smoothly with a single instruction. Taking inspiration from the Chain-of-Thought prompting technique used in LLMs, we present Chain-of-Instruct Editing (CoIE), which enhances the capabilities of these models through step-by-step editing using a series of instructions. In particular, in the context of face manipulation, we leverage the in-context learning abilities of a pretrained LLM, such as GPT-4, to generate a sequence of instructions from the original input, using a purpose-designed 1-shot template. To further improve the precision of each editing step, we fine-tune the editing models on our self-constructed instruction-guided face editing dataset, Instruct-CelebA. Additionally, we incorporate a super-resolution module to mitigate the loss of editability and the quality degradation. Experimental results across various challenging cases confirm a significant boost in multi-attribute facial image manipulation using chain-of-instruct editing. This is evident in enhanced editing success rates, measured by the CLIPSim and Coverage metrics, which improved by 17.86% and 85.45% respectively, and in heightened controllability, indicated by the Preserve L1 and Quality metrics, which improved by 11.58% and 4.93% respectively.
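The step-by-step flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`chain_of_instruct_edit`, `naive_decompose`) and the callable parameters are hypothetical, the string-splitting decomposer is only a stand-in for the LLM with its 1-shot template, and applying super-resolution after every step is an assumption, since the abstract does not say where in the chain the module runs.

```python
import re
from typing import Callable, List, TypeVar

Image = TypeVar("Image")


def naive_decompose(instruction: str) -> List[str]:
    """Hypothetical stand-in for the LLM: split a compound instruction
    into single-attribute steps on commas and the word 'and'."""
    parts = re.split(r"\s*(?:,|\band\b)\s*", instruction)
    return [p for p in parts if p]


def chain_of_instruct_edit(
    image: Image,
    instruction: str,
    decompose: Callable[[str], List[str]],
    edit: Callable[[Image, str], Image],
    upscale: Callable[[Image], Image],
) -> Image:
    """Apply one single-attribute edit at a time instead of one compound edit.

    decompose: maps the full instruction to an ordered list of sub-instructions
               (an LLM in the paper; any callable here).
    edit:      an instruction-guided editing model, called once per step.
    upscale:   a super-resolution step; running it after every edit is an
               assumption of this sketch.
    """
    current = image
    for step in decompose(instruction):
        current = edit(current, step)
        current = upscale(current)
    return current
```

Passing the editing and super-resolution models in as callables keeps the sketch independent of any particular library; in practice `edit` would wrap an instruction-guided editing model and `upscale` a super-resolution model.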
