
ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning (2404.15449v1)

Published 23 Apr 2024 in cs.CV and cs.AI

Abstract: The rapid development of diffusion models has triggered diverse applications. Identity-preserving text-to-image generation (ID-T2I) particularly has received significant attention due to its wide range of application scenarios like AI portrait and advertising. While existing ID-T2I methods have demonstrated impressive results, several key challenges remain: (1) It is hard to maintain the identity characteristics of reference portraits accurately, (2) The generated images lack aesthetic appeal especially while enforcing identity retention, and (3) There is a limitation that cannot be compatible with LoRA-based and Adapter-based methods simultaneously. To address these issues, we present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance. To resolve identity features lost, we introduce identity consistency reward fine-tuning to utilize the feedback from face detection and recognition models to improve generated identity preservation. Furthermore, we propose identity aesthetic reward fine-tuning leveraging rewards from human-annotated preference data and automatically constructed feedback on character structure generation to provide aesthetic tuning signals. Thanks to its universal feedback fine-tuning framework, our method can be readily applied to both LoRA and Adapter models, achieving consistent performance gains. Extensive experiments on SD1.5 and SDXL diffusion models validate the effectiveness of our approach. Project Page: https://idaligner.github.io/


Summary

  • The paper introduces a feedback learning framework using identity consistency and aesthetic rewards to improve identity preservation in text-to-image generation.
  • The methodology integrates face recognition and human preference data, demonstrating compatibility with both LoRA and Adapter models while reducing training time.
  • Quantitative results show improved Face Sim, DINO, CLIP-I, and LAION-Aes scores compared to competing methods across various diffusion model architectures.

Enhancing Identity-Preserving Text-to-Image Generation

Introduction

The paper "ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning" (2404.15449) addresses the challenges of identity-preserving text-to-image (ID-T2I) generation, focusing on maintaining identity characteristics, enhancing aesthetic appeal, and ensuring compatibility with both LoRA and Adapter methods. ID-T2I has vast applications, from AI portraits to advertising, necessitating accurate identity retention and pleasing aesthetics in generated images. Although existing methods like LoRA and IP-Adapter provide satisfactory personalization, they struggle with identity preservation and aesthetic quality, and lack a unified approach compatible with multiple frameworks. Figure 1

Figure 1: An overview of the ID-Aligner method incorporating feedback learning for identity preservation and aesthetic enhancement.

ID-Aligner proposes a feedback learning framework built on identity consistency rewards and identity aesthetic rewards. The former enforces accurate identity preservation using face detection and recognition models, while the latter draws on human-annotated preference data and structural feedback to promote visual appeal. Because the feedback fine-tuning is model-agnostic, it applies to both LoRA-based and Adapter-based architectures, and experiments on SD1.5 and SDXL show consistent gains in identity retention and aesthetics.

Methodology

ID-Aligner applies reward feedback learning to diffusion models to improve identity preservation and aesthetic quality concurrently.

Text-to-Image Diffusion Model

Diffusion models generate images by transforming Gaussian noise into structured data through iterative denoising. A pre-trained VAE encoder maps images to latent representations, and a UNet, conditioned on the text prompt, is trained to remove the noise added to these latents at each timestep. This standard denoising objective, written out below, forms the basis of ID-T2I frameworks.
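For reference, the latent-diffusion training objective the paragraph alludes to can be written in the usual noise-prediction form (this is the generic objective, not a formula specific to ID-Aligner):

\[
\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\Big[\,\big\|\epsilon - \epsilon_\theta(z_t, t, c)\big\|_2^2\,\Big],
\qquad z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,
\]

where \(z_0\) is the VAE latent of the training image, \(c\) the text-prompt embedding, \(t\) a sampled timestep, and \(\epsilon_\theta\) the UNet's noise prediction.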

Identity Reward

The identity reward system involves two components:

  • Identity Consistency Reward: face detection and recognition models extract embeddings from the reference face and from the face in the generated image; their cosine similarity serves as a direct feedback signal for identity alignment during fine-tuning.
  • Identity Aesthetic Reward: combines an appeal score learned from human-annotated preference data with a structure score learned from ControlNet-generated negative samples (Figure 2), guiding the model toward aesthetically pleasing, structurally sound images.

Figure 2: Construction of aesthetic feedback data using manual and automatic methods to improve image structure and appeal.

Both reward signals are fed back to the UNet during fine-tuning, so the model learns to preserve identity fidelity while improving visual aesthetics through systematic preference scoring.
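To make the reward computation concrete, below is a minimal PyTorch sketch of the identity consistency reward and a combined feedback loss. It assumes face embeddings for the reference and generated images are already available from a face detection and recognition pipeline; the helper names, weights `alpha`/`beta`, and the toy data are illustrative and not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def identity_consistency_reward(ref_face_emb: torch.Tensor,
                                gen_face_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between reference and generated face embeddings.

    Both tensors are assumed to be (batch, d) embeddings produced by a face
    recognition model applied to faces cropped by a face detector; the
    detector and encoder themselves are not shown here.
    """
    return F.cosine_similarity(ref_face_emb, gen_face_emb, dim=-1)

def feedback_loss(ref_face_emb: torch.Tensor,
                  gen_face_emb: torch.Tensor,
                  aesthetic_score: torch.Tensor,
                  alpha: float = 1.0,
                  beta: float = 0.5) -> torch.Tensor:
    """Combined feedback loss (a sketch, not the paper's exact formulation).

    Higher rewards are better, so the loss is their negated, weighted sum.
    `aesthetic_score` stands in for the identity aesthetic reward model's
    output (appeal + structure); alpha/beta are illustrative weights.
    """
    id_reward = identity_consistency_reward(ref_face_emb, gen_face_emb)
    return -(alpha * id_reward + beta * aesthetic_score).mean()

# Toy usage with random embeddings/scores. In practice the generated-face
# embedding comes from the image decoded at a late denoising step, so the
# gradient of this loss flows back into the UNet being fine-tuned.
ref = F.normalize(torch.randn(4, 512), dim=-1)
gen = F.normalize(torch.randn(4, 512, requires_grad=True), dim=-1)
aes = torch.rand(4)
loss = feedback_loss(ref, gen, aes)
loss.backward()
```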

Experimental Results

Extensive experiments validate ID-Aligner's effectiveness in both identity preservation and aesthetic enhancement.

Adapter and LoRA Models

In both SD1.5 and SDXL architectures, ID-Aligner outperforms existing methods (Figure 3). Notably, when integrated with LoRA models, ID-Aligner accelerates identity adaptation, reducing training time while increasing identity similarity and aesthetic quality, and demonstrating adaptability across model variants (Figure 4).

Figure 3: Comparison of identity conditional generation results showcasing enhanced identity retention and aesthetics with ID-Aligner.

Figure 4: Visual results from LoRA models highlighting accelerated training and improved identity preservation.

Quantitative Analysis

ID-Aligner achieves higher Face Sim., DINO, CLIP-I, and LAION-Aesthetics scores than competing methods, indicating improved identity alignment and aesthetic appeal, as summarized in the paper's quantitative results table. Its robustness across diverse prompts further demonstrates the generality of the approach.
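As an illustration of how one of these metrics is typically computed (not necessarily the paper's exact evaluation pipeline), a CLIP-I-style score is the cosine similarity between CLIP image embeddings of the reference and generated images; the file paths below are placeholders.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_i_score(reference_path: str, generated_path: str) -> float:
    """Cosine similarity between CLIP image embeddings of two images."""
    ref = preprocess(Image.open(reference_path)).unsqueeze(0).to(device)
    gen = preprocess(Image.open(generated_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        ref_feat = model.encode_image(ref)
        gen_feat = model.encode_image(gen)
    ref_feat = ref_feat / ref_feat.norm(dim=-1, keepdim=True)
    gen_feat = gen_feat / gen_feat.norm(dim=-1, keepdim=True)
    return (ref_feat * gen_feat).sum(dim=-1).item()

# Example usage (placeholder paths):
# score = clip_i_score("reference.png", "generated.png")
```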

Conclusion

The ID-Aligner framework effectively enhances identity-preserving text-to-image generation by integrating reward feedback learning. By addressing key challenges in identity retention and aesthetics, it provides a strong basis for future research on adaptive generative models and supports broader applications in domains such as portrait generation, virtual environments, and personalized advertising.
