
Training-Free Consistent Text-to-Image Generation

(2402.03286)
Published Feb 5, 2024 in cs.CV, cs.AI, cs.GR, and cs.LG

Abstract

Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.

Overview

  • The paper introduces 'ConsiStory', a novel, training-free method for generating visually consistent images from text prompts without optimization or pre-training.

  • ConsiStory utilizes techniques like shared self-attention and feature injection within diffusion models to maintain subject consistency across different images.

  • The method offers real-time generation that is significantly faster than previous state-of-the-art methods and remains effective in multi-subject scenarios.

  • The authors demonstrate the superiority of ConsiStory through qualitative and quantitative evaluations, highlighting its prompt-alignment, subject consistency, and potential for practical applications.

Background

Advancements in generative AI, particularly large-scale text-to-image (T2I) diffusion models, have made it possible to create imaginative scenes from textual descriptions. Despite this creative potential, consistently portraying the same subject across varying prompts remains a significant challenge. Traditional approaches, such as per-subject fine-tuning or image conditioning, typically demand substantial computational resources and struggle to maintain multi-subject consistency without trading off prompt alignment.

Introducing ConsiStory

In the paper under discussion, the authors present "ConsiStory," a training-free method that generates visually consistent subjects across multiple prompts without optimization or pre-training. By sharing internal feature representations during diffusion-based image generation, ConsiStory achieves cross-image consistency a priori, that is, during the generative process rather than imposing it post hoc.

Technical Approach

The authors describe a technique that hinges on subject-driven shared self-attention and correspondence-based feature injection. Unlike prior approaches that rely on personalization or encoder-based tools, it manipulates the diffusion model’s internal activations so that generated images align with one another rather than with an external source image. The process entails (see the sketch after this list):

  1. Localizing the subject in each noisy generated image during the denoising process.
  2. Enabling generated images to attend to other images' subject patches, facilitating subject consistency.
  3. Implementing self-attention dropout and query-feature blending to enrich layout diversity.
  4. Injecting features across images to enhance detailed consistency.
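
As a concrete illustration of steps 2–3, here is a minimal PyTorch sketch of the shared self-attention idea, under stated assumptions: `h` holds one self-attention layer's per-image hidden states of shape (B, N, C), `subj_mask` holds binary subject masks of shape (B, N) (e.g., derived as in step 1), and `to_q`, `to_k`, `to_v` stand in for that layer's projection weights. This is not the authors' implementation; multi-head attention and the query-feature blending of step 3 are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def shared_self_attention(h, subj_mask, to_q, to_k, to_v, dropout_p=0.5):
    """Self-attention extended so each image also attends to the subject
    patches of the other images in the batch (illustrative only)."""
    B, N, C = h.shape
    q = to_q(h)                                          # queries stay per-image: (B, N, C)
    k = to_k(h).reshape(1, B * N, C).expand(B, -1, -1)   # keys/values pooled over the batch
    v = to_v(h).reshape(1, B * N, C).expand(B, -1, -1)

    # Each image may attend to all of its own patches plus the *subject*
    # patches of every other image in the batch.
    own = torch.eye(B, device=h.device, dtype=torch.bool)        # (B, B) image-ownership blocks
    allowed = subj_mask.bool().unsqueeze(0).expand(B, -1, -1)    # (B, B, N) shared subject patches
    allowed = allowed | own.unsqueeze(-1)                        # own patches always visible

    # Self-attention dropout: randomly hide part of the shared subject patches
    # to keep layouts diverse (an image's own patches are never dropped).
    drop = (torch.rand(B, B, N, device=h.device) < dropout_p) & ~own.unsqueeze(-1)
    allowed = allowed & ~drop

    attn_mask = allowed.reshape(B, 1, B * N)                     # broadcast over the N queries
    scores = q @ k.transpose(-1, -2) / C ** 0.5                  # (B, N, B*N)
    scores = scores.masked_fill(~attn_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                         # (B, N, C)
```

In the full method, vanilla query features are additionally blended back in (step 3) and the correspondence-based feature injection of step 4 further harmonizes fine subject details across images.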

This allows for real-time generation that is roughly twenty times faster than the current state-of-the-art methods. Additionally, ConsiStory is capable of extending to multi-subject scenarios, providing a significant advantage over other methods that falter in these complex situations.

Performance Evaluation

ConsiStory was empirically compared to several baselines, showing superior subject consistency and prompt alignment without requiring costly training or backpropagation. The evaluation comprises:

  • Qualitative Assessments: Visual comparisons show that the method preserves subject consistency while adhering closely to the prompts.
  • Quantitative Measurements: CLIP scores measure prompt alignment and DreamSim scores measure subject consistency, complemented by a user study; across these, ConsiStory outperforms the baselines (a sketch of the CLIP metric follows below).
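
For concreteness, below is a minimal sketch of the prompt-alignment side of such an evaluation, assuming a Hugging Face CLIP checkpoint; the exact CLIP variant used in the paper is not specified here, so the model name is illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_text_image_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and its prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())
```

The consistency side is scored analogously with DreamSim, a learned perceptual similarity metric, computed over pairs of generated images of the same subject.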

Practical Implications and Extensions

Several practical applications are highlighted, such as compatibility with spatial control tools like ControlNet (a minimal usage sketch follows) and training-free personalization for common objects. Although the technique works well in many scenarios, it can struggle with unusual artistic styles and relies on the model's internal features to localize the subject.
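
As an illustration of the spatial-control compatibility, here is a standard ControlNet setup in diffusers that fixes the layout via a Canny edge map. The checkpoints and file name are assumptions for the example, and ConsiStory's attention sharing is not part of this API; per the paper it composes with such control, but the hook into the pipeline's self-attention layers is omitted here.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Edge map from a (hypothetical) reference image; it fixes the pose/layout
# that every prompt in the story should follow.
edges = cv2.Canny(np.array(Image.open("pose_reference.png").convert("RGB")), 100, 200)
control_image = Image.fromarray(np.repeat(edges[:, :, None], 3, axis=2))

prompts = ["a plush dragon on the beach", "a plush dragon in a snowy forest"]
images = [pipe(p, image=control_image, num_inference_steps=30).images[0] for p in prompts]
```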

Conclusion

ConsiStory represents a significant step forward in the generative-model landscape, offering a swift and efficient alternative to previous personalized text-to-image generation methods. With its feature-alignment strategies and emphasis on consistency, it stands out as a practical tool for creators seeking to tell cohesive visual stories without extensive computational overhead.
