Emergent Mind

Make It Count: Text-to-Image Generation with an Accurate Number of Objects

(2406.10210)
Published Jun 14, 2024 in cs.CV , cs.AI , and cs.GR

Abstract

Despite the unprecedented success of text-to-image diffusion models, controlling the number of depicted objects using text is surprisingly hard. This is important for various applications, from technical documents to children's books to illustrating cooking recipes. Generating correct object counts is fundamentally challenging because the generative model needs to keep a sense of separate identity for every instance of the object, even if several objects look identical or overlap, and then carry out a global computation implicitly during generation. It is still unknown if such representations exist. To address count-correct generation, we first identify features within the diffusion model that can carry object identity information. We then use them to separate and count instances of objects during the denoising process and detect over-generation and under-generation. We fix the latter by training a model that predicts both the shape and location of a missing object, based on the layout of existing ones, and show how it can be used to guide denoising with the correct object count. Our approach, CountGen, does not depend on an external source to determine object layout, but rather uses the prior from the diffusion model itself, creating prompt-dependent and seed-dependent layouts. Evaluated on two benchmark datasets, we find that CountGen strongly outperforms existing baselines in count accuracy.

CountGen generates the correct number of objects, surpassing DALL-E 3, Reason Out Your Layout, and others.

Overview

  • The paper presents a novel approach named CountGen to enhance count accuracy in text-to-image generation by leveraging the SDXL model's self-attention layers, the ReLayout network, and a dual-optimization process.

  • CountGen demonstrates significant improvements in accurately generating the intended number of objects, with evaluations on benchmark datasets revealing superior performance over previous models.

  • The research offers both practical benefits for applications requiring precision in generated images and theoretical insights into the internal representations of object instances within diffusion models.


The paper "Make It Count: Text-to-Image Generation with an Accurate Number of Objects" addresses a significant gap in the current capabilities of text-to-image diffusion models: the ability to generate images with an explicitly defined number of objects. Despite their success, these models often struggle with precisely controlling the number of objects in response to textual prompts. This control becomes crucial for applications across various domains, including technical documentation, educational resources, and media content creation.

Contributions

The authors propose a novel approach named CountGen, comprising several innovations designed to enhance count accuracy in text-to-image generation. The methodology encompasses three primary steps:

  1. Identification of Objectness and Instance Identity in SDXL: The authors analyze the self-attention layers of the SDXL model to identify features that represent objectness and individual instances. They find that layer $l^{up}_{52}$ at timestep $t=500$ provides a robust separation of object instances, which is used to perform object-instance segmentation early in the denoising process.
  2. ReLayout Network: To correct count inaccuracies, the authors developed ReLayout—a network trained to predict a new layout with the correct number of objects while preserving the original scene's spatial composition. This network leverages a dataset of image pairs with slight object count variations, focusing on maintaining structural coherence during incremental object addition.
  3. Layout-Guided Image Generation: Combining the identified object layouts and the ReLayout corrections, the authors implement a dual-optimization process during generation. This includes a weighted binary cross-entropy loss to guide object placement and self-attention masking to inhibit unwanted object creation in background areas.
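
As a rough illustration of step 1, once a per-pixel "objectness" signal is available, instances can be separated and counted with simple connected-component analysis. The map below is a toy stand-in for the signal the paper derives from SDXL self-attention features; the names `count_instances` and `objectness` are illustrative, not from the paper, and the real feature extraction and clustering are not reproduced here.

```python
from collections import deque

def count_instances(objectness, threshold=0.5):
    """Count connected foreground regions in a 2D objectness map.

    `objectness` is a toy stand-in for the per-pixel foreground signal
    the paper reads out of SDXL self-attention features; the actual
    feature extraction and clustering are not reproduced here.
    """
    h, w = len(objectness), len(objectness[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if objectness[i][j] > threshold and not seen[i][j]:
                count += 1                       # new instance discovered
                seen[i][j] = True
                queue = deque([(i, j)])
                while queue:                     # flood-fill its pixels
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and objectness[ny][nx] > threshold
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# Two separated blobs -> two instances.
grid = [[0.0] * 8 for _ in range(8)]
for y in range(1, 3):
    for x in range(1, 3):
        grid[y][x] = 0.9
for y in range(5, 7):
    for x in range(5, 7):
        grid[y][x] = 0.8
print(count_instances(grid))  # 2
```

Comparing such a detected count against the prompt's target count is what lets the method flag over- or under-generation mid-denoising.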

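The layout-guidance objective in step 3 can be sketched as a per-pixel weighted binary cross-entropy between the model's predicted objectness and the target layout mask. This is a minimal sketch of the idea, not the paper's implementation; the weights `w_fg`/`w_bg` (up-weighting pixels inside target objects) are illustrative assumptions.

```python
import math

def weighted_bce(pred, target, w_fg=2.0, w_bg=1.0, eps=1e-7):
    """Per-pixel weighted binary cross-entropy.

    pred:   2D map of predicted objectness probabilities in (0, 1).
    target: 2D binary layout mask (1 = pixel belongs to a target object).
    w_fg / w_bg are illustrative weights, not the paper's values.
    """
    total, n = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)   # clamp for log stability
            w = w_fg if t == 1 else w_bg
            total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            n += 1
    return total / n

# A prediction that matches the layout scores a lower loss than one that
# ignores it, so gradient steps on this loss pull objects toward the layout.
good = weighted_bce([[0.9, 0.1]], [[1, 0]])
bad = weighted_bce([[0.5, 0.5]], [[1, 0]])
```

In the paper this guidance is paired with self-attention masking so that no new objects form in background regions.
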
Evaluation

CountGen demonstrates a significant leap in accuracy over existing models. The evaluations were conducted on two benchmark datasets: T2I-CompBench-Count and the newly introduced CoCoCount. The results indicate a substantial improvement in generating the correct number of objects:

  • Human Evaluation: CountGen achieved 54% accuracy on CoCoCount, compared to 26% by SDXL and 38% by DALL-E 3.
  • Automatic Evaluation: Using YOLOv9 for objective assessment, CountGen outperformed other baselines, registering a count accuracy of 50%, a notable increase from the 28% seen in the SDXL model.
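
The automatic metric is straightforward: run a detector (YOLOv9 in the paper) on each generated image and check whether the detected object count matches the prompt. A minimal sketch, with detector outputs replaced by plain integers:

```python
def count_accuracy(pred_counts, target_counts):
    """Fraction of images whose detected object count equals the target.

    `pred_counts` would come from running an object detector (e.g.
    YOLOv9) on the generated images; here both are lists of ints.
    """
    if len(pred_counts) != len(target_counts):
        raise ValueError("mismatched number of images")
    hits = sum(p == t for p, t in zip(pred_counts, target_counts))
    return hits / len(target_counts)

print(count_accuracy([3, 5, 2, 4], [3, 4, 2, 4]))  # 0.75
```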

The qualitative assessments revealed that CountGen not only improves count accuracy but also maintains high image quality. Instances where the generated image deviated from the target count were significantly reduced.

Implications and Future Directions

CountGen's advancements have dual implications:

  1. Practical Implications: The improved control over object counts in generated images enhances the reliability of AI-generated content in domains requiring precision, such as technical illustrations, educational materials, and advertising.
  2. Theoretical Implications: The study reveals insights into the internal representations of object instances within diffusion models, contributing to the broader understanding of objectness in generative AI.

The research also opens avenues for future exploration:

  • Extending the model's capabilities to handle varied and more complex scene compositions involving multiple types of objects.
  • Investigating the potential of integrating external spatial priors to further refine layout accuracy and coherence.
  • Developing more sophisticated training datasets that simulate a wider array of real-world scenarios, ensuring robustness across diverse applications.

Conclusion

The authors present a methodical and technologically advanced approach to addressing a fundamental challenge in text-to-image generation. By leveraging internal diffusion model features and pioneering new corrective methods, CountGen achieves significant enhancements in count accuracy without compromising visual quality. This represents a meaningful step forward for both practical applications and theoretical research in AI-driven generative modeling.
