We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models, which struggle with inferior quality and text-image misalignment for humans, CosmicMan generates photo-realistic human images with meticulous appearance, reasonable structure, and precise alignment between images and detailed dense descriptions. At the heart of CosmicMan's success are new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential to the final results of trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel producing high-quality data with accurate yet cost-effective annotations over time. On this basis, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 million high-quality real-world human images at a mean resolution of 1488×1255, paired with precise text annotations derived from 115 million attributes of diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic: easy to integrate into downstream tasks while effective at producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion models and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing the continuous text space into several basic groups aligned with human body structure is the key to tackling the misalignment problem.
CosmicMan is a new text-to-image (T2I) foundation model specialized in generating high-fidelity human images, outperforming general-purpose models in terms of appearance, structure, and text-image alignment.
The model is powered by the CosmicMan-HQ dataset, built on the Annotate Anyone paradigm which combines human expertise and AI for continuous, high-quality human-centric data creation.
CosmicMan employs the Decomposed-Attention-Refocusing (Daring) training framework, featuring Data Discretization and HOLA Loss techniques for enhanced learning and alignment in image generation.
The model shows superior performance in generating human images over existing foundation models and offers practical advantages in applications like 2D image editing and 3D human reconstruction.
The advent of text-to-image (T2I) foundation models such as DALL-E, Imagen, and Stable Diffusion (SD) has significantly advanced image generation. Benefiting from extensive image-text datasets and sophisticated generative algorithms, these models produce images with remarkable fidelity and detail. However, they exhibit a critical limitation in human-centric content generation: no existing foundation model specializes exclusively in human subjects.
To address this, we introduce CosmicMan, a T2I foundation model dedicated to generating high-fidelity human images. CosmicMan outperforms general-purpose models by ensuring meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions for human images.
The effectiveness of CosmicMan stems from the CosmicMan-HQ dataset, constructed via a novel data production paradigm named Annotate Anyone, emphasizing human-AI collaboration. This paradigm ensures the ongoing creation of high-quality human-centric data, aligning with the complex requirements of human image generation.
Annotate Anyone introduces a systematic, scalable approach to data collection and annotation that leverages both human expertise and AI capabilities. The paradigm involves two primary stages: continuous sourcing of in-the-wild human images, followed by human-in-the-loop annotation in which AI models propose labels and human annotators verify and refine them.
The outcome is the CosmicMan-HQ dataset, which comprises 6 million high-resolution images annotated with 115 million attributes, providing a robust foundation for the CosmicMan model.
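The source does not give implementation details of the data flywheel, but its core loop can be sketched conceptually. In the sketch below, `ai_annotate` and `human_review` are hypothetical placeholders (not part of the paper) standing in for the AI annotator and the human verification step:

```python
# Hypothetical sketch of a human-in-the-loop data flywheel in the spirit of
# Annotate Anyone: an AI annotator labels incoming images, low-confidence
# labels are routed to human annotators, and the accepted labels grow the
# dataset (which can periodically retrain the annotator).

from dataclasses import dataclass, field

@dataclass
class Sample:
    image_id: str
    attributes: dict = field(default_factory=dict)  # e.g. {"hair color": "brown"}

def ai_annotate(sample: Sample) -> tuple[dict, float]:
    """Placeholder AI annotator: returns (predicted attributes, confidence)."""
    return {"hair color": "brown"}, 0.62  # dummy prediction

def human_review(sample: Sample, attrs: dict) -> dict:
    """Placeholder for human correction of low-confidence labels."""
    return attrs  # a human annotator would edit attrs here

def flywheel_step(samples: list[Sample], threshold: float = 0.8) -> list[Sample]:
    labeled = []
    for s in samples:
        attrs, conf = ai_annotate(s)
        if conf < threshold:           # uncertain prediction -> ask a human
            attrs = human_review(s, attrs)
        s.attributes.update(attrs)
        labeled.append(s)
    return labeled  # the labeled pool grows over time
```

The design point is the routing rule: only low-confidence predictions consume human effort, which is what keeps annotation accurate yet cost-effective as the dataset scales.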
CosmicMan leverages the Daring training framework, which is designed to be both effective and straightforward to integrate into downstream tasks. Its key innovations are Data Discretization, which partitions continuous dense captions into basic groups aligned with human body structure, and the HOLA loss, which enforces attention refocusing on the decomposed cross-attention features without adding extra modules.
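The paper's exact loss is not reproduced here, but the general idea of attention refocusing can be illustrated with a minimal PyTorch sketch: given one cross-attention map per text group (e.g. head, upper body) and a binary mask for the matching body region, penalize the attention mass that falls outside that region. The function name and the source of the masks are illustrative assumptions, not the paper's HOLA loss:

```python
# Minimal sketch (not the paper's exact HOLA loss) of attention refocusing:
# encourage each text group's cross-attention map to concentrate inside its
# corresponding body-region mask.

import torch

def refocus_loss(attn_maps: torch.Tensor, region_masks: torch.Tensor) -> torch.Tensor:
    """
    attn_maps:    (G, H, W) cross-attention, one map per text group, each map
                  normalized to sum to 1 over the spatial dimensions.
    region_masks: (G, H, W) binary masks for the matching body regions.
    Returns a scalar that is 0 when all attention lies inside its region.
    """
    inside = (attn_maps * region_masks).flatten(1).sum(dim=1)  # mass inside each region
    return (1.0 - inside).mean()  # penalize mass that leaked outside
```

Because the loss is computed directly on the model's existing cross-attention maps, no extra trainable module is required, matching the "without adding extra modules" claim.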
Compared with state-of-the-art foundation models, CosmicMan demonstrates superior capability in generating human images with improved fidelity and alignment. Extensive ablation studies validate the contributions of both the Annotate Anyone paradigm and the Daring training framework to the model's performance.
Furthermore, application tests in 2D human image editing and 3D human reconstruction highlight the practical advantages of CosmicMan as a specialized foundation model for human-centric tasks.
CosmicMan represents a significant step forward in the specialization of text-to-image foundation models for human-centered applications. By addressing the unique challenges of human image generation, CosmicMan sets a new benchmark for future research in this domain.
As part of our long-term commitment, we plan to continually update both the CosmicMan-HQ dataset and the CosmicMan model, ensuring they remain at the forefront of advancements in human image generation technology.