
CosmicMan: A Text-to-Image Foundation Model for Humans

arXiv:2404.01294 · Published Apr 1, 2024 in cs.CV

Abstract

We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models, which are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential to the final results of trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 million high-quality real-world human images at a mean resolution of 1488x1255, attached with precise text annotations derived from 115 million attributes of diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic: easy to integrate into downstream tasks while effective at producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion models and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing the continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem with ease.

Figure: Comparison between CosmicMan-SDXL and the pretrained SDXL model in 2D human editing using T2I-Adapter.

Overview

  • CosmicMan is a new text-to-image (T2I) foundation model specialized in generating high-fidelity human images, outperforming general-purpose models in terms of appearance, structure, and text-image alignment.

  • The model is powered by the CosmicMan-HQ dataset, built via the Annotate Anyone paradigm, which combines human expertise and AI for the continuous creation of high-quality human-centric data.

  • CosmicMan employs the Decomposed-Attention-Refocusing (Daring) training framework, which features Data Discretization and the HOLA loss to improve learning and text-image alignment in image generation.

  • The model shows superior performance in generating human images over existing foundation models and offers practical advantages in applications like 2D image editing and 3D human reconstruction.

CosmicMan: Pioneering the Specialization of Text-to-Image Models in Human Image Generation

Introduction to CosmicMan

The advent of text-to-image (T2I) foundation models such as DALL-E, Imagen, and Stable Diffusion (SD) has significantly advanced image generation. Benefiting from extensive image-text datasets and sophisticated generative algorithms, these models produce images with remarkable fidelity and detail. In human-centric content generation, however, they exhibit a critical limitation: there has been no specialized foundation model focused exclusively on human subjects.

To address this, we introduce CosmicMan, a T2I foundation model dedicated to generating high-fidelity human images. CosmicMan outperforms general-purpose models by ensuring meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions for human images.

CosmicMan-HQ Dataset Construction

The effectiveness of CosmicMan stems from the CosmicMan-HQ dataset, constructed via a novel data production paradigm named Annotate Anyone, emphasizing human-AI collaboration. This paradigm ensures the ongoing creation of high-quality human-centric data, aligning with the complex requirements of human image generation.

Annotate Anyone Paradigm

Annotate Anyone introduces a systematic, scalable approach to data collection and annotation that leverages both human expertise and AI capabilities. This paradigm involves two primary stages:

  1. Flowing Data Sourcing: By continuously monitoring a broad spectrum of internet sources alongside recycling academic datasets such as LAION-5B, SHHQ, and DeepFashion, Annotate Anyone ensures a diverse and expansive data pool.
  2. Human-in-the-loop Data Annotation: In this iterative process, human annotators refine AI-generated labels, focusing only on attributes that fail to meet a predefined accuracy threshold. This significantly reduces manual annotation costs while improving label quality (a minimal sketch of the loop is given below).
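
To make stage 2 concrete, here is a minimal, runnable sketch of one annotation round in this spirit. The function names (`ai_label`, `human_verify`, `human_annotate`), the 0.95 threshold, and the audit-sample size are illustrative assumptions, not values taken from the paper.

```python
import random

ACCURACY_THRESHOLD = 0.95  # assumed value; the paper's exact threshold is not given here

def ai_label(image_id: str, attribute: str) -> str:
    """Stand-in for an AI annotator predicting one attribute value."""
    return random.choice(["short", "long", "unknown"])

def human_verify(sample: list) -> float:
    """Stand-in for human auditors; returns the fraction of correct AI labels."""
    return random.uniform(0.85, 1.0)

def human_annotate(image_id: str, attribute: str) -> str:
    """Stand-in for a human annotator labelling one image."""
    return "human-corrected-label"

def annotation_round(image_ids: list, attributes: list) -> dict:
    labels = {}
    for attr in attributes:
        preds = {img: ai_label(img, attr) for img in image_ids}
        # Audit a small human-checked sample to estimate per-attribute accuracy.
        audit = random.sample(list(preds.items()), k=min(10, len(preds)))
        if human_verify(audit) >= ACCURACY_THRESHOLD:
            labels[attr] = preds  # AI labels accepted wholesale for this attribute
        else:
            # Only below-threshold attributes fall back to humans, concentrating
            # manual cost where the AI annotator is weakest.
            labels[attr] = {img: human_annotate(img, attr) for img in image_ids}
    return labels

if __name__ == "__main__":
    imgs = [f"img_{i:03d}" for i in range(20)]
    print(annotation_round(imgs, ["hair_length", "sleeve_type"]))
```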

The outcome is the CosmicMan-HQ dataset, which comprises 6 million high-resolution images annotated with 115 million attributes, providing a robust foundation for the CosmicMan model.

Decomposed-Attention-Refocusing (Daring) Training Framework

CosmicMan leverages the Daring training framework, which is designed to be both effective and straightforward to integrate into downstream tasks. Key innovations of Daring include:

  • Data Discretization: By decomposing dense text descriptions into fixed groups aligned with human body structure, CosmicMan can more effectively learn the intricate relationships between textual concepts and their corresponding visual representations.
  • HOLA Loss: The Human Body and Outfit Guided Loss for Alignment (HOLA) improves text-image alignment at the group level, enhancing the model's ability to generate images that conform to detailed descriptions (a toy sketch of both ideas follows the list).
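
Since this article summarizes the paper at a high level, the following is only a toy sketch of how Data Discretization and a group-level attention loss might fit together. The group names, tensor shapes, and the inside-region ratio form of the loss are assumptions made for illustration; the paper's exact formulation may differ.

```python
import torch

# Data Discretization: a dense caption is split into fixed groups that
# mirror human body structure (group names here are hypothetical).
caption_groups = {
    "head":   "short black hair, round face",
    "top":    "red cotton t-shirt with short sleeves",
    "bottom": "blue denim jeans",
    "shoes":  "white sneakers",
    "whole":  "a young woman standing outdoors",
}

def hola_loss(attn: torch.Tensor, group_masks: torch.Tensor,
              token_to_group: torch.Tensor) -> torch.Tensor:
    """Group-level attention-refocusing loss (one plausible reading of HOLA).

    attn:           (T, H, W) cross-attention map for each text token
    group_masks:    (G, H, W) binary region mask per body-structure group
    token_to_group: (T,)      group index assigned to each token
    """
    loss = attn.new_zeros(())
    for g in range(group_masks.shape[0]):
        tokens = (token_to_group == g).nonzero(as_tuple=True)[0]
        if tokens.numel() == 0:
            continue
        group_attn = attn[tokens].mean(dim=0)        # (H, W)
        inside = (group_attn * group_masks[g]).sum()
        total = group_attn.sum() + 1e-8
        loss = loss + (1.0 - inside / total)         # penalize attention outside the region
    return loss / group_masks.shape[0]

# Toy usage: 4 tokens, 2 groups, an 8x8 attention grid.
attn = torch.rand(4, 8, 8)
masks = torch.zeros(2, 8, 8)
masks[0, :4] = 1.0  # top half of the image
masks[1, 4:] = 1.0  # bottom half of the image
print(hola_loss(attn, masks, torch.tensor([0, 0, 1, 1])))
```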

Evaluation and Applications

Compared with state-of-the-art foundation models, CosmicMan demonstrates superior capabilities in generating human images with improved fidelity and alignment. Extensive ablation studies validate the contributions of the Annotate Anyone paradigm and the Daring training framework to the model's performance.

Furthermore, application tests in 2D human image editing and 3D human reconstruction highlight the practical advantages of CosmicMan as a specialized foundation model for human-centric tasks.
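
As a usage-level illustration, a CosmicMan-SDXL checkpoint should drop into the standard diffusers SDXL + T2I-Adapter pipeline. In the sketch below, the model id is a placeholder for the released weights and `pose_map.png` is an assumed local pose image; the TencentARC adapter checkpoint is a known public artifact.

```python
import torch
from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

# Placeholder id: substitute the officially released CosmicMan-SDXL weights.
MODEL_ID = "cosmicman/CosmicMan-SDXL"

# A public pose-conditioned T2I-Adapter for SDXL.
adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2i-adapter-openpose-sdxl-1.0", torch_dtype=torch.float16
)

pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    MODEL_ID, adapter=adapter, torch_dtype=torch.float16
).to("cuda")

pose = load_image("pose_map.png")  # assumed precomputed OpenPose skeleton image

image = pipe(
    prompt="a woman with long wavy brown hair wearing a green knit sweater",
    image=pose,
    num_inference_steps=30,
).images[0]
image.save("edited_human.png")
```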

Conclusion and Future Directions

CosmicMan represents a significant step forward in the specialization of text-to-image foundation models for human-centered applications. By addressing the unique challenges of human image generation, CosmicMan sets a new benchmark for future research in this domain.

As part of our long-term commitment, we plan to continually update both the CosmicMan-HQ dataset and the CosmicMan model, ensuring they remain at the forefront of advancements in human image generation technology.
