Abstract

In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model by incorporating higher-quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user-prompt adherence with a significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.

The weak-to-strong training strategy speeds up model convergence across the VAE transition, higher-resolution adaptation, and KV compression stages.

Overview

  • 'PixArt-Σ' introduces a novel Diffusion Transformer model capable of generating 4K resolution images from textual descriptions, using a 'weak-to-strong training' method.

  • The model utilizes a high-quality dataset including 33M high-resolution images and sophisticated captions, enhancing text-image alignment and reducing hallucinations.

  • It incorporates an efficient token compression mechanism via group convolutions and a specialized weight initialization strategy to efficiently generate high-resolution images.

  • Empirical validation shows 'PixArt-Σ' produces photo-realistic images that closely adhere to user prompts, setting new benchmarks in Text-to-Image synthesis.

PixArt-Σ: Advancing High-Resolution Text-to-Image Generation with Efficient Training

Introduction to PixArt-Σ

Recent research has introduced PixArt-Σ, a cutting-edge Diffusion Transformer (DiT) model capable of generating 4K resolution images from textual descriptions. The model takes a novel approach, termed "weak-to-strong training," leveraging high-quality data and efficient training techniques to build on the foundational capabilities of its precursor, PixArt-α. With only 0.6B parameters, PixArt-Σ achieves superior performance in generating high-fidelity images closely aligned with text prompts, establishing a new benchmark in the Text-to-Image (T2I) synthesis domain.

High-Quality Training Data

One of the key advancements of PixArt-Σ lies in its use of improved training data, which includes:

  • High-quality images: Encompassing 33M high-resolution images sourced from the Internet, with a significant proportion in 4K resolution, this dataset not only amplifies the aesthetic quality of generated images but also spans a wide range of artistic styles.
  • Dense and accurate captions: The model utilizes captions generated by a more sophisticated image captioner than the one used for PixArt-α. This, coupled with a longer token processing length for the text encoder, markedly reduces hallucinations and enriches text-image alignment.
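A toy illustration of why a longer text-encoder token budget matters (the whitespace tokenizer and the limits 77 and 300 below are illustrative assumptions, not the paper's actual tokenizer or numbers): a tight budget silently truncates dense captions, discarding exactly the fine-grained details they were generated to provide.

```python
def truncate_caption(caption, max_tokens):
    """Crude whitespace tokenizer; real systems use the text encoder's own
    tokenizer, and the budgets used here are illustrative only."""
    tokens = caption.split()
    return tokens[:max_tokens]

# A dense, detail-rich caption of 200 (toy) tokens.
caption = " ".join(f"detail{i}" for i in range(200))

short = truncate_caption(caption, 77)    # tight budget drops most details
long_ = truncate_caption(caption, 300)   # longer budget keeps the full caption
```

With the tight budget, everything after the 77th token never reaches the model, so dense captions only pay off when the token processing length is extended to match them.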

Efficient Token Compression

Addressing the computational challenges in generating ultra-high-resolution images, PixArt-Σ incorporates a novel attention module that efficiently compresses keys and values within the DiT framework. This is achieved through group convolutions and a specialized weight initialization strategy, significantly cutting computational costs and enabling the model to generate 4K resolution images efficiently.
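The mechanism can be illustrated with a minimal, pure-Python sketch (hypothetical shapes and a 1-D analogue of the model's 2×2 spatial compression; function names are illustrative, not from the paper's code): averaging groups of g consecutive key/value tokens shrinks the attention matrix from N×N to N×(N/g), while queries keep full resolution.

```python
import math

def compress_tokens(tokens, g):
    """Average-pool consecutive groups of g tokens — a 1-D stand-in for the
    grouped convolution applied to keys/values in the actual model."""
    assert len(tokens) % g == 0
    dim = len(tokens[0])
    out = []
    for i in range(0, len(tokens), g):
        group = tokens[i:i + g]
        out.append([sum(v[d] for v in group) / g for d in range(dim)])
    return out

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qd * kd for qd, kd in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out

def kv_compressed_attention(queries, keys, values, g=2):
    """Queries stay at full length; keys/values are compressed g-fold,
    so the score matrix has N * N/g entries instead of N * N."""
    return attention(queries, compress_tokens(keys, g),
                     compress_tokens(values, g))
```

With g=2 the attention cost roughly halves; the paper applies the same idea spatially so that 4K token grids remain tractable.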

From Weak to Strong: A Novel Training Strategy

PixArt-Σ advances through a series of fine-tuning stages, efficiently transitioning from the foundational capabilities of PixArt-α to achieve significantly enhanced performance. Noteworthy strategies include:

  • Adapting to more powerful VAEs and higher resolutions with minimal additional training, thanks to effective initialization techniques.
  • Implementing Key-Value (KV) Token Compression to address computational complexity, facilitating high-resolution image generation.
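The initialization idea behind these strategies can be sketched as follows (a hypothetical 1-D analogue with illustrative function names): if the learnable compression weights start as a uniform average, the newly added KV compression layer initially behaves like plain average pooling, so fine-tuning begins from behavior close to the pretrained model rather than from a random projection.

```python
def init_avg_weights(g):
    """Initialize the g learnable compression weights to uniform 1/g, so the
    compression layer starts out as exact average pooling (a toy analogue of
    initializing the model's 2x2 compression conv to averaging)."""
    return [1.0 / g] * g

def weighted_compress(tokens, weights):
    """Compress consecutive groups of len(weights) tokens with one learnable
    weight per position in the group."""
    g = len(weights)
    assert len(tokens) % g == 0
    dim = len(tokens[0])
    out = []
    for i in range(0, len(tokens), g):
        group = tokens[i:i + g]
        out.append([sum(w * v[d] for w, v in zip(weights, group))
                    for d in range(dim)])
    return out
```

At initialization the compressed output equals a plain mean of each group, which is why the model needs only minimal additional training to adapt; the weights then drift from 1/g as fine-tuning learns a better compression.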

Empirical Validation and Implications

The advancements in PixArt-Σ are empirically validated through comparisons with both open-source models and leading commercial T2I products. The model's ability to generate photo-realistic images, which adhere closely to user prompts and showcase a high level of aesthetic quality, places it on par with, if not superior to, contemporary T2I models.

Furthermore, the innovative "weak-to-strong training" approach not only exemplifies the model's efficiency in integrating new data and techniques but also opens up avenues for future developments in AI, particularly in the realms of content creation and generative models.

Conclusion

PixArt-Σ marks a significant leap forward in the generation of high-resolution images from text, accomplished through innovative data utilization and efficient training methodologies. Its exceptional performance, coupled with lower computational requirements, sets a new precedent in the field of AI-generated content, promising a wide array of applications across various industries. The model's success in adhering to complex prompts and its competency in generating 4K images underscore its potential as a pivotal tool in the ongoing evolution of generative AI.
