Abstract

In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model by incorporating higher-quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user-prompt adherence with a significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.

The weak-to-strong training strategy speeds up model convergence across the VAE transition, higher-resolution adaptation, and KV compression stages.

Overview

  • 'PixArt-Σ' introduces a novel Diffusion Transformer model capable of generating 4K resolution images from textual descriptions, using a 'weak-to-strong training' method.

  • The model utilizes a high-quality dataset including 33M high-resolution images and sophisticated captions, enhancing text-image alignment and reducing hallucinations.

  • It incorporates an efficient token compression mechanism via group convolutions and a specialized weight initialization strategy to efficiently generate high-resolution images.

  • Empirical validation shows 'PixArt-Σ' produces photo-realistic images that closely adhere to user prompts, setting new benchmarks in Text-to-Image synthesis.

PixArt-Σ: Advancing High-Resolution Text-to-Image Generation with Efficient Training

Introduction to PixArt-Σ

Recent research has introduced PixArt-Σ, a cutting-edge Diffusion Transformer (DiT) model capable of generating 4K resolution images from textual descriptions. The model takes a novel approach, termed "weak-to-strong training," leveraging high-quality data and efficient training techniques to build on the foundational capabilities of its precursor, PixArt-α. With only 0.6B parameters, PixArt-Σ achieves superior performance in generating high-fidelity images closely aligned with text prompts, establishing a new benchmark in the Text-to-Image (T2I) synthesis domain.

High-Quality Training Data

One of the key advancements of PixArt-Σ lies in its use of improved training data, which includes:

  • High-quality images: Encompassing 33M high-resolution images sourced from the Internet, with a significant proportion in 4K resolution, this dataset not only amplifies the aesthetic quality of generated images but also spans a wide range of artistic styles.
  • Dense and accurate captions: The model utilizes captions generated by a more sophisticated image captioner than the one used for PixArt-α. This, coupled with a longer token processing length for the text encoder, markedly reduces hallucinations and enriches text-image alignment.
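A toy illustration of why a longer text-encoder token budget matters (the whitespace tokenizer and the limits 77 and 300 below are illustrative assumptions, not the paper's actual tokenizer or numbers): a tight budget silently truncates dense captions, discarding exactly the fine-grained details they were generated to provide.

```python
def truncate_caption(caption, max_tokens):
    """Crude whitespace tokenizer; real systems use the text encoder's own
    tokenizer, and the budgets used here are illustrative only."""
    tokens = caption.split()
    return tokens[:max_tokens]

# A dense, detail-rich caption of 200 (toy) tokens.
caption = " ".join(f"detail{i}" for i in range(200))

short = truncate_caption(caption, 77)    # tight budget drops most details
long_ = truncate_caption(caption, 300)   # longer budget keeps the full caption
```

With the tight budget, everything after the 77th token never reaches the model, so dense captions only pay off when the token processing length is extended to match them.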

Efficient Token Compression

Addressing the computational challenges in generating ultra-high-resolution images, PixArt-Σ incorporates a novel attention module that efficiently compresses keys and values within the DiT framework. This is achieved through group convolutions and a specialized weight initialization strategy, significantly cutting computational costs and enabling the model to generate 4K resolution images efficiently.
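The mechanism can be illustrated with a minimal, pure-Python sketch (hypothetical shapes and a 1-D analogue of the model's 2×2 spatial compression; function names are illustrative, not from the paper's code): averaging groups of g consecutive key/value tokens shrinks the attention matrix from N×N to N×(N/g), while queries keep full resolution.

```python
import math

def compress_tokens(tokens, g):
    """Average-pool consecutive groups of g tokens — a 1-D stand-in for the
    grouped convolution applied to keys/values in the actual model."""
    assert len(tokens) % g == 0
    dim = len(tokens[0])
    out = []
    for i in range(0, len(tokens), g):
        group = tokens[i:i + g]
        out.append([sum(v[d] for v in group) / g for d in range(dim)])
    return out

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qd * kd for qd, kd in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out

def kv_compressed_attention(queries, keys, values, g=2):
    """Queries stay at full length; keys/values are compressed g-fold,
    so the score matrix has N * N/g entries instead of N * N."""
    return attention(queries, compress_tokens(keys, g),
                     compress_tokens(values, g))
```

With g=2 the attention cost roughly halves; the paper applies the same idea spatially so that 4K token grids remain tractable.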

From Weak to Strong: A Novel Training Strategy

PixArt-Σ advances through a series of fine-tuning stages, efficiently transitioning from the foundational capabilities of PixArt-α to achieve significantly enhanced performance. Noteworthy strategies include:

  • Adapting to more powerful VAEs and higher resolutions with minimal additional training, thanks to effective initialization techniques.
  • Implementing Key-Value (KV) Token Compression to address computational complexity, facilitating high-resolution image generation.
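The initialization idea behind these strategies can be sketched as follows (a hypothetical 1-D analogue with illustrative function names): if the learnable compression weights start as a uniform average, the newly added KV compression layer initially behaves like plain average pooling, so fine-tuning begins from behavior close to the pretrained model rather than from a random projection.

```python
def init_avg_weights(g):
    """Initialize the g learnable compression weights to uniform 1/g, so the
    compression layer starts out as exact average pooling (a toy analogue of
    initializing the model's 2x2 compression conv to averaging)."""
    return [1.0 / g] * g

def weighted_compress(tokens, weights):
    """Compress consecutive groups of len(weights) tokens with one learnable
    weight per position in the group."""
    g = len(weights)
    assert len(tokens) % g == 0
    dim = len(tokens[0])
    out = []
    for i in range(0, len(tokens), g):
        group = tokens[i:i + g]
        out.append([sum(w * v[d] for w, v in zip(weights, group))
                    for d in range(dim)])
    return out
```

At initialization the compressed output equals a plain mean of each group, which is why the model needs only minimal additional training to adapt; the weights then drift from 1/g as fine-tuning learns a better compression.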

Empirical Validation and Implications

The advancements in PixArt-Σ are empirically validated through comparisons with both open-source models and leading commercial T2I products. The model's ability to generate photo-realistic images, which adhere closely to user prompts and showcase a high level of aesthetic quality, places it on par with, if not superior to, contemporary T2I models.

Furthermore, the innovative "weak-to-strong training" approach not only exemplifies the model's efficiency in integrating new data and techniques but also opens up avenues for future developments in AI, particularly in the realms of content creation and generative models.

Conclusion

PixArt-Σ marks a significant leap forward in the generation of high-resolution images from text, accomplished through innovative data utilization and efficient training methodologies. Its exceptional performance, coupled with lower computational requirements, sets a new precedent in the field of AI-generated content, promising a wide array of applications across various industries. The model's success in adhering to complex prompts and its competency in generating 4K images underscore its potential as a pivotal tool in the ongoing evolution of generative AI.
