We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
Diffusion Transformers (DiTs) integrate transformer architectures with diffusion models for generative modeling, focusing on the latent representations of images to achieve high-resolution image generation.
The DiT architecture adapts the Vision Transformer design to the diffusion setting, introducing adaLN-Zero blocks for improved performance and showing that image fidelity improves predictably with model scale.
Empirical evaluations demonstrate DiTs' superiority in image generation quality and training efficiency over existing models, particularly on challenging ImageNet benchmarks.
The study highlights the theoretical and practical implications of using transformers in diffusion models and suggests future research directions for further advancements in generative modeling.
A new paradigm in generative modeling has emerged through the integration of transformer architectures with diffusion models, termed Diffusion Transformers (DiTs). These models are trained on latent representations of images, departing from the U-Net backbone widely adopted in prior diffusion models. The research demonstrates that the scalability of transformers carries over to high-resolution image generation, yielding state-of-the-art results on class-conditional ImageNet.
At the core of DiTs is an adaptation of the Vision Transformer (ViT) design, tailored to the demands of diffusion models: a transformer network that operates directly on sequences of latent patches. The architecture's scalability is analyzed through variations in transformer depth, width, and the number of input tokens, revealing a robust correlation between forward-pass complexity (Gflops) and image fidelity, as quantified by the Fréchet Inception Distance (FID) metric.
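The tokenization step described above can be sketched as follows. This is a minimal illustration in NumPy, not the paper's implementation; the function name `patchify` and the specific shapes are assumptions. It shows how a spatial latent becomes a token sequence, and why halving the patch size quadruples the token count (and thus the Gflops).

```python
import numpy as np

def patchify(latent: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a latent feature map of shape (C, H, W) into a sequence of
    flattened patch tokens of shape (num_tokens, patch_size**2 * C)."""
    c, h, w = latent.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "latent dims must divide by patch size"
    # (C, H/p, p, W/p, p) -> (H/p, W/p, p, p, C) -> (T, p*p*C)
    tokens = (latent
              .reshape(c, h // p, p, w // p, p)
              .transpose(1, 3, 2, 4, 0)
              .reshape((h // p) * (w // p), p * p * c))
    return tokens

# A 256x256 image encoded by the VAE gives a 4x32x32 latent; with
# patch size 2 (as in DiT-XL/2) that is a sequence of 256 tokens.
z = np.zeros((4, 32, 32))
tokens = patchify(z, 2)  # shape (256, 16)
```

With patch size 4 the same latent yields only 64 tokens, illustrating the patch-size/Gflops trade-off the scaling analysis measures.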
A distinctive aspect of the DiT design is the adaLN-Zero block, a variant of adaptive layer normalization whose initialization strategy ensures that each DiT block begins as the identity function. This choice proved instrumental in improving model performance, marking a departure from conventional conditioning mechanisms in diffusion model design.
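The identity-at-initialization property can be sketched numerically. This is a simplified single-branch illustration, not the paper's code: the name `ada_ln_zero_block` and the exact parameter layout are assumptions, and a real DiT block applies separate modulation parameters to its attention and MLP sub-layers. The key idea survives the simplification: the layer regressing scale, shift, and gate from the conditioning vector is zero-initialized, so the gated residual branch contributes nothing and the block starts as the identity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Per-token layer norm without learned affine parameters."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ada_ln_zero_block(x, cond, W_mod, b_mod, sublayer):
    """One residual branch with adaLN-Zero conditioning.

    x:        (T, D) token sequence
    cond:     (D,) conditioning embedding (e.g. timestep + class)
    W_mod:    (D, 3*D) modulation weights, zero-initialized
    b_mod:    (3*D,) modulation biases, zero-initialized
    sublayer: mapping (T, D) -> (T, D), e.g. attention or MLP
    """
    # Regress shift, scale, and gate from the conditioning vector.
    shift, scale, gate = np.split(cond @ W_mod + b_mod, 3)
    h = layer_norm(x) * (1 + scale) + shift
    # gate == 0 at init, so the whole block reduces to the identity.
    return x + gate * sublayer(h)

rng = np.random.default_rng(0)
D = 8
x = rng.standard_normal((4, D))
cond = rng.standard_normal(D)
W_mod = np.zeros((D, 3 * D))   # adaLN-Zero: start modulation at zero
b_mod = np.zeros(3 * D)
out = ada_ln_zero_block(x, cond, W_mod, b_mod,
                        sublayer=lambda h: h @ rng.standard_normal((D, D)))
# out equals x exactly at initialization, regardless of the sub-layer
```

Once training begins, gradients flow into `W_mod` and `b_mod`, and the gate gradually opens the residual branch.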
Extensive empirical evaluations underline the effectiveness of DiTs across multiple dimensions: FID decreases consistently as model Gflops increase, whether through deeper and wider transformers or more input tokens, and the largest DiT-XL/2 model outperforms all prior diffusion models on class-conditional ImageNet at both 256x256 and 512x512, achieving a state-of-the-art FID of 2.27 on the 256x256 benchmark.
The integration of transformers within the diffusion paradigm presents compelling theoretical and practical implications. Theoretically, it extends the applicability of transformers, showcasing their versatility beyond language and conventional vision tasks. Practically, it sets a new benchmark in image generation, with potential applications spanning content creation, digital art, and beyond. Additionally, the architecture's scalability hints at the untapped potential awaiting further exploration in larger models and expansive datasets.
The research on DiTs presents a foundational step towards harnessing the full potential of transformers in generative modeling. Future avenues may include exploring cross-domain applications, enhancing model efficiency, and further pushing the boundaries of image quality and diversity. As the model continues to evolve, it stands to significantly influence the trajectory of generative model research and applications.
This research was made possible through contributions from across the academic community, with special thanks extended to team members and supporting institutions for their invaluable input and support.
Diffusion Transformers (DiTs) emerge as a powerful new class of generative models, bridging the capabilities of transformers with the requirements of diffusion-based image generation. Through architectural innovation and empirical validation, DiTs set a new standard for image quality and model scalability in generative modeling.