TerDiT: Ternary Diffusion Models with Transformers (2405.14854v2)

Published 23 May 2024 in cs.CV and cs.LG

Abstract: Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion transformer models (DiTs). Among diffusion models, diffusion transformers have demonstrated superior image-generation capabilities, boasting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their excessive parameter numbers. Although existing research has explored efficient deployment techniques for diffusion models, such as model quantization, there is still little work concerning DiT-based models. To tackle this research gap, we propose TerDiT, the first quantization-aware training (QAT) and efficient deployment scheme for extremely low-bit diffusion transformer models. We focus on the ternarization of DiT networks, with model sizes ranging from 600M to 4.2B, and image resolution from 256$\times$256 to 512$\times$512. Our work contributes to the exploration of efficient deployment of large-scale DiT models, demonstrating the feasibility of training extremely low-bit DiT models from scratch while maintaining competitive image generation capacities compared to full-precision models. Our code and pre-trained TerDiT checkpoints have been released at https://github.com/Lucky-Lance/TerDiT.

Citations (1)

Summary

The paper introduces a quantization-aware training scheme that trains diffusion transformers with ternary weights from scratch.
It employs an adaptive layer normalization modification to stabilize outputs and expedite training convergence.
The method achieves competitive FID scores with significantly reduced memory requirements, enabling deployment on resource-limited hardware.

An Expert Review of "TerDiT: Ternary Diffusion Models with Transformers"

The paper "TerDiT: Ternary Diffusion Models with Transformers" introduces the TerDiT framework, a quantization-aware training (QAT) and efficient deployment scheme for large-scale ternary Diffusion Transformers (DiTs). Diffusion transformers have shown superior image generation capabilities, achieving lower FID scores as parameter counts grow, but their computation and storage costs can be prohibitive, which makes efficient deployment critical.

Technical Overview

Diffusion models, particularly those utilizing transformer architectures, have set a new benchmark in high-quality image generation tasks. One of the primary challenges addressed by this paper is the efficient deployment of large-scale DiTs, which typically consist of hundreds of millions to several billion parameters. Existing research has focused on quantization methods for diffusion models, most notably with U-Net architectures, but there has been a lack of exploration into quantization for transformer-based diffusion models, a gap this paper aims to fill.

The TerDiT framework employs a quantization-aware training approach specifically tailored for ternary-weighted transformer models. It builds on low-bit quantization strategies that have proven successful in training LLMs, applying weight-only quantization that restricts model weights to the ternary values -1, 0, and +1, with an added scaling factor. This scheme aims to significantly reduce the memory footprint and computational resource requirements of deploying such large models.
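To make the ternary mapping concrete, here is a minimal sketch in plain Python. It assumes an absmean scale (the mean absolute weight), as used in some ternary LLM work; the review does not state the exact quantizer, so the function names `ternarize` and `dequantize` and the choice of scale are illustrative assumptions rather than the paper's implementation.

```python
def ternarize(weights, eps=1e-8):
    """Quantize a 2-D weight matrix (list of rows) to {-1, 0, +1} plus one scale.

    Assumption: the scale is the mean absolute weight (absmean). The
    quantizer actually used in the paper may differ.
    """
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat) + eps
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    ternary = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Reconstruct approximate full-precision weights: W_hat = scale * T."""
    return [[t * scale for t in row] for row in ternary]


W = [[0.9, -0.05, -1.3], [0.02, 0.4, -0.6]]
T, s = ternarize(W)
W_hat = dequantize(T, s)
print(T)  # every entry is -1, 0, or +1
print(s)  # single scaling factor shared by the whole matrix
```

In quantization-aware training, a mapping like this is typically applied in the forward pass while full-precision weights are kept for the gradient update; that training loop is omitted here.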

The authors modify the model architecture by incorporating a variant of adaptive layer normalization (adaLN) within the diffusion transformer block that applies root mean square (RMS) normalization after the ternary linear layer. This change is crucial for preserving performance and speeding up convergence: it stabilizes the activation distributions during training, which are otherwise skewed by the ternary representation of the weights.
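The effect of normalizing after a ternary layer can be sketched as follows. This is an illustration under stated assumptions, not the authors' module: `ternary_linear` and `adaln_modulation` are hypothetical names, the learnable gain is left at 1, and the real adaLN branch also splits its output into shift, scale, and gate terms, which is omitted here.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Root-mean-square normalization of a vector (no mean subtraction)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]


def ternary_linear(x, ternary, scale):
    """y = scale * (T @ x), where T holds only -1, 0, +1."""
    return [scale * sum(t * v for t, v in zip(row, x)) for row in ternary]


def adaln_modulation(cond, ternary, scale):
    """Hypothetical adaLN branch: ternary linear layer followed by RMS norm."""
    return rms_norm(ternary_linear(cond, ternary, scale))


T = [[1, 1, 1, 1], [1, -1, 1, 0], [-1, 0, 1, 1]]
cond = [3.0, -2.0, 5.0, 1.0]

raw = ternary_linear(cond, T, 0.5)      # magnitude depends on the input scale
out = adaln_modulation(cond, T, 0.5)    # RMS of the output is close to 1
out_big = adaln_modulation([100 * c for c in cond], T, 0.5)
print(raw, out, out_big)
```

Because the normalization divides by the RMS of the layer output, scaling the conditioning input by 100 leaves the normalized output essentially unchanged, which is the stabilizing behaviour the paragraph above describes.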

Numerical Results and Claims

Several strong numerical results substantiate the claims of efficiency and effectiveness of the TerDiT scheme. The paper presents comprehensive comparisons of TerDiT models against full-precision diffusion models on the ImageNet image generation task. The ternary model with 4.2 billion parameters achieves FID scores (9.66 without guidance, 2.42 with classifier-free guidance) comparable to its full-precision counterpart, indicating minimal degradation in performance. Furthermore, the reported footprint drops sharply: the TerDiT-4.2B model requires less than 3 GB of GPU memory during inference, compared with the 16 GB needed for its full-precision analog.
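A back-of-the-envelope calculation shows where savings of this size can come from. The sketch below counts weight storage only, and it assumes that ternary values are packed four per byte with 2-bit codes; the helper names and the packing layout are illustrative assumptions, not the released implementation.

```python
GB = 1e9          # decimal gigabytes
NUM_PARAMS = 4.2e9


def weight_gb(num_params, bits_per_weight):
    """Storage for the weights alone, ignoring activations and buffers."""
    return num_params * bits_per_weight / 8 / GB


fp32_gb = weight_gb(NUM_PARAMS, 32)    # about 16.8 GB
fp16_gb = weight_gb(NUM_PARAMS, 16)    # about 8.4 GB
packed_gb = weight_gb(NUM_PARAMS, 2)   # about 1.05 GB with 2-bit codes


def pack_ternary(values):
    """Pack ternary values (-1, 0, +1) four per byte using 2-bit codes."""
    codes = [v + 1 for v in values]       # map -1, 0, +1 to 0, 1, 2
    codes += [1] * (-len(codes) % 4)      # pad with the code for 0
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack_ternary(packed, count):
    """Inverse of pack_ternary; `count` drops any padding values."""
    values = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            values.append(((byte >> shift) & 0b11) - 1)
    return values[:count]


vals = [1, 0, -1, -1, 0, 1]
packed = pack_ternary(vals)
restored = unpack_ternary(packed, len(vals))
```

Under these assumptions, the 16 GB figure is consistent with 32-bit weights for 4.2 billion parameters, and 2-bit packed weights would take roughly 1 GB; the remainder of the sub-3 GB inference footprint would then be activations and any layers kept at higher precision.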

The paper also suggests that, with further optimization, scaling to larger parameter counts is feasible, implying that larger ternary models could narrow the performance gap typically observed between full-precision and quantized models under similar constraints.

Implications and Future Directions

Practically, the findings point toward effective deployment of advanced image-generating models on resource-limited hardware, such as mobile devices, by reducing the high computational and memory requirements associated with large diffusion transformers. This is particularly relevant for real-world applications where deploying highly complex models in constrained environments remains a priority.

On the theoretical front, the use of QAT to quantize DiT models hints at substantial precision redundancy in large-scale neural models, echoing similar findings in LLMs. This points to a research direction focused on how far numerical precision can be reduced without compromising output quality.

For future directions, the paper highlights the need for infrastructure support capable of exploiting the computational advantages that ternary weight networks can provide. Extending this work to other settings, such as text-to-image tasks, and integrating it into standardized development pipelines for AI workloads may also be explored.

The paper makes a significant contribution to addressing the efficiency and deployment challenges in large-scale diffusion models, and its findings offer a foundation for future advancements in the efficient execution of AI models across diverse applications and hardware configurations.

GitHub

GitHub - Lucky-Lance/TerDiT (58 stars)