PTQ4DiT: Post-training Quantization for Diffusion Transformers (2405.16005v3)
Abstract: The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a PTQ method designed specifically for DiTs. We identify two primary quantization challenges inherent in DiTs: the presence of salient channels with extreme magnitudes, and the temporal variability in the distributions of salient activations over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman's $\rho$-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate the extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability, and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.
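The abstract does not give formulas for CSB, so the following is only a minimal NumPy sketch of the general idea of redistributing extreme channel magnitudes between activations and weights. It assumes that a channel's salience is its maximum absolute value and that the balanced target is the geometric mean of the activation and weight saliences; both choices, along with the helper names `channel_salience`, `balance`, and `fake_quant`, are illustrative assumptions and not taken from the text above.

```python
import numpy as np

def channel_salience(M, axis):
    """Assumed salience measure: max absolute value along `axis`."""
    return np.abs(M).max(axis=axis)

def balance(X, W, eps=1e-8):
    """Rescale matching channels of X (tokens, d_in) and W (d_in, d_out).

    Channel j is column j of X and row j of W. The scaling factors are
    chosen so that X_b @ W_b equals X @ W in exact arithmetic.
    """
    s_x = np.maximum(channel_salience(X, axis=0), eps)
    s_w = np.maximum(channel_salience(W, axis=1), eps)
    s_bal = np.sqrt(s_x * s_w)            # assumed target: geometric mean
    X_b = X * (s_bal / s_x)               # scale activation channels
    W_b = W * (s_bal / s_w)[:, None]      # scale matching weight rows
    return X_b, W_b

def fake_quant(M, bits=8):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(M).max() / qmax
    return np.clip(np.round(M / scale), -qmax, qmax) * scale

# Toy data with one salient activation channel and one salient weight channel.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 50.0
W = rng.normal(size=(16, 8)) * 0.1
W[5, :] *= 50.0

X_b, W_b = balance(X, W)

# Compare the W8A8 output error with and without balancing.
ref = X @ W
err_plain = np.abs(fake_quant(X) @ fake_quant(W) - ref).mean()
err_balanced = np.abs(fake_quant(X_b) @ fake_quant(W_b) - ref).mean()
print(f"mean abs error, plain:    {err_plain:.6f}")
print(f"mean abs error, balanced: {err_balanced:.6f}")
```

Because the two scaling factors multiply to one for every channel, the balanced product matches the original before quantization. In this sketch the weight-side factors could be folded into the stored weights ahead of time, which is loosely the spirit of the offline re-parameterization mentioned in the abstract, although the paper's actual procedure is not described here.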