Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers (2403.19591v2)
Abstract: Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations with piece-wise linear approximation and store the parameters in look-up tables (LUTs), but most of them require hardware-unfriendly high-precision arithmetic such as FP/INT 32 and do not consider integer-only INT quantization. This paper proposes a genetic LUT-approximation algorithm, namely GQA-LUT, that automatically determines the approximation parameters with quantization awareness. The results demonstrate that GQA-LUT achieves negligible accuracy degradation on the challenging semantic segmentation task for both vanilla and linear Transformer models. Moreover, the proposed GQA-LUT enables INT8-based LUT approximation, achieving area savings of 81.3–81.7% and a power reduction of 79.3–80.2% compared to the high-precision FP/INT 32 alternatives. Code is available at https://github.com/PingchengDong/GQA-LUT.
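The core idea the abstract describes, piece-wise linear approximation of a non-linear function with segment parameters stored in a LUT, can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the uniform breakpoint grid, the segment-fitting scheme, and the helper names (`fit_pwl_lut`, `pwl_eval`) are assumptions for this sketch. GQA-LUT instead searches over breakpoint placements with a genetic algorithm under INT quantization constraints; see the linked repository for the authoritative code.

```python
# Minimal sketch of piece-wise linear LUT approximation of a non-linear op.
# Illustrative only: uses a uniform breakpoint grid, whereas GQA-LUT evolves
# the breakpoints genetically with quantization awareness.
import numpy as np

def gelu(x):
    # Reference non-linearity (tanh approximation of GELU).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fit_pwl_lut(fn, breakpoints):
    """Fit one linear segment (slope, intercept) between consecutive breakpoints."""
    slopes, intercepts = [], []
    for lo, hi in zip(breakpoints[:-1], breakpoints[1:]):
        k = (fn(hi) - fn(lo)) / (hi - lo)   # segment slope
        b = fn(lo) - k * lo                  # segment intercept
        slopes.append(k)
        intercepts.append(b)
    return np.array(slopes), np.array(intercepts)

def pwl_eval(x, breakpoints, slopes, intercepts):
    """Evaluate the approximation: find the segment index, then y = k*x + b."""
    idx = np.clip(np.searchsorted(breakpoints, x) - 1, 0, len(slopes) - 1)
    return slopes[idx] * x + intercepts[idx]

# Example: 8 segments over [-4, 4]. A genetic search (as in the paper) would
# optimize where these breakpoints sit instead of spacing them uniformly.
bp = np.linspace(-4.0, 4.0, 9)
k, b = fit_pwl_lut(gelu, bp)
x = np.linspace(-4.0, 4.0, 1001)
err = np.max(np.abs(pwl_eval(x, bp, k, b) - gelu(x)))
print(f"max abs error with 8 uniform segments: {err:.4f}")
```

In a quantization-aware variant, the slopes and intercepts would additionally be rounded to the INT8 grid before evaluation, so the fitting objective measures error under the quantized parameters rather than the floating-point ones.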