Cost-Effective Attention Mechanisms for Low Resource Settings: Necessity & Sufficiency of Linear Transformations (2403.01643v3)
Abstract: From natural language processing to vision, Scaled Dot Product Attention (SDPA) is the backbone of most modern deep learning applications. Unfortunately, its memory and computational requirements can be prohibitive in low-resource settings. In this paper, we improve its efficiency without sacrificing its versatility. We propose three attention variants that either remove consecutive linear transformations or add a novel one, and we evaluate them on a range of standard NLP and vision tasks. Our proposed models are substantially lighter than standard SDPA, with 25-50% fewer parameters. We show that the performance cost of these changes is negligible relative to the size reduction, and that one variant (Super Attention) outperforms SDPA by up to 10% while being faster and using 25% fewer parameters.
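For context, the sketch below shows the standard SDPA baseline that the proposed variants modify: single-head self-attention with learned query, key, value, and output projections. This is a minimal illustrative implementation assuming a PyTorch setting; the class name and single-head simplification are ours, and the paper's specific variants (which consecutive linear transformations are removed, and what Super Attention adds) are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandardSDPA(nn.Module):
    """Baseline scaled dot-product self-attention (single head, for illustration).
    The query/key/value/output projections below are the linear transformations
    whose necessity the paper's variants examine."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5  # 1 / sqrt(d_k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        attn = F.softmax(scores, dim=-1)
        return self.w_o(torch.matmul(attn, v))
```

Removing, say, the value and output projections from this baseline would drop roughly half of its attention parameters, which is the kind of 25-50% reduction the abstract reports; the exact configurations used in the paper differ and are defined there.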