Linear Log-Normal Attention with Unbiased Concentration (2311.13541v4)

Published 22 Nov 2023 in cs.LG and cs.AI

Abstract: Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism with respect to the sequence length. This limitation poses a substantial obstacle when dealing with long documents or high-resolution images. In this work, we study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability. Furthermore, we propose tools to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention, designed to emulate the distribution and concentration behavior of the original self-attention. Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives, offering a promising avenue for enhancing the scalability of transformer models.
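To make the scalability argument concrete, the sketch below contrasts standard softmax attention, which materializes an n x n score matrix and therefore scales quadratically with sequence length, against a generic kernelized linear attention that aggregates over keys first and scales linearly. This is only an illustrative sketch based on the abstract: the placeholder feature map `phi` (an exponential of the scaled inputs, chosen here because exponentiating Gaussian-like scores yields log-normally distributed values) stands in for the paper's actual moment-matched construction, which is not specified in the abstract.

# Illustrative sketch, not the paper's exact formulation: standard softmax
# attention (quadratic in sequence length n) vs. a generic kernelized linear
# attention (linear in n). The feature map `phi` is a placeholder assumption;
# Linear Log-Normal Attention designs its feature map so the resulting
# attention matches the log-normal distribution and concentration of softmax.
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes the n x n matrix -> O(n^2 d) time/memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                      # (n, d_v)

def phi(X):
    """Placeholder positive feature map (exp of scaled inputs); the map actually
    used by Linear Log-Normal Attention is defined in the paper, not here."""
    return np.exp(X / np.sqrt(X.shape[-1]))

def linear_attention(Q, K, V):
    """Kernelized attention: sum over keys first -> O(n d d_v) time, no n x n matrix."""
    Qf, Kf = phi(Q), phi(K)                                 # (n, d)
    KV = Kf.T @ V                                           # (d, d_v), aggregated once
    Z = Kf.sum(axis=0)                                      # (d,), normalizer terms
    return (Qf @ KV) / (Qf @ Z)[:, None]                    # (n, d_v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 128, 16, 16
    Q = rng.normal(size=(n, d))
    K = rng.normal(size=(n, d))
    V = rng.normal(size=(n, d_v))
    print(softmax_attention(Q, K, V).shape)                 # (128, 16)
    print(linear_attention(Q, K, V).shape)                  # (128, 16), without the n x n matrix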
