In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization (2402.14951v1)

Published 22 Feb 2024 in stat.ML, cs.CL, and cs.LG

Abstract: We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}\text{-}\mathbf{\beta}$), in the sense that every $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator. Finally, we show that $\mathsf{GD}\text{-}\mathbf{\beta}$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}\text{-}\mathbf{\beta}$, and they highlight the role of MLP layers in reducing approximation error.
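
To make the $\mathsf{GD}\text{-}\mathbf{\beta}$ idea concrete, below is a minimal NumPy sketch (not taken from the paper; the variable names, the preconditioner matrix, and the averaging convention are illustrative assumptions). It takes one preconditioned gradient step on the in-context least-squares loss starting from a learnable initialization and then predicts at the query point.

```python
import numpy as np

def gd_beta_predict(X, y, x_query, beta0, Gamma):
    """One-step GD prediction from a learnable initialization beta0 (illustrative sketch).

    X       : (n, d) in-context inputs
    y       : (n,)   in-context targets
    x_query : (d,)   query input
    beta0   : (d,)   learnable initialization
    Gamma   : (d, d) learnable preconditioner (step size absorbed)
    """
    n = X.shape[0]
    residual = y - X @ beta0                        # errors of the initialization on the context
    beta1 = beta0 + Gamma @ (X.T @ residual) / n    # one preconditioned gradient step
    return float(x_query @ beta1)

# Toy usage: a linear-regression task drawn from a Gaussian prior with non-zero mean.
rng = np.random.default_rng(0)
d, n = 5, 20
mu = np.ones(d)                          # non-zero prior mean (assumed for illustration)
beta_star = mu + rng.normal(size=d)      # task vector sampled around the prior mean
X = rng.normal(size=(n, d))
y = X @ beta_star + 0.1 * rng.normal(size=n)
x_query = rng.normal(size=d)

print(gd_beta_predict(X, y, x_query, beta0=mu, Gamma=0.5 * np.eye(d)))
print(float(x_query @ beta_star))        # ground-truth response for comparison
```

In this reading, the learnable initialization plays the role of the non-zero prior mean; a pure linear-attention model corresponds to starting from zero, which is consistent with the abstract's claim that linear attention alone incurs an irreducible approximation error and that the MLP component is what lets the LTB realize the non-zero initialization.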
