Linear Transformers are Versatile In-Context Learners (2402.14180v2)
Abstract: Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their ability to handle more complex problems remains unexplored. In this paper, we prove that each layer of a linear transformer maintains a weight vector for an implicit linear regression problem and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem, linear transformers discover an intricate and highly effective optimization algorithm, matching or surpassing many reasonable baselines in performance. We analyze this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.
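To make the in-context gradient-descent interpretation concrete, here is a minimal NumPy sketch (not the paper's exact construction) of the well-known observation that a single linear self-attention layer, with one hand-picked choice of key, query, value, and projection matrices, reproduces the prediction of one gradient-descent step on the in-context least-squares objective. The token layout, the specific weight matrices, and the learning rate `eta` are illustrative assumptions.

```python
# Sketch: one linear self-attention layer vs. one gradient-descent step
# on an in-context least-squares problem. Weight choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, N, eta = 4, 32, 0.5            # feature dim, context size, learning rate

# In-context regression data: y_j = w_star^T x_j
w_star = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_star
x_q = rng.normal(size=d)          # query point whose label is to be predicted

# Tokens: context tokens carry (x_j, y_j); the query token carries (x_q, 0).
E = np.hstack([X, y[:, None]])    # (N, d+1)
e_q = np.concatenate([x_q, [0.0]])

# Hand-constructed attention weights (assumed parameterization):
# keys/queries read the x-part, values read the y-part.
W_KQ = np.zeros((d + 1, d + 1)); W_KQ[:d, :d] = np.eye(d)
W_V = np.zeros((d + 1, d + 1)); W_V[d, d] = 1.0
W_P = (eta / N) * np.eye(d + 1)   # projection absorbs step size and 1/N

# Linear attention update for the query token (no softmax):
#   delta = W_P @ sum_j (W_V e_j) (W_KQ e_j)^T (W_KQ e_q)
K = E @ W_KQ.T                    # keys:   rows are (x_j, 0)
V = E @ W_V.T                     # values: rows are (0, ..., 0, y_j)
q = W_KQ @ e_q                    # query:  (x_q, 0)
delta = W_P @ (V.T @ (K @ q))
y_hat_attention = (e_q + delta)[d]   # prediction read off the label slot

# One explicit GD step on L(w) = 1/(2N) * sum_j (w^T x_j - y_j)^2 from w = 0.
grad = -(X.T @ y) / N
w_1 = -eta * grad
y_hat_gd = w_1 @ x_q

print(f"attention prediction:   {y_hat_attention:.6f}")
print(f"one-step GD prediction: {y_hat_gd:.6f}")
assert np.allclose(y_hat_attention, y_hat_gd)
```

Stacking several such layers and training all weight matrices end-to-end is what allows the model to realize preconditioned, momentum-like, and noise-adaptive variants of this basic step, rather than the fixed hand-constructed one shown here.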