The Transient Nature of Emergent In-Context Learning in Transformers (2311.08360v3)
Abstract: Transformer neural networks can exhibit a surprising capacity for in-context learning (ICL) despite not being explicitly trained for it. Prior work has provided a deeper understanding of how ICL emerges in transformers, e.g. through the lens of mechanistic interpretability, Bayesian inference, or by examining the distributional properties of training data. However, in each of these cases, ICL is treated largely as a persistent phenomenon; namely, once ICL emerges, it is assumed to persist asymptotically. Here, we show that the emergence of ICL during transformer training is, in fact, often transient. We train transformers on synthetic data designed so that both ICL and in-weights learning (IWL) strategies can lead to correct predictions. We find that ICL first emerges, then disappears and gives way to IWL, all while the training loss decreases, indicating an asymptotic preference for IWL. The transient nature of ICL is observed in transformers across a range of model sizes and datasets, raising the question of how much to "overtrain" transformers when seeking compact, cheaper-to-run models. We find that L2 regularization may offer a path to more persistent ICL that removes the need for early stopping based on ICL-style validation tasks. Finally, we present initial evidence that ICL transience may be caused by competition between ICL and IWL circuits.
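The abstract only sketches the experimental design at a high level. As a rough, hypothetical illustration (variable names, constants, and the data-generation details below are assumptions for exposition, not the authors' implementation), the kind of synthetic task described — sequences solvable either by copying a label from the context (ICL) or by recalling a memorized class-to-label mapping (IWL) — could be constructed along these lines:

```python
# Hypothetical sketch (not the paper's code) of synthetic sequences where both
# ICL and IWL strategies can produce the correct prediction.
import numpy as np

rng = np.random.default_rng(0)

NUM_TRAIN_CLASSES = 1600   # classes seen during training (IWL can memorize these)
NUM_HELDOUT_CLASSES = 32   # classes reserved for ICL evaluation
NUM_LABELS = 32            # output label vocabulary
DIM = 64                   # exemplar feature dimension
CONTEXT_PAIRS = 8          # (exemplar, label) pairs preceding the query

# Each class is a prototype vector; exemplars are noisy copies of it.
prototypes = rng.normal(size=(NUM_TRAIN_CLASSES + NUM_HELDOUT_CLASSES, DIM))
train_label_of_class = rng.integers(NUM_LABELS, size=NUM_TRAIN_CLASSES)  # fixed mapping -> IWL

def exemplar(cls):
    return prototypes[cls] + 0.1 * rng.normal(size=DIM)

def make_sequence(classes, label_of):
    """One sequence: CONTEXT_PAIRS (exemplar, label) pairs, then a query exemplar."""
    ctx_classes = rng.choice(classes, size=CONTEXT_PAIRS, replace=False)
    query_cls = ctx_classes[0]  # the query's class always appears in context
    ctx = [(exemplar(c), label_of[c]) for c in rng.permutation(ctx_classes)]
    return ctx, exemplar(query_cls), label_of[query_cls]

# Training sequence: the query class is present in context AND has a fixed
# global label, so either attending to context (ICL) or memorizing the
# class-to-label mapping (IWL) yields the correct answer.
train_seq = make_sequence(np.arange(NUM_TRAIN_CLASSES), train_label_of_class)

# ICL evaluation: held-out classes whose labels are assigned freshly for each
# sequence, so only reading the context can give the right answer.
heldout = np.arange(NUM_TRAIN_CLASSES, NUM_TRAIN_CLASSES + NUM_HELDOUT_CLASSES)
fresh_labels = dict(zip(heldout, rng.integers(NUM_LABELS, size=heldout.size)))
icl_eval_seq = make_sequence(heldout, fresh_labels)
```

In a setup like this, an ICL evaluator would score the model only on sequences like `icl_eval_seq`, whose class-to-label mappings never appear during training, while an IWL evaluator would use familiar classes with uninformative context. The L2 regularization mentioned in the abstract is, in practice, often applied through an optimizer's weight-decay setting (for example, AdamW's `weight_decay` argument), though the paper's exact configuration is not reproduced here.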