Is attention required for ICL? Exploring the Relationship Between Model Architecture and In-Context Learning Ability (2310.08049v3)
Abstract: What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps toward answering this question. We evaluate thirteen model architectures capable of causal language modeling across a suite of synthetic in-context learning tasks. The selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, state space model-inspired architectures, and other emerging attention alternatives. We discover that all the considered architectures can perform in-context learning under a wider range of conditions than previously documented. Additionally, we observe stark differences in statistical efficiency and consistency when varying the number of in-context examples and task difficulty. We also measure each architecture's predisposition towards in-context learning when presented with the option to memorize rather than leverage in-context examples. Finally, and somewhat surprisingly, we find that several attention alternatives are sometimes competitive with transformers as in-context learners, or even outperform them. However, no single architecture demonstrates consistency across all tasks, with performance either plateauing or declining when confronted with a significantly larger number of in-context examples than those encountered during gradient-based training.
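The abstract describes evaluating causal sequence models on synthetic in-context learning tasks while varying the number of in-context examples. As a rough illustration of what such an evaluation harness can look like, the sketch below builds prompts for an in-context linear-regression task and scores a model's query-point error as the context grows; the task choice, model interface, dimensions, and function names are illustrative assumptions rather than the paper's exact protocol.

```python
# Illustrative sketch (not the paper's exact protocol): construct a synthetic
# in-context linear-regression task and measure query error as a function of
# the number of in-context examples. The model interface is an assumption.
import torch

def sample_linear_task(n_examples: int, dim: int = 8):
    """Sample one task: a hidden weight vector w and (x, y) pairs with y = w @ x."""
    w = torch.randn(dim)
    xs = torch.randn(n_examples + 1, dim)        # last point is the held-out query
    ys = xs @ w
    return xs, ys

def build_prompt(xs: torch.Tensor, ys: torch.Tensor) -> torch.Tensor:
    """Interleave examples as [x_1, y_1, ..., x_k, y_k, x_query] so the model
    must predict y_query from the in-context pairs alone."""
    dim = xs.shape[1]
    y_tokens = torch.zeros(len(ys), dim)
    y_tokens[:, 0] = ys                           # embed scalar targets in the x-dimension
    interleaved = torch.stack([xs, y_tokens], dim=1).reshape(-1, dim)
    return interleaved[:-1]                       # drop the query's y: that is the prediction target

@torch.no_grad()
def eval_icl(model, n_context_values=(1, 2, 4, 8, 16, 32), n_tasks: int = 256):
    """Mean squared error on the query point versus number of in-context examples."""
    results = {}
    for k in n_context_values:
        errors = []
        for _ in range(n_tasks):
            xs, ys = sample_linear_task(k)
            prompt = build_prompt(xs, ys).unsqueeze(0)  # (1, seq_len, dim)
            pred = model(prompt)[0, -1, 0]              # assumed: model returns per-position outputs
            errors.append((pred - ys[-1]) ** 2)
        results[k] = torch.stack(errors).mean().item()
    return results
```

Any causal sequence model exposing per-position outputs of matching dimension could, in principle, be dropped in behind the assumed `model(prompt)` interface; plotting the returned error against the number of in-context examples is one way to surface the differences in statistical efficiency and length extrapolation the abstract reports.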