MLPs Learn In-Context on Regression and Classification Tasks (2405.15618v3)
Abstract: In-context learning (ICL), the remarkable ability to solve a task from only input exemplars, is often assumed to be a unique hallmark of Transformer models. By examining commonly employed synthetic ICL tasks, we demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. Moreover, MLPs, and the closely related MLP-Mixer models, learn in-context comparably to Transformers under the same compute budget in this setting. We further show that MLPs outperform Transformers on a series of classical tasks from psychology designed to test relational reasoning, which are closely related to in-context classification. These results underscore a need for studying in-context learning beyond attention-based architectures, while also challenging prior arguments against MLPs' ability to solve relational tasks. Altogether, our results highlight the unexpected competence of MLPs in a synthetic setting, and support the growing interest in all-MLP alternatives to Transformer architectures. It remains unclear how MLPs perform against Transformers at scale on real-world tasks, and where a performance gap may originate. We encourage further exploration of these architectures in more complex settings to better understand the potential comparative advantage of attention-based schemes.
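To make the setting concrete, the following is a minimal sketch (not the authors' code) of the commonly used synthetic in-context regression task and of how an MLP can be applied to it: each prompt consists of exemplar pairs (x_i, y_i) drawn from a freshly sampled linear function plus a query input, and the MLP receives the whole prompt flattened into a single vector rather than as a token sequence. The dimensions, widths, and function names below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the standard synthetic ICL regression setup: a Transformer sees the
# prompt as a sequence of (x_i, y_i) exemplars; an MLP sees the same prompt
# flattened into one vector. Hyperparameters here are illustrative only.
import jax
import jax.numpy as jnp

D_X = 8     # input dimension (illustrative)
N_CTX = 16  # number of in-context exemplars (illustrative)

def sample_prompt(key):
    """Draw one in-context regression task: y_i = w . x_i with a fresh w."""
    k_w, k_x, k_q = jax.random.split(key, 3)
    w = jax.random.normal(k_w, (D_X,))
    xs = jax.random.normal(k_x, (N_CTX, D_X))
    ys = xs @ w                      # context labels
    x_q = jax.random.normal(k_q, (D_X,))
    y_q = x_q @ w                    # regression target for the query
    # Flatten the (x_i, y_i) pairs and the query into a single input vector,
    # which is all an MLP needs: no attention, no explicit sequence structure.
    flat = jnp.concatenate([xs.reshape(-1), ys, x_q])
    return flat, y_q

def init_mlp(key, in_dim, hidden=256, depth=3):
    """Plain fully connected network with ReLU hidden layers."""
    dims = [in_dim] + [hidden] * depth + [1]
    params = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

def mlp_apply(params, x):
    for w, b in params[:-1]:
        x = jax.nn.relu(x @ w + b)
    w, b = params[-1]
    return (x @ w + b).squeeze()

key = jax.random.PRNGKey(0)
flat, y_q = sample_prompt(key)
params = init_mlp(jax.random.PRNGKey(1), flat.shape[0])
pred = mlp_apply(params, flat)       # squared error against y_q is the training loss
```

Because every training prompt uses a freshly sampled weight vector w, the network cannot memorize any single function; at evaluation it must infer w from the in-context exemplars alone, which is what "learning in-context" means in this synthetic setting.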