Parallel Structures in Pre-training Data Yield In-Context Learning (2402.12530v1)
Abstract: Pre-trained language models (LMs) are capable of in-context learning (ICL): they can adapt to a task given only a few examples in the prompt, without any parameter updates. However, it is unclear where this capability comes from, as there is a stark distribution shift between pre-training text and ICL prompts. In this work, we study which patterns in the pre-training data contribute to ICL. We find that LMs' ICL ability depends on $\textit{parallel structures}$ in the pre-training data -- pairs of phrases following similar templates in the same context window. Specifically, we detect parallel structures by checking whether training on one phrase improves prediction of the other, and we conduct ablation experiments to study their effect on ICL. We show that removing parallel structures from the pre-training data reduces LMs' ICL accuracy by 51% (vs. 2% for random ablation). This drop persists even when we exclude common patterns such as n-gram repetitions and long-range dependencies, showing the diversity and generality of parallel structures. A closer look at the detected parallel structures indicates that they cover diverse linguistic tasks and span long distances in the data.
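The abstract's detection criterion (training on one phrase should improve prediction of the other) can be illustrated with a small probe. The sketch below is not the paper's actual pipeline; it is a minimal, assumed implementation using a Hugging Face GPT-2 model and PyTorch, where the choice of model, the single-SGD-step probe, the learning rate, and the example phrases are all illustrative assumptions.

```python
import copy

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast


def lm_loss(model, tokenizer, text, device="cpu"):
    """Average next-token cross-entropy of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    model.eval()
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()


def parallel_structure_score(model, tokenizer, phrase_a, phrase_b,
                             lr=1e-4, device="cpu"):
    """How much one gradient step on phrase_a lowers the loss on phrase_b.

    A clearly positive score suggests the two phrases follow a shared
    template, i.e. form a candidate parallel structure.
    """
    loss_before = lm_loss(model, tokenizer, phrase_b, device)

    # Take the gradient step on a throwaway copy of the model so the
    # probe does not change the original weights.
    probe = copy.deepcopy(model).to(device)
    probe.train()
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    ids = tokenizer(phrase_a, return_tensors="pt").input_ids.to(device)
    opt.zero_grad()
    probe(ids, labels=ids).loss.backward()
    opt.step()

    loss_after = lm_loss(probe, tokenizer, phrase_b, device)
    return loss_before - loss_after


if __name__ == "__main__":
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2")
    # Two phrases following the same template (illustrative only).
    a = "The capital of France is Paris."
    b = "The capital of Japan is Tokyo."
    print(parallel_structure_score(lm, tok, a, b))
```

In the paper, candidate phrase pairs come from the same pre-training context window, and pairs are treated as parallel structures only when the improvement is large enough; the threshold, learning rate, and model size used here are placeholders rather than the authors' settings.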