Many-Shot In-Context Learning (2404.11018v3)
Abstract: LLMs excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the number of available human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. We also find that inference cost increases linearly in the many-shot regime, and frontier LLMs benefit from many-shot ICL to varying degrees. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
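To make the three prompting regimes described in the abstract concrete, below is a minimal Python sketch of how such prompts might be assembled. It is an illustration under stated assumptions, not the paper's implementation: `generate` is a hypothetical stand-in for any LLM call, `check` is a hypothetical final-answer verifier, and the prompt templates and field names are invented for this example.

```python
from typing import Callable, Sequence


def many_shot_prompt(examples: Sequence[tuple[str, str]], query: str) -> str:
    """Standard many-shot ICL: concatenate many (problem, solution) pairs in context."""
    shots = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in examples)
    return f"{shots}\n\nProblem: {query}\nSolution:"


def reinforced_icl_examples(
    problems: Sequence[tuple[str, str]],
    generate: Callable[[str], str],          # hypothetical LLM call
    check: Callable[[str, str], bool],       # hypothetical final-answer checker
) -> list[tuple[str, str]]:
    """Reinforced ICL: replace human rationales with model-generated chain-of-thought
    rationales, keeping only those whose final answer is judged correct."""
    kept: list[tuple[str, str]] = []
    for problem, gold_answer in problems:
        rationale = generate(
            f"Problem: {problem}\nThink step by step, then state the final answer."
        )
        if check(rationale, gold_answer):    # filter by final-answer correctness
            kept.append((problem, rationale))
    return kept


def unsupervised_icl_prompt(problems: Sequence[str], query: str) -> str:
    """Unsupervised ICL: no rationales or answers, only domain-specific problems."""
    listed = "\n\n".join(f"Problem: {p}" for p in problems)
    return (
        "You will be shown example problems from the target domain, "
        "then asked to solve a new one.\n\n"
        f"{listed}\n\nNow solve this problem step by step.\n"
        f"Problem: {query}\nSolution:"
    )
```

In the many-shot regime the example list can contain hundreds or thousands of shots, so prompt length, and therefore inference cost, grows roughly linearly with the number of shots, consistent with the abstract's observation about linear inference cost.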