In-context Learning with Retrieved Demonstrations for Language Models: A Survey

(arXiv:2401.11624)
Published Jan 21, 2024 in cs.CL, cs.AI, and cs.IR

Abstract

Language models, especially pre-trained LLMs, have showcased remarkable abilities as few-shot in-context learners, adept at adapting to new tasks with just a few demonstrations in the input context. However, a model's ability to perform in-context learning (ICL) is sensitive to the choice of the few-shot demonstrations. Instead of using a fixed set of demonstrations, one recent development is to retrieve demonstrations tailored to each input query. The implementation of demonstration retrieval is relatively straightforward, leveraging existing databases and retrieval systems. This not only improves the efficiency and scalability of the learning process but also has been shown to reduce biases inherent in manual example selection. In light of the encouraging results and growing research in ICL with retrieved demonstrations, we conduct an extensive review of studies in this area. In this survey, we discuss and compare different design choices for retrieval models, retrieval training procedures, and inference algorithms.

Overview

  • The survey paper reviews retrieval-based in-context learning (RetICL), which dynamically selects demonstrations tailored to each query to optimize language model performance.

  • RetICL enhances LLMs in few-shot learning scenarios by focusing on the relevance and usefulness of demonstrations without updating model parameters.

  • Various RetICL strategies are discussed, including single-pass ("one-hoc"), clustering-based, and iterative retrieval, each employing a different method of demonstration selection to improve alignment with the query.

  • RetICL faces challenges around corpus creation, retriever selection, and integration of advanced training methods but has shown efficacy in tasks ranging from QA to text generation.

  • The paper suggests that future research should address existing issues and deepen theoretical understanding, pointing to RetICL's potential to advance AI, particularly in resource-constrained settings.

Introduction

Few-shot in-context learning (ICL) is the capability of LLMs to adapt to new tasks using a limited number of demonstrations. This capability obviates the need for task-specific fine-tuning, offering advantages such as resource efficiency and a reduced risk of overfitting. Traditional ICL approaches use a fixed set of demonstrations for all queries, which leaves much of the LLMs' potential untapped. This survey provides an extensive review of a rapidly growing variant: retrieval-based ICL (RetICL), in which tailored demonstrations are selected dynamically for each query to optimize model performance.
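
To make the setup concrete, below is a minimal, illustrative sketch of vanilla few-shot ICL on a sentiment task. The task format, demonstrations, and function names are our own, not from the survey; note that the demonstrations here are fixed regardless of the query:

```python
# Minimal sketch of vanilla few-shot ICL (names are illustrative): the task
# is specified entirely through demonstrations placed in the prompt, and the
# model's parameters are never updated.

FIXED_DEMOS = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I wasted two hours of my life.", "negative"),
]

def build_prompt(query: str, demos: list[tuple[str, str]]) -> str:
    """Concatenate (input, label) demonstrations, then append the test query."""
    blocks = [f"Review: {x}\nSentiment: {y}" for x, y in demos]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

# Traditional ICL reuses the same demonstrations for every query.
print(build_prompt("An instant classic.", FIXED_DEMOS))
```

RetICL replaces the fixed demonstration list with demonstrations retrieved per query, as sketched in the next section.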

Few-shot In-context Learning for Language Models

LLMs excel in few-shot learning scenarios, making inferences from a handful of demonstrations without any parameter updates. Despite significant strides, the success of ICL hinges on the quality, quantity, and diversity of the demonstrations, which calls for techniques that shift from static to dynamic, query-oriented demonstration selection. RetICL aims to maximize the relevance and usefulness of these demonstrations by considering key factors such as the complexity of the retrieval model, the diversity of the retrieval corpus, and the retriever's objectives during the selection process.
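
As a hedged illustration of this shift, the sketch below retrieves the top-k most similar demonstrations for each query. TF-IDF similarity stands in for the learned dense encoders (e.g., Sentence-BERT or DPR) that RetICL methods typically use, and all names and examples are our own:

```python
# Minimal sketch of per-query demonstration retrieval. TF-IDF is a stand-in
# for a learned dense encoder; the pool and names are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Candidate pool of labeled demonstrations (in practice, a large training set).
pool = [
    ("The plot never goes anywhere.", "negative"),
    ("A warm, funny, beautifully acted film.", "positive"),
    ("The soundtrack is forgettable but the visuals stun.", "positive"),
    ("Two hours I will never get back.", "negative"),
]

vectorizer = TfidfVectorizer()
pool_matrix = vectorizer.fit_transform([x for x, _ in pool])  # embed pool once

def retrieve_demos(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k demonstrations most similar to the query."""
    sims = cosine_similarity(vectorizer.transform([query]), pool_matrix)[0]
    top = sims.argsort()[::-1][:k]  # indices of the k highest similarities
    return [pool[i] for i in top]

# Each query now gets its own tailored demonstrations.
print(retrieve_demos("Gorgeous cinematography, but the story stalls."))
```

In practice, similarities come from a dense retriever rather than TF-IDF, and the retrieved pairs are formatted into the prompt exactly as in the earlier sketch.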

In-context Learning with Demonstration Retrieval

RetICL selects demonstrations in alignment with the input query. Various strategies realize this objective, spanning single-pass ("one-hoc") retrieval, clustering retrieval, and iterative retrieval. They differ in how demonstrations are selected: independently, through clustering for diversity, or iteratively so that each choice builds on the context of previously chosen demonstrations. The retrieval corpus is equally vital and can range from in-domain, mixed-domain, and cross-domain collections to raw text and unlabeled queries. Advanced RetICL techniques fine-tune the retrieval model itself, often using the LLM's own feedback to curate training data, with objectives ranging from the InfoNCE contrastive loss to distillation via KL divergence, aiming for both relevance and diversity.
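
To illustrate the contrastive objective mentioned above, here is a minimal sketch of an InfoNCE loss with in-batch negatives, as commonly used to train dense retrievers. The shapes, temperature, and positive-selection scheme are simplifying assumptions rather than any one paper's recipe:

```python
# Sketch of InfoNCE with in-batch negatives: each query embedding is pulled
# toward the embedding of a demonstration judged helpful (the positive) and
# pushed away from the other demonstrations in the batch (the negatives).
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, demo_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """query_emb, demo_emb: (batch, dim). Row i of demo_emb is the positive
    for query i; every other row serves as an in-batch negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(demo_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Example: a batch of 8 query/positive-demonstration embedding pairs.
q = torch.randn(8, 128, requires_grad=True)  # retriever's query embeddings
d = torch.randn(8, 128, requires_grad=True)  # embeddings of helpful demos
loss = info_nce(q, d)
loss.backward()  # gradients flow back into the retriever's encoder
```

Distillation-based variants instead minimize the KL divergence between the retriever's similarity distribution over candidates and a target distribution derived from LLM feedback (e.g., how much each candidate demonstration improves the LLM's likelihood of the correct answer).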

Applications and Future Directions

RetICL has demonstrated efficacy across several task categories, including natural language understanding, reasoning, knowledge-based QA, and text generation. Challenges persist in corpus creation, retriever choice, training methods, and the integration of active learning. Future research must resolve these challenges while also deepening our theoretical understanding of why retrieved, similar demonstrations outperform random selection, and of how RetICL can be adapted to smaller models without sacrificing performance.

The insights gleaned from RetICL highlight the remarkable yet untapped potential of LLMs when demonstrations are chosen with an informed, context-sensitive retrieval strategy. Continued exploration of this domain is poised to refine our understanding and use of LLMs, pushing the boundaries of AI within complex, resource-constrained environments. This survey captures the current state of the field while guiding future developments in the evolving landscape of in-context learning.

