MEND: Meta dEmonstratioN Distillation for Efficient and Effective In-Context Learning (2403.06914v2)

Published 11 Mar 2024 in cs.CL and cs.AI

Abstract: LLMs have demonstrated impressive in-context learning (ICL) capabilities, where an LLM makes predictions for a given test input together with a few input-output pairs (demonstrations). However, including demonstrations makes the computational overhead of the self-attention mechanism grow quadratically with input length. Existing solutions attempt to distill lengthy demonstrations into compact vectors, but they often require task-specific retraining or compromise the LLM's in-context learning performance. To mitigate these challenges, we present Meta dEmonstratioN Distillation (MEND), where a language model learns to distill any lengthy demonstrations into vectors without retraining for a new downstream task. We exploit knowledge distillation to enhance alignment between MEND and the LLM, achieving efficiency and effectiveness simultaneously. MEND is endowed with the meta-knowledge for distilling demonstrations through a two-stage training process comprising meta-distillation pretraining and fine-tuning. Comprehensive evaluations across seven diverse ICL task partitions, using decoder-only (GPT-2) and encoder-decoder (T5) backbones, attest to MEND's effectiveness: it not only matches but often outperforms Vanilla ICL as well as other state-of-the-art distillation models, while significantly reducing computational demands. This innovation promises enhanced scalability and efficiency for the practical deployment of LLMs.
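The abstract describes the mechanism at a high level: a distillation model compresses a long block of demonstrations into a small set of vectors, those vectors are prepended to the LLM's input in place of the raw demonstrations, and a knowledge-distillation objective aligns the compressed conditioning with full-demonstration ICL. The PyTorch sketch below is a minimal illustration of that idea, not the authors' implementation; the names `DemonstrationDistiller` and `distillation_step`, the cross-attention pooling, the `llm`/`embed` interfaces, and the KL objective are all assumptions made for illustration.

```python
# Minimal sketch of the demonstration-distillation idea from the abstract.
# All names, the cross-attention pooling, and the llm/embed interfaces are
# assumptions for illustration; this is not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DemonstrationDistiller(nn.Module):
    """Compress a sequence of demonstration embeddings into k soft vectors."""

    def __init__(self, d_model: int, num_distilled: int = 8, n_heads: int = 4):
        super().__init__()
        # Learned query vectors attend over the demonstration tokens and
        # pool them into `num_distilled` vectors.
        self.queries = nn.Parameter(torch.randn(num_distilled, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, demo_embeds: torch.Tensor) -> torch.Tensor:
        # demo_embeds: (batch, demo_len, d_model) -> (batch, k, d_model)
        q = self.queries.unsqueeze(0).expand(demo_embeds.size(0), -1, -1)
        distilled, _ = self.attn(q, demo_embeds, demo_embeds)
        return distilled


def distillation_step(llm, embed, distiller, demo_ids, query_ids, temperature=2.0):
    """One training step aligning the student (distilled vectors + query)
    with the teacher (full demonstrations + query).

    Assumptions: `llm(inputs_embeds=...)` returns logits of shape
    (batch, seq_len, vocab) and stays frozen; `embed` maps token ids to
    embeddings; only `distiller` receives parameter updates.
    """
    with torch.no_grad():  # teacher: LLM conditioned on the full demonstrations
        teacher_in = embed(torch.cat([demo_ids, query_ids], dim=1))
        t_logits = llm(inputs_embeds=teacher_in)[:, -query_ids.size(1):, :]

    # Student: replace the demonstrations with k distilled vectors.
    distilled = distiller(embed(demo_ids))                 # (batch, k, d_model)
    student_in = torch.cat([distilled, embed(query_ids)], dim=1)
    s_logits = llm(inputs_embeds=student_in)[:, -query_ids.size(1):, :]

    # KL divergence between softened teacher and student next-token distributions.
    loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss
```

In this sketch the LLM stays frozen and only the distiller is trained, which reflects the abstract's stated goal of distilling demonstrations without retraining for each new downstream task; the gradient still flows through the frozen LLM so the distilled vectors learn to stand in for the full demonstrations.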
