Efficient Prompting Methods for Large Language Models: A Survey (2404.01077v2)

Published 1 Apr 2024 in cs.CL

Abstract: Prompting is a mainstream paradigm for adapting LLMs to specific natural language processing tasks without modifying internal parameters. Therefore, detailed supplementary knowledge needs to be integrated into external prompts, which inevitably brings extra human effort and computational burden in practical applications. As an effective solution for mitigating this resource consumption, Efficient Prompting Methods have attracted a wide range of attention. We provide high-level mathematical expressions to discuss Automatic Prompt Engineering for different prompt components and Prompt Compression in continuous and discrete spaces. Finally, we highlight promising future directions to inspire researchers interested in this field.

References (64)
  1. A general language assistant as a laboratory for alignment. ArXiv preprint, abs/2112.00861, 2021. URL https://arxiv.org/abs/2112.00861.
  2. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
  3. Sparks of artificial general intelligence: Early experiments with gpt-4. ArXiv preprint, abs/2303.12712, 2023. URL https://arxiv.org/abs/2303.12712.
  4. Recurrent memory transformer. ArXiv preprint, abs/2207.06881, 2022. URL https://arxiv.org/abs/2207.06881.
  5. Evoprompting: Language models for code-level neural architecture search. ArXiv preprint, abs/2302.14838, 2023a. URL https://arxiv.org/abs/2302.14838.
  6. Instructzero: Efficient instruction optimization for black-box large language models. ArXiv preprint, abs/2306.03082, 2023b. URL https://arxiv.org/abs/2306.03082.
  7. Adapting language models to compress contexts. ArXiv preprint, abs/2305.14788, 2023. URL https://arxiv.org/abs/2305.14788.
  8. Prompt injection: Parameterization of fixed inputs. ArXiv preprint, abs/2206.11349, 2022. URL https://arxiv.org/abs/2206.11349.
  9. RLPrompt: Optimizing discrete text prompts with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  3369–3391, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.222.
  10. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  11. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. ArXiv preprint, abs/2203.06904, 2022. URL https://arxiv.org/abs/2203.06904.
  12. A survey on in-context learning. 2022. URL https://api.semanticscholar.org/CorpusID:255372865.
  13. Extending context window of large language models via semantic compression. ArXiv preprint, abs/2312.09571, 2023. URL https://arxiv.org/abs/2312.09571.
  14. Promptbreeder: Self-referential self-improvement via prompt evolution. ArXiv preprint, abs/2309.16797, 2023. URL https://arxiv.org/abs/2309.16797.
  15. An image is worth one word: Personalizing text-to-image generation using textual inversion. ArXiv preprint, abs/2208.01618, 2022. URL https://arxiv.org/abs/2208.01618.
  16. Extensible prompts for language models on zero-shot language style customization. 2022. URL https://api.semanticscholar.org/CorpusID:254125409.
  17. In-context autoencoder for context compression in a large language model. ArXiv preprint, abs/2307.06945, 2023. URL https://arxiv.org/abs/2307.06945.
  18. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. ArXiv preprint, abs/2309.08532, 2023. URL https://arxiv.org/abs/2309.08532.
  19. Optimizing prompts for text-to-image generation. ArXiv preprint, abs/2212.09611, 2022. URL https://arxiv.org/abs/2212.09611.
  20. In-context learning creates task vectors. ArXiv preprint, abs/2310.15916, 2023. URL https://arxiv.org/abs/2310.15916.
  21. Distilling the knowledge in a neural network. ArXiv preprint, abs/1503.02531, 2015. URL https://arxiv.org/abs/1503.02531.
  22. John H. Holland. Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. 1992. URL https://api.semanticscholar.org/CorpusID:58781161.
  23. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp.  2790–2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/houlsby19a.html.
  24. Automatic engineering of long prompts. ArXiv preprint, abs/2311.10117, 2023. URL https://arxiv.org/abs/2311.10117.
  25. Llmlingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  13358–13376, 2023a.
  26. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. ArXiv preprint, abs/2310.06839, 2023b. URL https://arxiv.org/abs/2310.06839.
  27. Promptkd: Distilling student-friendly knowledge for generative language models via prompt tuning, 2024.
  28. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  3045–3059, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243.
  29. Ode transformer: An ordinary differential equation-inspired model for neural machine translation. ArXiv preprint, abs/2104.02308, 2021. URL https://arxiv.org/abs/2104.02308.
  30. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  4582–4597, Online, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.
  31. Compressing context to enhance inference efficiency of large language models. In Conference on Empirical Methods in Natural Language Processing, 2023. URL https://api.semanticscholar.org/CorpusID:263830231.
  32. Use your instinct: Instruction optimization using neural bandits coupled with transformers. ArXiv preprint, abs/2310.02905, 2023. URL https://arxiv.org/abs/2310.02905.
  33. Tcra-llm: Token compression retrieval augmented large language model for inference cost reduction. In Conference on Empirical Methods in Natural Language Processing, 2023a. URL https://api.semanticscholar.org/CorpusID:264439519.
  34. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023b.
  35. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  61–68, Dublin, Ireland, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.8. URL https://aclanthology.org/2022.acl-short.8.
  36. Learning to compress prompts with gist tokens. ArXiv preprint, abs/2304.08467, 2023. URL https://arxiv.org/abs/2304.08467.
  37. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. 2024. URL https://api.semanticscholar.org/CorpusID:268531237.
  38. Hypertuning: Toward adapting large language models without back-propagation. In International Conference on Machine Learning, 2022. URL https://api.semanticscholar.org/CorpusID:253761398.
  39. GrIPS: Gradient-free, edit-based instruction search for prompting large language models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  3845–3864, Dubrovnik, Croatia, 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.277.
  40. Measuring and narrowing the compositionality gap in language models. ArXiv preprint, abs/2210.03350, 2022. URL https://arxiv.org/abs/2210.03350.
  41. Automatic prompt optimization with "gradient descent" and beam search. In Conference on Empirical Methods in Natural Language Processing, 2023. URL https://api.semanticscholar.org/CorpusID:258546785.
  42. Nugget: Neural agglomerative embeddings of text. ArXiv preprint, abs/2310.01732, 2023. URL https://arxiv.org/abs/2310.01732.
  43. Improving language understanding by generative pre-training. 2018.
  44. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021. doi: 10.1162/tacl_a_00353. URL https://aclanthology.org/2021.tacl-1.4.
  45. Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:623–656, 1948. URL https://api.semanticscholar.org/CorpusID:55379485.
  46. Eliciting knowledge from language models using automatically generated prompts. ArXiv preprint, abs/2010.15980, 2020. URL https://arxiv.org/abs/2010.15980.
  47. Learning by distilling context. ArXiv preprint, abs/2209.15189, 2022. URL https://arxiv.org/abs/2209.15189.
  48. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11:341–359, 1997. URL https://api.semanticscholar.org/CorpusID:5297867.
  49. Roformer: Enhanced transformer with rotary position embedding. ArXiv preprint, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.
  50. Llama: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023. URL https://arxiv.org/abs/2302.13971.
  51. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.  5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
  52. Efficient large language models: A survey. ArXiv preprint, abs/2312.03863, 2023. URL https://arxiv.org/abs/2312.03863.
  53. Label words are anchors: An information flow perspective for understanding in-context learning. ArXiv preprint, abs/2305.14160, 2023a. URL https://arxiv.org/abs/2305.14160.
  54. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Annual Meeting of the Association for Computational Linguistics, 2023b. URL https://api.semanticscholar.org/CorpusID:258558102.
  55. Self-consistency improves chain of thought reasoning in language models. ArXiv preprint, abs/2203.11171, 2022. URL https://arxiv.org/abs/2203.11171.
  56. Emergent abilities of large language models. ArXiv preprint, abs/2206.07682, 2022a. URL https://arxiv.org/abs/2206.07682.
  57. Chain of thought prompting elicits reasoning in large language models. ArXiv preprint, abs/2201.11903, 2022b. URL https://arxiv.org/abs/2201.11903.
  58. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. ArXiv preprint, abs/2302.03668, 2023. URL https://arxiv.org/abs/2302.03668.
  59. Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp.  5621–5634, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.findings-emnlp.412.
  60. Introduction to transformers: an nlp perspective. ArXiv preprint, abs/2311.17633, 2023. URL https://arxiv.org/abs/2311.17633.
  61. Large language models as optimizers. ArXiv preprint, abs/2309.03409, 2023. URL https://arxiv.org/abs/2309.03409.
  62. Dialclip: Empowering clip as multi-modal dialog retriever. ArXiv preprint, abs/2401.01076, 2024. URL https://arxiv.org/abs/2401.01076.
  63. Least-to-most prompting enables complex reasoning in large language models. ArXiv preprint, abs/2205.10625, 2022a. URL https://arxiv.org/abs/2205.10625.
  64. Large language models are human-level prompt engineers. ArXiv preprint, abs/2211.01910, 2022b. URL https://arxiv.org/abs/2211.01910.

Summary

  • The paper categorizes efficient prompting into computation (prompt compression) and design (automatic optimization), framing it as a multi-objective optimization problem.
  • It reviews compression techniques like knowledge distillation, encoding, and filtering that reduce prompt length and computational overhead while preserving performance.
  • It discusses automatic prompt optimization via gradient-based and evolution-based methods, highlighting approaches to minimize human effort in prompt engineering.

Efficient Prompting Methods for LLMs: A Survey

"Efficient Prompting Methods for LLMs: A Survey" (2404.01077) provides a comprehensive and systematic review of methods aimed at improving the efficiency of prompt-based adaptation for LLMs. The survey is motivated by the increasing computational and human costs associated with prompt engineering, especially as prompt length and complexity grow to match the capabilities of modern LLMs. The authors categorize efficient prompting methods into two principal axes: efficient computation (prompt compression) and efficient design (automatic prompt optimization), and further abstract the field as a multi-objective optimization problem balancing computational cost and task accuracy.

Background and Motivation

Prompting has become the dominant paradigm for leveraging LLMs in downstream NLP tasks, replacing full-parameter fine-tuning with in-context learning (ICL) and instruction-based adaptation. However, as prompt complexity increases—driven by the need for demonstrations, chain-of-thought reasoning, and detailed instructions—two major challenges arise:

  • Computational Burden: Longer prompts increase memory usage and inference latency, and may exceed the context window of LLMs.
  • Human Effort: Manual prompt design is labor-intensive, highly sensitive to prompt phrasing, and lacks theoretical guidance.

The survey addresses these challenges by reviewing methods that either compress prompts to reduce computational overhead or automate prompt design to minimize human effort.

Efficient Computation: Prompt Compression

Prompt compression methods aim to reduce the length and redundancy of prompts while preserving or minimally degrading task performance. The survey identifies three main strategies:

1. Knowledge Distillation

Prompt-level knowledge distillation adapts the classic teacher-student paradigm to compress prompt information into more compact representations, often as soft prompts. Notable approaches include:

  • Context Distillation: Compresses lengthy, human-aligned prompts into internal representations, achieving comparable alignment and performance with significantly shorter prompts.
  • Prompt Compression via KL Minimization: Soft prompts are learned to minimize the divergence between the output distributions of models conditioned on long versus compressed prompts.
  • Gisting: Multi-task prompts are distilled into "gist tokens," achieving up to 26x compression with minimal performance loss.

These methods require access to model parameters for fine-tuning and often rely on synthetic data generated by teacher models.
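To make the KL-minimization idea concrete, here is a minimal, illustrative sketch in PyTorch: a handful of trainable soft-prompt vectors are optimized so that the model's next-token distribution conditioned on them approximates the distribution conditioned on a much longer hard prompt. It assumes an open-source causal LM (gpt2 is used only as a stand-in); names such as `soft_prompt`, `long_prompt`, and `num_soft_tokens` are placeholders, and matching only the final-position distribution is a simplification of what published methods actually train.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any open-source causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():           # only the soft prompt is trained
    p.requires_grad_(False)

embed = model.get_input_embeddings()
num_soft_tokens = 8                    # compressed prompt length (placeholder)
soft_prompt = torch.nn.Parameter(torch.randn(num_soft_tokens, embed.embedding_dim) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

long_prompt = "You are a careful assistant. Always answer concisely and state your assumptions. "
query = "Explain what a transformer is."
long_ids = tok(long_prompt + query, return_tensors="pt").input_ids
query_embeds = embed(tok(query, return_tensors="pt").input_ids)

with torch.no_grad():                  # teacher: model conditioned on the full hard prompt
    teacher_logits = model(long_ids).logits[:, -1, :]

for step in range(200):
    # Student: the same frozen model conditioned on soft prompt + query only.
    student_inputs = torch.cat([soft_prompt.unsqueeze(0), query_embeds], dim=1)
    student_logits = model(inputs_embeds=student_inputs).logits[:, -1, :]
    # KL divergence between teacher and student next-token distributions.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, methods in this family match distributions over full continuations rather than a single position and, as noted above, typically train on synthetic data generated by the teacher configuration.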

2. Encoding

Encoding methods convert hard prompts (natural language) into learnable vector representations (soft prompts), which are then prepended to the model input. Key techniques include:

  • Textual Inversion and Extensible Prompts: Imaginary tokens are introduced to encode complex or indescribable styles, with only these tokens being updated during training.
  • AutoCompressor and ICAE: Long documents or prompts are recursively compressed into summary vectors or memory slots, which can be cached and reused, reducing both memory and computation.

Encoding approaches are particularly effective for open-source LLMs where parameter access is available, and they facilitate pre-computation and efficient retrieval.
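The caching-and-reuse idea behind these encoders can be illustrated with a short, heavily simplified sketch: each document chunk is passed through the model once, a few final hidden states are kept as "summary vectors", and a later query conditions on the cached vectors instead of the full text. This shows the data flow only; real systems such as AutoCompressor and ICAE train the encoder (e.g., with adapters) so that the slots are actually consumable by the decoder. `NUM_SLOTS` and the toy document are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in open-source LM
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
embed = model.get_input_embeddings()

NUM_SLOTS = 4                                  # summary vectors kept per chunk (placeholder)
long_document = " ".join(["Efficient prompting reduces inference cost."] * 50)
chunks = [long_document[i:i + 400] for i in range(0, len(long_document), 400)]

memory = []                                    # cached summary vectors, one tensor per chunk
with torch.no_grad():
    for chunk in chunks:
        ids = tok(chunk, return_tensors="pt", truncation=True, max_length=256).input_ids
        hidden = model(ids, output_hidden_states=True).hidden_states[-1]
        memory.append(hidden[:, -NUM_SLOTS:, :])   # keep only the last NUM_SLOTS states

# Later, condition on the cached slots instead of re-encoding the whole document.
query_ids = tok("Summarize the document.", return_tensors="pt").input_ids
inputs = torch.cat(memory + [embed(query_ids)], dim=1)
with torch.no_grad():
    out = model(inputs_embeds=inputs)
print(out.logits.shape)   # sequence length is now len(chunks) * NUM_SLOTS + |query|
```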

3. Filtering

Filtering methods operate at the text-to-text level, using lightweight models to identify and remove redundant or low-utility information from prompts:

  • Selective Context: Quantifies the information content of lexical units with a self-information metric and filters out units with low self-information at the token, phrase, or sentence level.
  • LLMLingua and LLMLingua-2: Segment prompts into components (instructions, questions, demonstrations), dynamically allocate compression budgets, and iteratively compress tokens based on perplexity or binary classification objectives.

Filtering is model-agnostic and particularly valuable for closed-source LLMs, as it does not require access to model parameters.
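A minimal sketch of the self-information idea, in the spirit of Selective Context but not the authors' implementation: a small causal LM scores each token by its surprisal given the preceding context, and the lowest-scoring tokens are dropped. `keep_ratio` and the example sentence are placeholders, and filtering at the token level only (rather than merging tokens into phrases or sentences first) is a simplification.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # small scoring LM
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def filter_prompt(text: str, keep_ratio: float = 0.7) -> str:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Self-information of token t_i given its prefix: -log p(t_i | t_<i).
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = ids[:, 1:]
    self_info = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    # Keep the most informative tokens, preserving their original order.
    k = max(1, int(keep_ratio * self_info.numel()))
    keep = torch.topk(self_info, k).indices.sort().values
    kept_ids = torch.cat([ids[0, :1], ids[0, 1:][keep]])  # first token has no prefix, keep it
    return tok.decode(kept_ids)

print(filter_prompt("The quick brown fox jumps over the lazy dog near the old river bank."))
```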

Efficient Design: Automatic Prompt Optimization

Automatic prompt optimization seeks to replace manual prompt engineering with algorithmic search and optimization in the space of natural language prompts. The survey distinguishes between gradient-based and evolution-based approaches.

1. Gradient-Based Methods

  • Real-Gradient Tuning: For open-source models, discrete prompts are mapped to continuous embeddings, enabling optimization via gradient descent (e.g., AutoPrompt, RLPrompt, PEZ).
  • Imitated-Gradient Prompting: For black-box models, gradient information is approximated via edit-based search (GrIPS), LLM-generated feedback (APE, APO), or reinforcement learning with LLMs as reward models.

These methods have demonstrated the ability to discover high-performing prompts automatically, but are often limited by the discrete nature of language and the lack of differentiability in closed-source settings.
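A hedged sketch of the embedding-space search that underlies the real-gradient family (roughly in the spirit of PEZ-style optimization, greatly simplified and not any paper's reference code): a few prompt vectors are tuned by gradient descent against a toy next-token objective and then projected back to their nearest vocabulary embeddings to recover a readable hard prompt. The sentiment example, the 8-vector prompt length, and the single-token target are all placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
for p in model.parameters():                                 # model stays frozen
    p.requires_grad_(False)

embed_matrix = model.get_input_embeddings().weight           # (vocab_size, dim)
prompt_vecs = torch.nn.Parameter(embed_matrix[:8].clone())   # 8 trainable prompt vectors
optimizer = torch.optim.Adam([prompt_vecs], lr=1e-2)

x_ids = tok("The movie was great. Sentiment:", return_tensors="pt").input_ids
x_embeds = model.get_input_embeddings()(x_ids)
target = tok(" positive", return_tensors="pt").input_ids[:, 0]   # toy target token

for step in range(50):
    inputs = torch.cat([prompt_vecs.unsqueeze(0), x_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits[:, -1, :]
    loss = F.cross_entropy(logits, target)                   # push the model toward the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Project each learned vector onto its nearest token embedding to get a hard prompt.
with torch.no_grad():
    nearest = torch.cdist(prompt_vecs, embed_matrix).argmin(dim=-1)
print(tok.decode(nearest))
```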

2. Evolution-Based Methods

Evolutionary algorithms, inspired by biological evolution, are used to explore the prompt space via mutation, crossover, and selection:

  • OPRO and EvoPrompt: LLMs are used as optimizers, generating and evaluating candidate prompts based on meta-prompts that encode optimization trajectories.
  • Promptbreeder: Prompts and mutation strategies are co-evolved, enabling self-referential improvement and increased diversity.

Evolution-based methods are particularly suited to black-box LLMs and have shown promise in optimizing both short and long prompts, though scalability to very long prompts remains an open challenge.
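The basic loop can be sketched in a few lines; this is a schematic EvoPrompt-style search, not the published algorithm. `call_llm` and `score` are stand-ins (a word shuffle and a trivial length heuristic, so the snippet runs offline); in practice the first would ask an LLM to rewrite or recombine prompts and the second would measure dev-set accuracy on the target task.

```python
import random

def call_llm(meta_instruction: str, prompt: str) -> str:
    # Stand-in for an LLM call that would follow meta_instruction to rewrite `prompt`;
    # here it just shuffles the prompt's word order so the sketch runs offline.
    words = prompt.split()
    random.shuffle(words)
    return " ".join(words)

def score(prompt: str) -> float:
    # Stand-in for held-out dev-set accuracy; here a trivial "shorter is better" heuristic.
    return -float(len(prompt))

def evolve(seed_prompts, generations=10, population_size=8):
    population = list(seed_prompts)
    for _ in range(generations):
        # Mutation: ask the (stand-in) LLM to rewrite a randomly chosen parent.
        children = [call_llm("Paraphrase and improve this instruction.",
                             random.choice(population))
                    for _ in range(population_size)]
        # Crossover: merge two parents into one new instruction.
        p1, p2 = random.sample(population, 2)
        children.append(call_llm("Merge these two instructions into one.", p1 + " " + p2))
        # Selection: keep the highest-scoring prompts for the next generation.
        population = sorted(population + children, key=score, reverse=True)[:population_size]
    return population[0]

best = evolve(["Answer the question concisely.",
               "Think step by step, then give a short final answer."])
print(best)
```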

Theoretical Abstraction and Future Directions

The survey abstracts efficient prompting as a multi-objective optimization problem, balancing prompt compression (minimizing computational cost) against task accuracy; a schematic formulation is sketched after the list below. The authors propose that future research should focus on:

  • Information-Theoretic Filtering: Quantifying and retaining only the most beneficial information in prompts.
  • Co-Optimization of Hard and Soft Prompts: Jointly optimizing discrete and continuous representations for improved alignment and efficiency.
  • Model-Agnostic Compression: Developing methods that generalize across models and tasks, especially as closed-source LLMs become more prevalent.
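As a schematic illustration of this abstraction (in notation of our own choosing, not the paper's exact formulation), the trade-off can be written as a single weighted objective over candidate prompts:

```latex
% x: original prompt, \tilde{x}: compressed or optimized prompt, f: the LLM,
% \mathcal{L}: a task-quality loss comparing outputs, \lambda: trade-off weight.
\min_{\tilde{x}} \; \mathcal{L}\bigl(f(\tilde{x}),\, f(x)\bigr) \;+\; \lambda\,|\tilde{x}|
```

Under this reading, prompt compression mainly targets the prompt-length term, while automatic prompt optimization searches over candidate prompts to reduce the task-loss term without human intervention.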

Implications and Prospects

Efficient prompting methods have significant implications for both research and deployment:

  • Resource-Constrained Deployment: Compression and filtering enable LLMs to be used in environments with limited memory or compute, and reduce API costs for commercial applications.
  • Scalability: Automatic prompt optimization reduces the need for expert human intervention, facilitating rapid adaptation to new tasks and domains.
  • Alignment and Interpretability: As prompt engineering becomes more automated, understanding the relationship between prompt structure and model behavior will be critical for alignment and safety.

The survey highlights the need for continued exploration of information-theoretic, optimization-based, and hybrid approaches to efficient prompting. As LLMs continue to scale and diversify, efficient prompting will remain a central challenge for both practical deployment and theoretical understanding.
