PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning (2407.02211v2)

Published 2 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Recent advances in fine-tuning LLMs have greatly enhanced their usage in domain-specific tasks. Despite the success, fine-tuning continues to rely on repeated and lengthy prompts, which escalate computational expenses, require more resources, and lead to slower inference. In this paper, we present a novel approach, PromptIntern, which internalizes prompt knowledge during model fine-tuning to achieve efficient inference and save costs. Instead of compressing the prompts for a vanilla model, PromptIntern aims to embed the recurrent prompt directly into the model parameters. We design a fine-tuning pipeline that includes instruction template compression, few-shot example absorption, and a progressive internalization strategy, effectively diminishing the need for intricate prompts during inference. Comprehensive experiments on challenging NL2Code tasks demonstrate that our method reduces input tokens by more than 90%, accelerates inference by 4.2 times, and reduces monetary inference costs by 88.3%.
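
The abstract describes a fine-tuning pipeline built from instruction template compression, few-shot example absorption, and a progressive internalization strategy. The sketch below is a minimal illustration of that idea only, assuming a simple linear schedule that gradually strips the instruction template and few-shot examples from each training prompt; the function names, the 0.5/0.9 switch points, and the linear decay are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a progressive internalization schedule (assumed linear
# decay of prompt components over fine-tuning); the schedule shape, the
# compression mechanism, and all names here are illustrative, not the
# paper's actual implementation.

def build_training_prompt(
    instruction: str,
    compressed_instruction: str,
    examples: list[str],
    query: str,
    step: int,
    total_steps: int,
) -> str:
    """Assemble the prompt for one fine-tuning step.

    Early in training the model sees the full instruction template and all
    few-shot examples; as training progresses both are progressively removed,
    so that by the end only the bare query remains and the recurrent prompt
    knowledge has been pushed into the model parameters.
    """
    progress = step / max(total_steps, 1)  # 0.0 -> 1.0 over training

    # Few-shot example absorption: linearly drop examples as training advances.
    n_examples = round(len(examples) * (1.0 - progress))
    kept_examples = examples[:n_examples]

    # Instruction template compression: switch from the full template to a
    # compressed version mid-training, then drop it entirely near the end.
    if progress < 0.5:
        template = instruction
    elif progress < 0.9:
        template = compressed_instruction
    else:
        template = ""

    parts = [template] + kept_examples + [query]
    return "\n\n".join(p for p in parts if p)


def build_inference_prompt(query: str) -> str:
    # At inference time the recurrent prompt is gone: only the query is sent,
    # which is where the reported >90% input-token reduction would come from.
    return query
```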
