LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (2310.05736v2)

Published 9 Oct 2023 in cs.CL and cs.LG

Abstract: LLMs have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between LLMs. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23; showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.

Summary

  • The paper introduces LLMLingua, a novel method that compresses prompts for accelerated inference while preserving performance.
  • The approach combines a budget controller, token-level iterative compression, and distribution alignment to maintain semantic integrity.
  • Experimental results on multiple datasets demonstrate that the method achieves up to 20x compression with minimal degradation in reasoning and task performance.

Overview of the Paper LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Introduction

The paper by Huiqiang Jiang et al., “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models,” addresses the computational cost of long prompts in LLM inference. With techniques such as chain-of-thought (CoT) prompting and in-context learning (ICL), prompts can exceed tens of thousands of tokens, which makes inference slow and expensive. The authors propose LLMLingua, a coarse-to-fine prompt compression method that aims to speed up inference and reduce cost with little loss in task performance.

Methodology

LLMLingua is structured around three key components: a budget controller, a token-level iterative compression algorithm, and an instruction tuning method for distribution alignment.

Budget Controller

The budget controller assigns different compression ratios to the components of a prompt: instructions, demonstrations, and questions. Ratios are set according to the significance of each component, so that the most important information is retained. At the coarse level, the controller ranks demonstrations by their perplexity, computed with a small LLM, and keeps only a subset of them.
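The demonstration-selection step can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_demonstrations`, the whitespace token count, the choice to prefer higher-perplexity demonstrations, and the hard-coded scores are all assumptions made for illustration. A real system would obtain perplexities from a small LLM.

```python
from typing import Callable, List


def select_demonstrations(
    demos: List[str],
    budget_tokens: int,
    perplexity: Callable[[str], float],
) -> List[str]:
    """Keep demonstrations, highest perplexity first, within a token budget."""
    # Rank demonstration indices by perplexity (descending order is an assumption).
    ranked = sorted(
        range(len(demos)), key=lambda i: perplexity(demos[i]), reverse=True
    )
    kept, used = set(), 0
    for i in ranked:
        cost = len(demos[i].split())  # whitespace tokens stand in for a tokenizer
        if used + cost <= budget_tokens:
            kept.add(i)
            used += cost
    # Return the surviving demonstrations in their original prompt order.
    return [demos[i] for i in sorted(kept)]


# Hypothetical perplexity scores; a real system would query a small LLM.
scores = {"x y": 5.0, "p q r": 9.0, "m": 1.0}
selected = select_demonstrations(
    ["x y", "p q r", "m"], budget_tokens=4, perplexity=scores.get
)
print(selected)  # ['p q r', 'm']
```

Demonstrations that do not fit the remaining budget are skipped rather than truncated, which keeps each retained demonstration intact; finer, token-level trimming is left to the next stage.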

Token-Level Iterative Compression

The iterative compression algorithm accounts for the interdependence between tokens. It divides the prompt into segments and compresses them one at a time at the token level, using conditional probabilities so that each decision reflects the content that precedes it. Unlike single-pass compression, this iterative approach takes earlier compression decisions into account, which helps preserve the semantic integrity and coherence of the prompt.
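The segment-by-segment idea can be sketched with a toy model. Everything here is an assumption for illustration: a smoothed bigram model (`BigramLM`) stands in for the small LLM, surprisal (negative log probability) is used as the token score, and a fixed `keep_ratio` per segment replaces whatever thresholding the paper uses. The point is only that each segment is scored conditioned on what was kept from earlier segments.

```python
import math
from collections import Counter, defaultdict
from typing import List


class BigramLM:
    """Add-one-smoothed bigram model; a toy stand-in for a small LLM."""

    def __init__(self, corpus: List[List[str]]):
        self.bigrams = defaultdict(Counter)
        self.vocab = set()
        for sentence in corpus:
            prev = "<s>"
            for tok in sentence:
                self.bigrams[prev][tok] += 1
                self.vocab.add(tok)
                prev = tok

    def surprisal(self, prev: str, tok: str) -> float:
        """Return -log P(tok | prev) with add-one smoothing."""
        row = self.bigrams.get(prev, Counter())
        total = sum(row.values())
        return -math.log((row[tok] + 1) / (total + len(self.vocab) + 1))


def iterative_compress(
    tokens: List[str], lm: BigramLM, segment_size: int, keep_ratio: float
) -> List[str]:
    """Compress segment by segment, conditioning on already-kept tokens."""
    compressed: List[str] = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        # Condition the first token on the last token that survived compression.
        prev = compressed[-1] if compressed else "<s>"
        scores = []
        for tok in segment:
            scores.append(lm.surprisal(prev, tok))
            prev = tok
        n_keep = max(1, math.ceil(keep_ratio * len(segment)))
        # Keep the most surprising tokens, then restore their original order.
        top = sorted(range(len(segment)), key=lambda i: scores[i], reverse=True)
        compressed.extend(segment[i] for i in sorted(top[:n_keep]))
    return compressed


corpus = [s.split() for s in ["the cat sat on the mat", "the cat ran"]]
prompt = "the cat sat on the mat the cat ran".split()
result = iterative_compress(prompt, BigramLM(corpus), segment_size=4, keep_ratio=0.5)
print(result)
```

Because the context for each segment is the compressed prefix rather than the original text, a token's score can change depending on what was dropped before it, which is the dependency a single-pass scorer ignores.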

Distribution Alignment

To reduce the distribution gap between the small LLM used for compression and the target LLM, the authors apply instruction tuning: the small LLM is fine-tuned on data generated by the target LLM, so that its compression decisions align better with the target model.
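As a rough sketch of the data-collection side of this step, the snippet below pairs instructions with responses produced by a target model, yielding records that a standard fine-tuning pipeline could consume. The names `build_alignment_pairs` and `target_generate`, the record keys, and the fake target model are hypothetical; the source does not specify the data format or the fine-tuning setup.

```python
from typing import Callable, Dict, Iterable, List


def build_alignment_pairs(
    instructions: Iterable[str],
    target_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Collect (instruction, target-LLM response) pairs for fine-tuning."""
    pairs = []
    for instruction in instructions:
        response = target_generate(instruction).strip()
        if response:  # skip empty generations
            pairs.append({"prompt": instruction, "completion": response})
    return pairs


def fake_target(instruction: str) -> str:
    """Hypothetical stand-in for a call to the target LLM."""
    return "" if not instruction else f"Answer to: {instruction} "


pairs = build_alignment_pairs(["Add 2 and 3.", ""], fake_target)
print(pairs)
```

The small LLM would then be fine-tuned on these pairs with an ordinary language-modeling loss; that training loop is omitted here.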

Experimental Results

LLMLingua was evaluated on four datasets from different scenarios: GSM8K and BBH for reasoning and in-context learning (ICL), ShareGPT for conversations, and Arxiv-March23 for summarization. The authors report state-of-the-art performance, with up to 20x compression and little performance loss.

  1. GSM8K and BBH: These datasets test mathematical and logical reasoning. LLMLingua largely preserved reasoning accuracy even at high compression ratios (up to 20x), with performance close to that of the full-shot prompts.
  2. ShareGPT and Arxiv-March23: On the conversation and summarization tasks, the method achieved a high BERTScore F1 under several compression constraints while substantially shortening the prompt.
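To make the "20x" figure concrete, the arithmetic below treats the compression ratio as original tokens divided by compressed tokens, so 20x corresponds to retaining 5% of the prompt. The function names and the 2,000-token prompt are hypothetical and serve only as a worked example.

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Ratio of original to compressed prompt length (e.g. 20.0 for 20x)."""
    if compressed_tokens <= 0:
        raise ValueError("compressed_tokens must be positive")
    return original_tokens / compressed_tokens


def token_budget(original_tokens: int, ratio: float) -> int:
    """Number of tokens left after compressing by the given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return max(1, round(original_tokens / ratio))


# Hypothetical 2,000-token prompt compressed 20x.
budget = token_budget(2000, 20)
print(budget)                           # 100
print(compression_ratio(2000, budget))  # 20.0
```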

The paper also includes an extensive ablation study that examines the contribution of each component. Both the iterative token-level compression and the budget controller were shown to substantially improve the performance and robustness of the method.

Implications and Future Directions

The benefits of LLMLingua extend beyond computational savings. Because a compressed prompt uses far fewer tokens, LLMs can process longer contexts and handle more extensive inputs, which matters for real-world applications such as automated conversation agents and document summarization systems.

Looking ahead, the research proposes avenues like integrating compression mechanisms directly within LLMs and developing adaptive compression ratios that dynamically adjust based on the prompt’s context. Another potential direction is the exploration of guided generation techniques where models can proactively suggest optimal compressed prompts, further enhancing efficiency.

Conclusion

The paper presents prompt compression as a practical way to make LLM inference faster and cheaper. LLMLingua largely preserves the integrity and performance of compressed prompts across reasoning, conversation, and summarization tasks. Future work could refine these compression techniques further in support of efficient, scalable deployment.
