Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications (2310.00867v3)

Published 2 Oct 2023 in cs.CL and cs.AI

Abstract: Compressing LLMs often leads to reduced performance, especially on knowledge-intensive tasks. In this work, we examine how compression damages LLMs' inherent knowledge and what the possible remedies are. We begin by proposing two conjectures about the nature of the damage: one holds that certain knowledge is forgotten (or erased) after LLM compression, so the compressed model must (re)learn it from data using additional parameters; the other presumes that knowledge is internally displaced, so mere "inference re-direction" via input-side augmentation such as prompting suffices to recover knowledge-related performance. We then design extensive experiments to (in)validate the two conjectures. We observe that prompting is promising compared to model tuning, and we further unlock its potential by introducing a variant called Inference-time Dynamic Prompting (IDP), which effectively increases prompt diversity without incurring any inference overhead. Our experiments consistently suggest that, compared to classical re-training alternatives such as LoRA, prompting with IDP yields better or comparable post-compression performance recovery while using 21x fewer extra parameters and reducing inference latency by 60%. Our results hence strongly endorse the conjecture of "knowledge displaced" over "knowledge forgotten" and shed light on a new, efficient mechanism for restoring compressed LLM performance. We additionally visualize and analyze the differing attention and activation patterns of prompted and re-trained models, demonstrating that they achieve performance recovery in two different regimes.
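The abstract contrasts re-training (e.g., LoRA) with input-side prompting and introduces Inference-time Dynamic Prompting (IDP) as a way to increase prompt diversity without extra inference cost. As an illustration only, below is a minimal PyTorch sketch of one plausible reading of inference-time prompt selection: keep a small pool of learned soft prompts and, per input, prepend the best-matching one to the token embeddings of a frozen compressed model. The abstract does not specify IDP's actual mechanism, and every name here (PromptPool, select, the cosine-similarity scoring, the dimensions) is a hypothetical assumption rather than the paper's method.

```python
# Illustrative sketch only -- NOT the paper's IDP algorithm.
# Assumption: a small pool of trainable soft prompts; at inference time the
# prompt whose embedding best matches the input is prepended to the input.
import torch
import torch.nn.functional as F


class PromptPool(torch.nn.Module):
    """Pool of learned soft prompts; one is selected per input at inference."""

    def __init__(self, num_prompts: int = 4, prompt_len: int = 8, d_model: int = 512):
        super().__init__()
        # Each prompt is a (prompt_len, d_model) block of trainable embeddings.
        self.prompts = torch.nn.Parameter(
            torch.randn(num_prompts, prompt_len, d_model) * 0.02
        )

    @torch.no_grad()
    def select(self, input_embeds: torch.Tensor) -> torch.Tensor:
        """Pick the prompt most similar to the mean input embedding (cosine score)."""
        query = input_embeds.mean(dim=1)                     # (batch, d_model)
        keys = self.prompts.mean(dim=1)                      # (num_prompts, d_model)
        scores = F.cosine_similarity(
            query.unsqueeze(1), keys.unsqueeze(0), dim=-1
        )                                                    # (batch, num_prompts)
        best = scores.argmax(dim=-1)                         # (batch,)
        return self.prompts[best]                            # (batch, prompt_len, d_model)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        """Prepend the selected prompt so the frozen compressed LLM sees it as context."""
        chosen = self.select(input_embeds)
        return torch.cat([chosen, input_embeds], dim=1)


if __name__ == "__main__":
    pool = PromptPool()
    x = torch.randn(2, 16, 512)      # stand-in for token embeddings of 2 inputs
    augmented = pool(x)
    print(augmented.shape)           # torch.Size([2, 24, 512])
```

Under this reading, only the small prompt pool carries extra parameters while the compressed model stays frozen, which is consistent with the abstract's claim of recovering performance with far fewer added parameters than re-training approaches such as LoRA.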
