
MiniLLM: Knowledge Distillation of Large Language Models (2306.08543v4)

Published 14 Jun 2023 in cs.CL and cs.AI

Abstract: Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of LLMs. However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller LLMs. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative LLMs, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/miniLLM.


Summary

  • The paper introduces a novel reverse KLD approach for knowledge distillation that mitigates overestimation of low-probability outputs in generative tasks.
  • It employs optimization techniques such as single-step decomposition and teacher-mixed sampling to stabilize training and improve model convergence.
  • The method scales across students from 120M to 13B parameters, achieving higher response quality, lower exposure bias, and better calibration than the baselines.

MiniLLM: Knowledge Distillation of LLMs

The paper "MiniLLM: Knowledge Distillation of LLMs" introduces a novel approach to distilling LLMs into smaller, more efficient ones through a method known as knowledge distillation (KD). The work specifically targets the need for more effective strategies in white-box KD, where the teacher model's distributional outputs are leveraged for distillation, aiming to overcome the limitations of existing techniques primarily used in classification tasks or with black-box model APIs.

Reverse Kullback-Leibler Divergence

The central innovation in this research is the replacement of the traditional forward Kullback-Leibler divergence (KLD) with reverse KLD as the distillation objective. This shift addresses the problem of the student overestimating low-probability regions of the teacher distribution, a common issue when using forward KLD for generative tasks. Reverse KLD is mode-seeking: it encourages the student to concentrate probability mass on the major modes of the teacher distribution rather than spreading it over regions the teacher assigns little probability (Figure 1).

Figure 1: Comparison between sequence-level KD (left) and MiniLLM (right). Sequence-level KD forces the student to memorize all samples generated by the teacher model, while MiniLLM improves its generated texts with the teacher model's feedback.
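
For concreteness, the two objectives can be written as follows, with p the teacher distribution, q_θ the student, and x the prompt; this is the standard formulation of the two divergences rather than a verbatim excerpt from the paper.

```latex
% Forward KLD (standard KD): expectation under the teacher; the student is
% penalized wherever it fails to cover mass that the teacher assigns.
\mathrm{KL}(p \,\|\, q_\theta)
  = \mathbb{E}_{y \sim p(\cdot \mid x)}\!\left[\log \frac{p(y \mid x)}{q_\theta(y \mid x)}\right]

% Reverse KLD (MiniLLM): expectation under the student; the student is
% penalized for placing mass where the teacher's probability is low.
\theta^{*} = \arg\min_{\theta}\; \mathrm{KL}(q_\theta \,\|\, p)
  = \arg\min_{\theta}\; \mathbb{E}_{y \sim q_\theta(\cdot \mid x)}\!\left[\log \frac{q_\theta(y \mid x)}{p(y \mid x)}\right]
```

Because the expectation in the reverse direction is taken over the student's own samples, the student is never penalized for failing to cover low-probability tails of the teacher, which is the mode-seeking behavior described above.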

Optimization and Algorithmic Enhancements

To optimize the reverse KLD objective efficiently, the authors propose several optimization strategies. Because the expectation is taken over the student's own samples, the gradient of the objective is derived with policy gradient methods, treating text generation as sequential decision making. Key strategies include:

  • Single-Step Decomposition: Reduces the variance of the policy-gradient estimate by computing the single-step generation quality exactly at each position.
  • Teacher-Mixed Sampling: Mitigates reward hacking by blending the teacher and student distributions during sampling (sketched after this list).
  • Length Normalization: Corrects for biases towards short outputs by normalizing reward accumulations over sequence length.
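
A minimal sketch of teacher-mixed sampling is given below. It assumes access to the teacher's and student's next-token logits and uses an illustrative mixing weight alpha; the exact blending schedule from the paper is not reproduced here.

```python
import torch
import torch.nn.functional as F

def teacher_mixed_sample(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         alpha: float = 0.2) -> torch.Tensor:
    """Sample the next token from a mixture of teacher and student distributions.

    Blending in the teacher (weight `alpha`) keeps on-policy samples from
    drifting into degenerate regions the student happens to favor, which is
    how the strategy mitigates reward hacking. `alpha` here is illustrative.
    """
    p_student = F.softmax(student_logits, dim=-1)   # (batch, vocab)
    p_teacher = F.softmax(teacher_logits, dim=-1)   # (batch, vocab)
    p_mix = alpha * p_teacher + (1.0 - alpha) * p_student
    return torch.multinomial(p_mix, num_samples=1)  # (batch, 1) sampled token ids
```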

These contributions stabilize training and improve convergence, ensuring the student model learns efficiently from the teacher's guidance.
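
As a rough illustration of how these pieces fit together, the following sketch computes a REINFORCE-style surrogate loss from per-token log-probabilities of student-sampled sequences, with a simple length normalization of the accumulated reward. It is a simplified approximation under assumed tensor shapes, not a faithful reimplementation of the paper's algorithm.

```python
import torch

def reverse_kld_pg_loss(student_logprobs: torch.Tensor,
                        teacher_logprobs: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate loss for the reverse-KLD objective (sketch).

    student_logprobs, teacher_logprobs: (batch, seq_len) log-probabilities of
    tokens sampled from the student, scored by each model.
    mask: (batch, seq_len), 1.0 for valid tokens, 0.0 for padding.
    """
    # Per-token reward: how much more probable the teacher finds each sampled token.
    r = (teacher_logprobs - student_logprobs.detach()) * mask
    # Reward-to-go: sum of rewards from each position to the end of the sequence.
    reward_to_go = torch.flip(torch.cumsum(torch.flip(r, dims=[1]), dim=1), dims=[1])
    # Length normalization: divide by the number of remaining valid tokens,
    # removing the bias toward very short outputs.
    remaining = torch.flip(torch.cumsum(torch.flip(mask, dims=[1]), dim=1), dims=[1]).clamp(min=1.0)
    advantage = (reward_to_go / remaining).detach()
    # Policy gradient: raise the log-probability of tokens with positive advantage.
    return -(advantage * student_logprobs * mask).sum() / mask.sum()
```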

Scalability and Performance

The experiments demonstrate that MiniLLM consistently delivers strong performance across models of varying sizes, from 120M to 13B parameters, on instruction-following tasks. Its scalability is evidenced by gains that hold as both student and teacher grow. The effect is especially notable for precision as measured by ROUGE-L, where MiniLLM sometimes even surpasses its teacher models, a result the authors attribute to reduced exposure bias (Figure 2).

Figure 2: Scaling with teacher model size, based on the OPT family. MiniLLM and SeqKD are compared with OPT-1.3B as the student and OPT 2.7B, 6.7B, and 13B as teachers.

Implications for Calibration and Exposure Bias

MiniLLM's focus on optimizing reverse KLD yields better-calibrated models with lower exposure bias. The calibration improvements are significant, avoiding a pitfall of standard KD, whose forward-KLD objective tends to distort the student's probability distribution. Moreover, the results suggest a promising direction for calibrating student models without sacrificing the generative diversity needed for robust NLP applications (Figure 3).

Figure 3: The excess error caused by the training-decoding discrepancy (ExAccErr) accumulated with the generation length. Lower ExAccErr means the method introduces less exposure bias.

Conclusion

MiniLLM represents a significant advancement in the field of KD for LLMs, providing a scalable, efficient approach to reduce the computational requirements of deploying large models. By rethinking the divergence metric and optimizing the training procedure, the methodology not only enhances performance and scalability but also addresses fundamental challenges in model calibration and bias. This work sets the stage for broader applications of distillation techniques, potentially influencing future AI systems' design and deployment strategies.
