MiniLLM: Knowledge Distillation of Large Language Models (2306.08543v4)
Abstract: Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of LLMs. However, previous KD methods are primarily applied to white-box classification models or used to train small models that imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, a problem that becomes more important with the rapid growth of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller LLMs. We first replace the forward Kullback-Leibler divergence (KLD) objective in standard KD approaches with reverse KLD, which is more suitable for KD on generative LLMs, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and better long-text generation performance than the baselines. Our method scales across model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found at https://github.com/microsoft/LMOps/tree/main/miniLLM.
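To make the distinction between the two objectives concrete, the sketch below contrasts the token-level forward KLD used in standard KD with the reverse KLD the abstract describes. This is a minimal illustration only: the paper derives a dedicated optimization procedure for the sequence-level reverse KLD, which this token-level comparison does not reproduce, and the function names and tensor shapes are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def forward_kl(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """Standard KD objective: KL(p_teacher || q_student), averaged over positions.

    The expectation is taken under the teacher distribution, so the student is
    pushed to cover all of the teacher's probability mass, including its
    low-probability regions.
    """
    p = F.softmax(teacher_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    return (p * (log_p - log_q)).sum(dim=-1).mean()

def reverse_kl(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """Reverse objective: KL(q_student || p_teacher), averaged over positions.

    The expectation is taken under the student distribution, so the student is
    penalized for placing mass where the teacher assigns low probability,
    discouraging overestimation of the teacher's low-probability regions.
    """
    q = F.softmax(student_logits, dim=-1)
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    return (q * (log_q - log_p)).sum(dim=-1).mean()

# Toy usage with made-up shapes: (batch, sequence_length, vocab_size).
teacher_logits = torch.randn(2, 8, 32000)
student_logits = torch.randn(2, 8, 32000, requires_grad=True)
print(forward_kl(teacher_logits, student_logits).item(),
      reverse_kl(teacher_logits, student_logits).item())
```

In practice, the reverse direction requires sampling from (or differentiating through) the student's own distribution over whole sequences, which is why the paper pairs it with a derived optimization approach rather than the plain token-level loss shown here.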