
MiniLLM: Knowledge Distillation of Large Language Models (2306.08543v4)

Published 14 Jun 2023 in cs.CL and cs.AI

Abstract: Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of LLMs. However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller LLMs. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative LLMs, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/miniLLM.


Summary

  • The paper introduces a novel reverse KLD approach for knowledge distillation that mitigates overestimation of low-probability outputs in generative tasks.
  • It employs optimization techniques such as single-step decomposition and teacher-mixed sampling to stabilize training and improve model convergence.
  • The method scales across students from 120M to 13B parameters, achieving higher response quality, lower exposure bias, and better calibration than the baselines.

MiniLLM: Knowledge Distillation of LLMs

The paper "MiniLLM: Knowledge Distillation of LLMs" introduces a novel approach to distilling LLMs into smaller, more efficient ones through a method known as knowledge distillation (KD). The work specifically targets the need for more effective strategies in white-box KD, where the teacher model's distributional outputs are leveraged for distillation, aiming to overcome the limitations of existing techniques primarily used in classification tasks or with black-box model APIs.

Reverse Kullback-Leibler Divergence

The central innovation in this research is the replacement of the traditional forward Kullback-Leibler divergence (KLD) with reverse KLD as the distillation objective. This shift addresses the problem of the student overestimating low-probability regions of the teacher distribution, a common issue when using forward KLD for generative tasks. Reverse KLD is mode-seeking: it encourages the student to concentrate probability mass on the major modes of the teacher distribution rather than spreading it over regions the teacher assigns little probability (Figure 1).

Figure 1: Comparison between sequence-level KD (left) and MiniLLM (right). Sequence-level KD forces the student to memorize all samples generated by the teacher model, while MiniLLM improves its generated texts with the teacher model's feedback.
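
For concreteness, the two objectives can be written as follows, with p the teacher distribution, q_θ the student, and x the prompt; this is the standard formulation of the two divergences rather than a verbatim excerpt from the paper.

```latex
% Forward KLD (standard KD): expectation under the teacher; the student is
% penalized wherever it fails to cover mass that the teacher assigns.
\mathrm{KL}(p \,\|\, q_\theta)
  = \mathbb{E}_{y \sim p(\cdot \mid x)}\!\left[\log \frac{p(y \mid x)}{q_\theta(y \mid x)}\right]

% Reverse KLD (MiniLLM): expectation under the student; the student is
% penalized for placing mass where the teacher's probability is low.
\theta^{*} = \arg\min_{\theta}\; \mathrm{KL}(q_\theta \,\|\, p)
  = \arg\min_{\theta}\; \mathbb{E}_{y \sim q_\theta(\cdot \mid x)}\!\left[\log \frac{q_\theta(y \mid x)}{p(y \mid x)}\right]
```

Because the expectation in the reverse direction is taken over the student's own samples, the student is never penalized for failing to cover low-probability tails of the teacher, which is the mode-seeking behavior described above.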

Optimization and Algorithmic Enhancements

To optimize the reverse KLD objective efficiently, the authors propose several optimization strategies. Because the expectation is taken over the student's own samples, the gradient of the objective is derived with policy gradient methods, treating text generation as sequential decision making. Key strategies include:

  • Single-Step Decomposition: Reduces the variance of the policy-gradient estimate by computing the single-step generation quality exactly at each position.
  • Teacher-Mixed Sampling: Mitigates reward hacking by blending the teacher and student distributions during sampling (sketched after this list).
  • Length Normalization: Corrects for biases towards short outputs by normalizing reward accumulations over sequence length.
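
A minimal sketch of teacher-mixed sampling is given below. It assumes access to the teacher's and student's next-token logits and uses an illustrative mixing weight alpha; the exact blending schedule from the paper is not reproduced here.

```python
import torch
import torch.nn.functional as F

def teacher_mixed_sample(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         alpha: float = 0.2) -> torch.Tensor:
    """Sample the next token from a mixture of teacher and student distributions.

    Blending in the teacher (weight `alpha`) keeps on-policy samples from
    drifting into degenerate regions the student happens to favor, which is
    how the strategy mitigates reward hacking. `alpha` here is illustrative.
    """
    p_student = F.softmax(student_logits, dim=-1)   # (batch, vocab)
    p_teacher = F.softmax(teacher_logits, dim=-1)   # (batch, vocab)
    p_mix = alpha * p_teacher + (1.0 - alpha) * p_student
    return torch.multinomial(p_mix, num_samples=1)  # (batch, 1) sampled token ids
```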

These contributions stabilize training and improve convergence, ensuring the student model learns efficiently from the teacher's guidance.
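
As a rough illustration of how these pieces fit together, the following sketch computes a REINFORCE-style surrogate loss from per-token log-probabilities of student-sampled sequences, with a simple length normalization of the accumulated reward. It is a simplified approximation under assumed tensor shapes, not a faithful reimplementation of the paper's algorithm.

```python
import torch

def reverse_kld_pg_loss(student_logprobs: torch.Tensor,
                        teacher_logprobs: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate loss for the reverse-KLD objective (sketch).

    student_logprobs, teacher_logprobs: (batch, seq_len) log-probabilities of
    tokens sampled from the student, scored by each model.
    mask: (batch, seq_len), 1.0 for valid tokens, 0.0 for padding.
    """
    # Per-token reward: how much more probable the teacher finds each sampled token.
    r = (teacher_logprobs - student_logprobs.detach()) * mask
    # Reward-to-go: sum of rewards from each position to the end of the sequence.
    reward_to_go = torch.flip(torch.cumsum(torch.flip(r, dims=[1]), dim=1), dims=[1])
    # Length normalization: divide by the number of remaining valid tokens,
    # removing the bias toward very short outputs.
    remaining = torch.flip(torch.cumsum(torch.flip(mask, dims=[1]), dim=1), dims=[1]).clamp(min=1.0)
    advantage = (reward_to_go / remaining).detach()
    # Policy gradient: raise the log-probability of tokens with positive advantage.
    return -(advantage * student_logprobs * mask).sum() / mask.sum()
```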

Scalability and Performance

The experiments demonstrate that MiniLLM consistently delivers strong performance across models of varying sizes, from 120M to 13B parameters, on instruction-following tasks. Its scalability is evidenced by gains that hold as both student and teacher grow. The effect is especially notable for precision as measured by ROUGE-L, where MiniLLM sometimes even surpasses its teacher models, a result the authors attribute to reduced exposure bias (Figure 2).

Figure 2: Scaling with teacher model size, based on the OPT family. MiniLLM and SeqKD are compared with OPT-1.3B as the student and OPT 2.7B, 6.7B, and 13B as teachers.

Implications for Calibration and Exposure Bias

MiniLLM's focus on optimizing reverse KLD yields better-calibrated models with lower exposure bias. The calibration improvements are significant, avoiding a pitfall of standard KD, whose forward-KLD objective tends to distort the student's probability distribution. Moreover, the results suggest a promising direction for calibrating student models without sacrificing the generative diversity needed for robust NLP applications (Figure 3).

Figure 3: The excess error caused by the training-decoding discrepancy (ExAccErr) accumulated with the generation length. Lower ExAccErr means the method introduces less exposure bias.

Conclusion

MiniLLM represents a significant advancement in the field of KD for LLMs, providing a scalable, efficient approach to reduce the computational requirements of deploying large models. By rethinking the divergence metric and optimizing the training procedure, the methodology not only enhances performance and scalability but also addresses fundamental challenges in model calibration and bias. This work sets the stage for broader applications of distillation techniques, potentially influencing future AI systems' design and deployment strategies.
