DistiLLM: Towards Streamlined Distillation for Large Language Models

(2402.03898)
Published Feb 6, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing its inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., LLMs) suffer from missing a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatches has significantly escalated computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency in utilizing student-generated outputs. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3× speedup compared to recent KD methods.

Figure: Comparison of training time across generation methods, highlighting the adaptive off-policy approach's efficiency.

Overview

  • DistiLLM is a knowledge distillation framework for LLMs focused on efficiency and reduced computational cost.

  • It introduces a skew Kullback-Leibler divergence loss function and an adaptive off-policy approach to address common KD challenges.

  • The framework achieves faster convergence and outperforms existing KD methods in speed and student model performance.

  • DistiLLM achieves state-of-the-art results on a variety of generative tasks and is particularly well suited to compute-constrained environments.

Overview of DistiLLM Framework

DistiLLM is a knowledge distillation (KD) framework designed to efficiently transfer knowledge from LLMs to smaller counterparts. The framework addresses two significant challenges: the absence of a standardized objective function and high computational costs associated with student-generated outputs (SGO) during training.

Introduction to Knowledge Distillation Challenges

The primary goal of KD is to condense the knowledge of a cumbersome teacher model into a more agile student model, preserving performance while reducing computational load. Despite its potential, KD for LLMs has faced hurdles due to non-standardized loss functions and disparities between the training data distribution and the distribution of sequences the student actually produces at inference time, known as exposure bias. These challenges have led to suboptimal results, particularly for generative tasks: depending on whether forward or reverse KLD is used, the student either over-smooths the teacher's output distribution or concentrates too narrowly on its modes, as the toy example below illustrates.
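To see these two failure modes concretely, here is a small illustrative computation (not from the paper) on a three-token vocabulary. Forward KL, KL(p‖q), barely penalizes a student that spreads mass across all tokens but punishes one that drops a teacher mode; reverse KL, KL(q‖p), does the opposite.

```python
import torch

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return (p * (p / q).log()).sum()

teacher = torch.tensor([0.49, 0.02, 0.49])   # bimodal teacher
spread  = torch.tensor([0.34, 0.32, 0.34])   # over-smoothed student
peaked  = torch.tensor([0.90, 0.05, 0.05])   # overly concentrated student

# Forward KL (mode-covering) favors the smoothed student (~0.30 vs ~0.80)...
print(kl(teacher, spread), kl(teacher, peaked))
# ...while reverse KL (mode-seeking) favors the peaked one (~0.64 vs ~0.48).
print(kl(spread, teacher), kl(peaked, teacher))
```

Neither objective alone matches the teacher faithfully, which is what motivates the skewed divergence introduced next.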

Innovations in DistiLLM

The DistiLLM framework presents two innovations: a skew Kullback-Leibler divergence (skew KLD) loss and an adaptive off-policy approach. The skew KLD introduces a skew parameter α that mixes the teacher and student distributions inside the divergence, and the paper analyzes how this choice improves optimization stability and convergence. Empirical results indicate faster convergence and superior performance compared to conventional KLD objectives.
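As a minimal sketch, assuming teacher and student share a vocabulary and produce per-token logits, the α-skew KLD, KL(p ‖ αp + (1−α)q), can be written in PyTorch as follows (the function name and default α are illustrative, not taken from the authors' released code):

```python
import torch
import torch.nn.functional as F

def skew_kl_divergence(teacher_logits: torch.Tensor,
                       student_logits: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """Skew KLD: KL(p || alpha * p + (1 - alpha) * q),
    where p is the teacher distribution and q the student's.

    Mixing a fraction of p into the second argument keeps it from
    assigning near-zero mass to tokens the teacher favors, which
    bounds the loss and tames gradient spikes.
    """
    p = F.softmax(teacher_logits.detach(), dim=-1)  # no grad through teacher
    q = F.softmax(student_logits, dim=-1)
    mixture = alpha * p + (1.0 - alpha) * q
    kl = (p * (torch.log(p + 1e-9) - torch.log(mixture + 1e-9))).sum(dim=-1)
    return kl.mean()  # average over batch and sequence positions
```

For loss terms computed on student-generated outputs, the paper pairs this with an analogous skew reverse KLD, KL(q ‖ αq + (1−α)p), obtained by swapping the roles of the two distributions.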

The adaptive off-policy approach leverages SGOs efficiently while managing the risk of noisy feedback and reducing the computational burden of generation. By adaptively adjusting how often training draws on SGOs, based on the student's validation performance, and reusing previously generated outputs off-policy, DistiLLM achieves substantial training speed improvements, up to 4.3 times faster than recent KD methods, without compromising the student model's capabilities.
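The scheduling idea can be sketched as follows. This is a simplified, hypothetical rendering, assuming the probability of training on SGOs grows only when validation loss plateaus and that previously generated outputs are cached in a replay buffer for off-policy reuse; all class and method names here are invented for illustration:

```python
import random

class AdaptiveSGOScheduler:
    """Hypothetical sketch of an adaptive off-policy SGO schedule.

    Tracks a probability `p_sgo` of training on student-generated
    outputs (SGOs) instead of fixed ground-truth data. The probability
    rises only when validation loss stops improving, so noisy SGOs are
    relied on just when the fixed data stops helping.
    """

    def __init__(self, p_init=0.0, step=0.1, p_max=1.0, reuse_prob=0.5):
        self.p_sgo = p_init
        self.step = step
        self.p_max = p_max
        self.reuse_prob = reuse_prob
        self.best_val_loss = float("inf")
        self.replay_buffer = []  # cached student generations

    def update(self, val_loss):
        # Increase reliance on SGOs only when validation loss plateaus.
        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
        else:
            self.p_sgo = min(self.p_sgo + self.step, self.p_max)

    def next_batch(self, ground_truth_batch, generate_fn):
        if random.random() >= self.p_sgo:
            return ground_truth_batch
        # Off-policy reuse: prefer cached generations over fresh ones.
        if self.replay_buffer and random.random() < self.reuse_prob:
            return random.choice(self.replay_buffer)
        batch = generate_fn()  # occasional fresh (and costly) generation
        self.replay_buffer.append(batch)
        return batch
```

The efficiency gain comes from the replay buffer: expensive autoregressive generation runs only occasionally, and most SGO batches are reused rather than regenerated.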

Empirical Validation and Performance

Extensive experiments on tasks such as instruction-following, text summarization, and machine translation validate the efficacy of DistiLLM. Not only does it achieve state-of-the-art performance for student LLMs across a variety of generative tasks, but it also offers a much-needed speedup in training time. Particularly notable is its ability to consistently outperform existing KD methodologies while operating within constrained computational budgets.

Conclusion

The DistiLLM framework significantly advances the efficient distillation of LLMs. It not only overcomes the previous challenges associated with KD but also sets a new standard in producing capable and efficient smaller language models. Its dual focus on effective knowledge transfer and training efficiency renders it instrumental for broader adoption and deployment of LLMs in resource-limited environments.
