
A Survey on Knowledge Distillation of Large Language Models

(2402.13116)
Published Feb 20, 2024 in cs.CL

Abstract

In the era of LLMs, Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models and facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD's role within the realm of LLMs, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization -- providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts. This work aims to provide an insightful guide for researchers and practitioners, offering a detailed overview of current methodologies in KD and proposing future research directions. Importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated GitHub repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs.

Figure: A pipeline distilling knowledge from a large teacher LLM into a smaller student model.

Overview

  • Knowledge Distillation (KD) presents a way to mitigate the accessibility barriers of proprietary LLMs by distilling their capabilities into smaller, more accessible open-source models.

  • The survey reviews state-of-the-art KD techniques, highlighting algorithm innovations, skill enhancements, domain-specific applications, and challenges in the process.

  • KD algorithms focus on transferring knowledge from teacher to student models through methods like Supervised Fine-Tuning, Divergence and Similarity, Reinforcement Learning, and Ranking Optimization.

  • Future research directions include improving data selection strategies, reducing costs, overcoming catastrophic forgetting, and ensuring the trustworthiness of distilled models.

Bridging the Divide: A Comprehensive Survey on Knowledge Distillation Techniques for LLMs

Overview

The expansion of LLMs such as GPT-4, built on Transformer-based architectures, and their integration into various domains signify groundbreaking advancements in artificial intelligence. However, the proprietary nature of leading LLMs and their enormous computational and financial demands present significant barriers to accessibility. Knowledge Distillation (KD) emerges as a promising methodology to mitigate these challenges, aiming to distill the advanced capabilities of LLMs into more accessible, open-source models. This survey extensively reviews the state-of-the-art in KD techniques for LLMs, emphasizing algorithm innovations, skill enhancements, domain-specific applications, and the challenges encountered in the distillation process.

Knowledge Distillation Algorithms

KD algorithms form the backbone of transferring intricate knowledge from teacher models to student models (a minimal code sketch combining the first two approaches follows this list). This includes leveraging:

  • Supervised Fine-Tuning (SFT) approaches for direct learning from teacher-generated outputs.
  • Divergence and Similarity methods to minimize distributional differences or maximize feature similarities between the teacher and student models.
  • Advanced techniques like Reinforcement Learning (RL) for introducing teacher feedback into the learning process.
  • Ranking Optimization strategies to directly implement preference feedback from the teacher, enhancing the decision-making capabilities of student models.
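
To make the first two approaches concrete, the following is a minimal sketch, assuming the teacher and student share a tokenizer and vocabulary (as in white-box distillation) and using placeholder checkpoint names and hyperparameters. It combines a supervised fine-tuning loss on a teacher-written response with a temperature-softened KL-divergence term over token distributions; it is illustrative rather than a specific method from the survey.

```python
# Illustrative sketch of two common KD objectives for LLMs:
# (1) supervised fine-tuning (SFT) on teacher-generated outputs, and
# (2) a KL-divergence term aligning student and teacher token distributions.
# "teacher-llm" / "student-llm", alpha, and T are placeholders, not recommendations.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("teacher-llm")   # placeholder checkpoint
student = AutoModelForCausalLM.from_pretrained("student-llm")   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained("student-llm")        # assumed shared vocabulary

def distillation_loss(prompt_and_teacher_response: str, alpha: float = 0.5, T: float = 2.0):
    """Combine hard-label SFT loss with a soft-label KL-divergence term."""
    batch = tokenizer(prompt_and_teacher_response, return_tensors="pt")
    labels = batch["input_ids"].clone()

    # (1) SFT: next-token cross-entropy against the teacher-written sequence.
    student_out = student(**batch, labels=labels)
    sft_loss = student_out.loss

    # (2) Divergence: KL between temperature-softened teacher and student distributions.
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    kl_loss = F.kl_div(
        F.log_softmax(student_out.logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    return alpha * sft_loss + (1 - alpha) * kl_loss
```

When only teacher-generated text is available (black-box distillation from a proprietary API), the KL term is dropped and the SFT loss alone is used.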

Skill Enhancement

The distillation process seeks not merely to replicate teacher model outputs but to instill distinct cognitive abilities or "skills" in student models. This encompasses the distillation of context-following capacities for understanding and responding to complex instructions, multi-turn dialogues incorporating coherent conversation handling, and retrieval-augmented capabilities that utilize external information for informed response generation.
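
A common recipe behind many of these skill-distillation efforts is data augmentation: eliciting skill-specific training data from a proprietary teacher and fine-tuning the student on it. The sketch below is illustrative only; the seed instructions, system prompt, and teacher model name are hypothetical, and harvesting teacher outputs for training is subject to the provider's terms of use, as the survey emphasizes.

```python
# Illustrative sketch: eliciting skill-specific (instruction, response) pairs from a
# proprietary teacher model, to be used later as SFT data for an open-source student.
# Seed instructions, the system prompt, and the teacher model name are hypothetical.

import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

seed_instructions = [
    "Summarize the key ideas of knowledge distillation in two sentences.",
    "Rewrite the following paragraph for a ten-year-old reader: ...",
]

def build_distillation_pairs(instructions, teacher_model="gpt-4"):
    pairs = []
    for instruction in instructions:
        resp = client.chat.completions.create(
            model=teacher_model,
            messages=[
                {"role": "system", "content": "Follow the instruction carefully and answer concisely."},
                {"role": "user", "content": instruction},
            ],
        )
        pairs.append({"instruction": instruction,
                      "response": resp.choices[0].message.content})
    return pairs

if __name__ == "__main__":
    # Save as JSONL for a later student fine-tuning run.
    with open("distillation_pairs.jsonl", "w") as f:
        for pair in build_distillation_pairs(seed_instructions):
            f.write(json.dumps(pair) + "\n")
```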

Domain-Specific Vertical Distillation

KD also extends to customizing LLMs for vertical domains such as law, medicine and healthcare, finance, and science. Through careful curation of training data and domain-specific tuning, distilled models can achieve high performance in specialized fields, demonstrating the versatility and adaptability of KD techniques.

Challenges and Future Directions

Despite significant progress, KD of LLMs faces hurdles including:

  • Determining optimal data selection strategies to ensure the quality and relevance of distillation data.
  • Reducing distillation costs through model compression and efficient fine-tuning techniques (see the sketch after this list).
  • Overcoming catastrophic forgetting to retain the model's previously acquired knowledge during distillation.
  • Ensuring the trustworthiness of distilled models, fostering models that are truthful, safe, and ethical.
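
On the cost front, one widely used option is parameter-efficient fine-tuning, which trains only a small set of adapter weights on the distillation data. The sketch below uses LoRA via the peft library; the checkpoint name, target modules, and hyperparameters are placeholders rather than recommendations.

```python
# Illustrative sketch: reducing distillation cost with parameter-efficient fine-tuning (LoRA),
# so only a small set of adapter weights is trained on teacher-generated data.

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

student = AutoModelForCausalLM.from_pretrained("student-llm")  # placeholder checkpoint

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # low-rank dimension of the adapter matrices
    lora_alpha=16,        # scaling factor for the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-dependent)
)

student = get_peft_model(student, lora_config)
student.print_trainable_parameters()  # typically a small fraction of total parameters

# `student` can now be fine-tuned on distillation data (e.g., the pairs generated above)
# with a standard training loop, updating only the LoRA adapters.
```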

The survey underscores the necessity for innovative KD methodologies that address these challenges while striving for advancements that further democratize access to the remarkable capabilities of LLMs.

Conclusion

KD of LLMs stands as a vibrant and evolving research avenue with the potential to democratize advanced AI capabilities. By enhancing algorithmic techniques, honing specific skills, tailoring models to vertical applications, and addressing emergent challenges, the research community is poised to unlock new potentials in AI accessibility and performance. As KD techniques continue to mature, they promise a future where cutting-edge AI technologies are within reach for diverse users and applications, paving the way toward a more inclusive and equitable technological landscape.
