
A Survey on Knowledge Distillation of Large Language Models

(2402.13116)
Published Feb 20, 2024 in cs.CL

Abstract

In the era of LLMs, Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models and facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD's role within the realm of LLMs, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization -- providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts. This work aims to provide an insightful guide for researchers and practitioners, offering a detailed overview of current methodologies in KD and proposing future research directions. Importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated GitHub repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs.

Figure: A pipeline distilling knowledge from a large teacher LLM into a smaller student model.

Overview

  • Knowledge Distillation (KD) presents a way to mitigate the accessibility barriers of proprietary LLMs by distilling their capabilities into smaller, more accessible open-source models.

  • The survey reviews state-of-the-art KD techniques, highlighting algorithm innovations, skill enhancements, domain-specific applications, and challenges in the process.

  • KD algorithms focus on transferring knowledge from teacher to student models through methods like Supervised Fine-Tuning, Divergence and Similarity, Reinforcement Learning, and Ranking Optimization.

  • Future research directions include improving data selection strategies, reducing costs, overcoming catastrophic forgetting, and ensuring the trustworthiness of distilled models.

Bridging the Divide: A Comprehensive Survey on Knowledge Distillation Techniques for LLMs

Overview

The expansion of LLMs such as GPT-4, built on Transformer-based architectures, and their integration into various domains signify groundbreaking advancements in artificial intelligence. However, the proprietary nature of leading LLMs and their enormous computational and financial demands present significant barriers to accessibility. Knowledge Distillation (KD) emerges as a promising methodology to mitigate these challenges, aiming to distill the advanced capabilities of LLMs into more accessible, open-source models. This survey extensively reviews the state-of-the-art in KD techniques for LLMs, emphasizing algorithm innovations, skill enhancements, domain-specific applications, and the challenges encountered in the distillation process.

Knowledge Distillation Algorithms

KD algorithms form the backbone of transferring intricate knowledge from teacher models to student models (a minimal code sketch combining the first two approaches follows this list). This includes leveraging:

  • Supervised Fine-Tuning (SFT) approaches for direct learning from teacher-generated outputs.
  • Divergence and Similarity methods to minimize distributional differences or maximize feature similarities between the teacher and student models.
  • Advanced techniques like Reinforcement Learning (RL) for introducing teacher feedback into the learning process.
  • Ranking Optimization strategies to directly implement preference feedback from the teacher, enhancing the decision-making capabilities of student models.
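
To make the first two approaches concrete, the following is a minimal sketch, assuming the teacher and student share a tokenizer and vocabulary (as in white-box distillation) and using placeholder checkpoint names and hyperparameters. It combines a supervised fine-tuning loss on a teacher-written response with a temperature-softened KL-divergence term over token distributions; it is illustrative rather than a specific method from the survey.

```python
# Illustrative sketch of two common KD objectives for LLMs:
# (1) supervised fine-tuning (SFT) on teacher-generated outputs, and
# (2) a KL-divergence term aligning student and teacher token distributions.
# "teacher-llm" / "student-llm", alpha, and T are placeholders, not recommendations.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("teacher-llm")   # placeholder checkpoint
student = AutoModelForCausalLM.from_pretrained("student-llm")   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained("student-llm")        # assumed shared vocabulary

def distillation_loss(prompt_and_teacher_response: str, alpha: float = 0.5, T: float = 2.0):
    """Combine hard-label SFT loss with a soft-label KL-divergence term."""
    batch = tokenizer(prompt_and_teacher_response, return_tensors="pt")
    labels = batch["input_ids"].clone()

    # (1) SFT: next-token cross-entropy against the teacher-written sequence.
    student_out = student(**batch, labels=labels)
    sft_loss = student_out.loss

    # (2) Divergence: KL between temperature-softened teacher and student distributions.
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    kl_loss = F.kl_div(
        F.log_softmax(student_out.logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    return alpha * sft_loss + (1 - alpha) * kl_loss
```

When only teacher-generated text is available (black-box distillation from a proprietary API), the KL term is dropped and the SFT loss alone is used.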

Skill Enhancement

The distillation process seeks not merely to replicate teacher model outputs but to instill distinct cognitive abilities or "skills" in student models. This encompasses the distillation of context-following capacities for understanding and responding to complex instructions, multi-turn dialogues incorporating coherent conversation handling, and retrieval-augmented capabilities that utilize external information for informed response generation.
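
A common recipe behind many of these skill-distillation efforts is data augmentation: eliciting skill-specific training data from a proprietary teacher and fine-tuning the student on it. The sketch below is illustrative only; the seed instructions, system prompt, and teacher model name are hypothetical, and harvesting teacher outputs for training is subject to the provider's terms of use, as the survey emphasizes.

```python
# Illustrative sketch: eliciting skill-specific (instruction, response) pairs from a
# proprietary teacher model, to be used later as SFT data for an open-source student.
# Seed instructions, the system prompt, and the teacher model name are hypothetical.

import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

seed_instructions = [
    "Summarize the key ideas of knowledge distillation in two sentences.",
    "Rewrite the following paragraph for a ten-year-old reader: ...",
]

def build_distillation_pairs(instructions, teacher_model="gpt-4"):
    pairs = []
    for instruction in instructions:
        resp = client.chat.completions.create(
            model=teacher_model,
            messages=[
                {"role": "system", "content": "Follow the instruction carefully and answer concisely."},
                {"role": "user", "content": instruction},
            ],
        )
        pairs.append({"instruction": instruction,
                      "response": resp.choices[0].message.content})
    return pairs

if __name__ == "__main__":
    # Save as JSONL for a later student fine-tuning run.
    with open("distillation_pairs.jsonl", "w") as f:
        for pair in build_distillation_pairs(seed_instructions):
            f.write(json.dumps(pair) + "\n")
```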

Domain-Specific Vertical Distillation

KD also extends to customizing LLMs for vertical domains such as law, medicine and healthcare, finance, and science. Through careful curation of training data and domain-specific tuning, distilled models can achieve high performance in specialized fields, demonstrating the versatility and adaptability of KD techniques.

Challenges and Future Directions

Despite significant progress, KD of LLMs faces hurdles including:

  • Determining optimal data selection strategies to ensure the quality and relevance of distillation data.
  • Reducing distillation costs through model compression and efficient fine-tuning techniques (see the sketch after this list).
  • Overcoming catastrophic forgetting to retain the model's previously acquired knowledge during distillation.
  • Ensuring the trustworthiness of distilled models, fostering models that are truthful, safe, and ethical.
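
On the cost front, one widely used option is parameter-efficient fine-tuning, which trains only a small set of adapter weights on the distillation data. The sketch below uses LoRA via the peft library; the checkpoint name, target modules, and hyperparameters are placeholders rather than recommendations.

```python
# Illustrative sketch: reducing distillation cost with parameter-efficient fine-tuning (LoRA),
# so only a small set of adapter weights is trained on teacher-generated data.

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

student = AutoModelForCausalLM.from_pretrained("student-llm")  # placeholder checkpoint

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # low-rank dimension of the adapter matrices
    lora_alpha=16,        # scaling factor for the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-dependent)
)

student = get_peft_model(student, lora_config)
student.print_trainable_parameters()  # typically a small fraction of total parameters

# `student` can now be fine-tuned on distillation data (e.g., the pairs generated above)
# with a standard training loop, updating only the LoRA adapters.
```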

The survey underscores the necessity for innovative KD methodologies that address these challenges while striving for advancements that further democratize access to the remarkable capabilities of LLMs.

Conclusion

KD of LLMs stands as a vibrant and evolving research avenue with the potential to democratize advanced AI capabilities. By enhancing algorithmic techniques, honing specific skills, tailoring models to vertical applications, and addressing emergent challenges, the research community is poised to unlock new potentials in AI accessibility and performance. As KD techniques continue to mature, they promise a future where cutting-edge AI technologies are within reach for diverse users and applications, paving the way toward a more inclusive and equitable technological landscape.
