Knowledge Distillation from Internal Representations

Published 8 Oct 2019 in cs.CL | (1910.03723v2)

Abstract: Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.

Citations (169)

Summary

The paper proposes distilling internal representations, specifically self-attention matrices via KL-Divergence and [CLS] token hidden states via Cosine Similarity, to improve knowledge transfer.
Experiments on GLUE demonstrate that internal distillation significantly improves student model performance while achieving a 50% parameter reduction compared to BERT_base.
This method offers significant implications for deploying performant yet parameter-reduced models in resource-constrained environments and preserving generalization capabilities.

An Analytical Overview of "Knowledge Distillation from Internal Representations"

The paper "Knowledge Distillation from Internal Representations" presents an approach to improving knowledge distillation (KD) by distilling the internal representations of transformer-based models. It builds on the standard formulation of KD, which compresses a large, cumbersome model (the teacher) into a smaller model (the student) by training the student on the teacher's output probabilities, used as soft labels. The authors address an inherent limitation of this setup: a student can match the teacher's soft labels closely while its internal representations differ considerably from the teacher's, which can adversely affect generalization.
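For reference, the soft-label objective described above can be sketched as a cross-entropy between the teacher's and the student's output distributions. This is a minimal illustration, not the paper's implementation: the function names and the `temperature` parameter (a common choice in the KD literature) are assumptions introduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into probabilities (temperature is an assumed knob)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's soft labels."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
```

The loss is smallest when the student's output distribution equals the teacher's, which is exactly why it says nothing about whether the two models agree internally.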

Methodology and Approach

The authors propose a novel technique whereby the internal representations from a large model such as BERT are distilled into a simplified version, capturing more abstract, linguistic properties encoded within the teacher model. This is achieved through the introduction of two key components:

KL-Divergence Loss: Applied across self-attention matrices, this loss function minimizes the divergence between predicted self-attention probabilities of the teacher and student across all attention heads. It captures critical linguistic knowledge inherent within the probability distributions.
Cosine Similarity Loss: Evaluated using the [CLS] token's hidden vector representation, this loss ensures that the context representations moving through the network layers in the student model are consistent with those of the teacher, further aiding the learning of complex abstractions.
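The two internal losses above can be sketched in plain Python as follows. This is a hedged illustration under stated assumptions: the KL direction (teacher relative to student), the averaging over heads and query rows, the `eps` smoothing term, and the `1 - cosine` form of the second loss are choices made here and are not confirmed by this summary.

```python
import math

def kl_attention_loss(teacher_attn, student_attn, eps=1e-12):
    """Mean KL(teacher || student) over all heads and query rows.

    Inputs are nested lists shaped [heads][query][key]; each innermost row
    is assumed to be a probability distribution over keys.
    """
    total, rows = 0.0, 0
    for t_head, s_head in zip(teacher_attn, student_attn):
        for t_row, s_row in zip(t_head, s_head):
            total += sum(
                t * math.log((t + eps) / (s + eps))
                for t, s in zip(t_row, s_row)
                if t > 0  # zero-probability entries contribute nothing
            )
            rows += 1
    return total / rows

def cosine_loss(teacher_cls, student_cls):
    """1 - cosine similarity between teacher and student [CLS] vectors."""
    dot = sum(t * s for t, s in zip(teacher_cls, student_cls))
    norm_t = math.sqrt(sum(t * t for t in teacher_cls))
    norm_s = math.sqrt(sum(s * s for s in student_cls))
    return 1.0 - dot / (norm_t * norm_s)
```

Both losses reach zero when the student reproduces the teacher's attention distributions and [CLS] direction, so they can be added to one another per layer.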

Beyond simply distilling knowledge at the classification level, the authors explore the efficacy of distilling internal representations progressively through multiple strategies, including:

Progressive Internal Distillation (PID): Layer-wise learning starts from the bottom layers of the teacher model and moves upwards, distilling one layer at a time until the model reaches the classification layer.
Stacked Internal Distillation (SID): Builds on the progressive approach by cumulatively stacking loss terms from all past layers during training, further solidifying knowledge acquisition and compression.
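The difference between the two schedules can be sketched as a function that reports which layers contribute an internal loss at a given training step. The fixed `steps_per_layer` budget and the function name are assumptions made for illustration; the paper may use a different criterion for moving to the next layer.

```python
def active_layers(step, steps_per_layer, num_layers, strategy):
    """Return the student layer indices whose internal loss is active at `step`.

    An empty list means the internal phase is over and only the
    classification-level loss remains.
    """
    current = step // steps_per_layer  # layer currently being distilled
    if current >= num_layers:
        return []
    if strategy == "PID":
        return [current]  # one layer at a time
    if strategy == "SID":
        return list(range(current + 1))  # current layer plus all earlier ones
    raise ValueError(f"unknown strategy: {strategy}")
```

With six student layers and ten steps per layer, step 25 falls in the third layer's window: PID optimizes that layer alone, while SID keeps the loss terms of the first two layers as well.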

Results and Implications

Experiments on GLUE benchmark datasets show that internal distillation improves student performance over traditional soft-label KD. In particular, BERT_6 trained with internal KD outperformed BERT_6 trained with soft labels alone across multiple tasks, while using about 50% fewer parameters than BERT_base. The internally distilled models also generalized robustly, learning complex linguistic abstractions despite having fewer layers, and the improvements are reported as statistically significant.

These findings are promising for resource-constrained environments, where deploying models with fewer parameters without sacrificing performance is essential. Furthermore, teaching the student the teacher's internal representations appears to help preserve the teacher's generalization capabilities, a useful insight for transfer learning and for model compression without substantial loss of internal knowledge.

Future Directions

While the paper underscores strong performance gains through internal KD, future work could explore broader applications beyond BERT, potentially extending insights to other architectures like sequence-to-sequence models or specialized domains. Another interesting direction involves exploring hybrid methods combining internal KD with other compression techniques such as quantization or pruning, which might enhance compression efficacy further.

The distillation algorithms introduced here take a layered approach to model compression, using the teacher's internal abstractions rather than only its outputs to guide the student. This has implications both for understanding what compressed models learn and for practical deployment on constrained hardware, and it positions internal representation distillation as a useful complement to soft-label KD.
