Zero-Shot Knowledge Distillation in Deep Networks

Published 20 May 2019 in cs.LG, cs.CV, and stat.ML | (1905.08114v1)

Abstract: Knowledge distillation deals with the problem of training a smaller model (Student) from a high capacity source model (Teacher) so as to retain most of its performance. Existing approaches use either the training data or meta-data extracted from it in order to train the Student. However, accessing the dataset on which the Teacher has been trained may not always be feasible if the dataset is very large or it poses privacy or safety concerns (e.g., bio-metric or medical data). Hence, in this paper, we propose a novel data-free method to train the Student from the Teacher. Without even using any meta-data, we synthesize the Data Impressions from the complex Teacher model and utilize these as surrogates for the original training data samples to transfer its learning to Student via knowledge distillation. We, therefore, dub our method "Zero-Shot Knowledge Distillation" and demonstrate that our framework results in competitive generalization performance as achieved by distillation using the actual training data samples on multiple benchmark datasets.


Authors (5)

Citations (236)


Summary

The paper demonstrates a data-free approach by synthesizing Data Impressions directly from a teacher model to enable knowledge distillation.
It employs Dirichlet modeling to generate surrogate data that accurately mimics softmax output constraints for class probabilities.
Empirical evaluations on datasets such as MNIST and CIFAR-10 show that student models trained only on Data Impressions come close to the accuracy of conventional distillation, despite the absence of real training data.

Zero-Shot Knowledge Distillation in Deep Networks

The paper introduces Zero-Shot Knowledge Distillation (ZSKD), which addresses the problem of training compact student models when no training data is available. The authors propose a framework in which knowledge distillation, a process that traditionally relies on access to the teacher's training data, is performed without any real data samples. The approach rests on synthesizing pseudo data, referred to as Data Impressions (DI), from a trained teacher model.
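To make the distillation step concrete, the following is a minimal sketch of the standard temperature-softened distillation objective, written in plain Python. The function names and the default temperature are illustrative assumptions, not values taken from the paper; in ZSKD the logits would come from evaluating teacher and student on Data Impressions rather than on real samples.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=20.0):
    """Cross-entropy between softened teacher and student distributions.

    The temperature default is a hypothetical choice for illustration.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# A student that reproduces the teacher's logits incurs a lower loss
# than one that ranks the classes in the opposite order.
matched = distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
reversed_order = distillation_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

A higher temperature flattens both distributions, so the student is also trained on the relative probabilities the teacher assigns to non-target classes.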

Core Contributions

The paper introduces and details several key ideas:

Zero-Data Synthesis: The authors replace data-dependent distillation with a data-free approach in which Data Impressions are synthesized directly from the teacher model. They use the teacher's parameters to model the distribution of its softmax outputs and generate surrogate inputs from that distribution.
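Data Impressions (DI): Each Data Impression is an input synthesized so that the teacher produces a chosen softmax vector. The target vectors are sampled from a Dirichlet distribution that models the class probabilities expected from the teacher, and a randomly initialized input is then optimized until the teacher's output matches the sampled target. The concentration parameters of the distribution are derived from class similarities computed from the teacher's final-layer weights, so the synthesized inputs reflect the inter-class structure of the data on which the teacher was trained.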
Dirichlet Modelling: The paper explores the use of Dirichlet distributions for sampling output class probabilities, capitalizing on the naturally occurring constraints of such distributions to ensure that synthesized outputs sum to one and maintain positivity, characteristics intrinsic to softmax outputs.
Empirical Evaluation: A rigorous experimental setup demonstrates the effectiveness of the ZSKD framework, evaluated across various models and datasets such as MNIST, Fashion MNIST, and CIFAR-10. The results indicate that even without direct access to original data, the student models achieve performance levels approaching those of conventional methods utilizing full datasets.
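The synthesis pipeline described in the list above can be sketched end to end on a toy problem. The code below is an illustration under stated assumptions, not the authors' implementation: the teacher is a hypothetical linear-softmax model with a 3x3 weight matrix `W`, the concentration parameters come from min-max normalized cosine similarities plus a small `eps` to keep them positive, and the input is optimized with plain gradient descent using the closed-form gradient available for this toy teacher. A real teacher would be a deep network optimized through automatic differentiation.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_similarity(weights):
    """Cosine similarity between per-class weight vectors of the final layer."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    return [[sum(a * b for a, b in zip(wi, wj)) / (ni * nj)
             for wj, nj in zip(weights, norms)]
            for wi, ni in zip(weights, norms)]

def concentration(similarity_row, scale=1.0, eps=1e-2):
    """Min-max normalize one similarity row into positive Dirichlet parameters."""
    lo, hi = min(similarity_row), max(similarity_row)
    span = (hi - lo) or 1.0
    return [scale * (s - lo) / span + eps for s in similarity_row]

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def teacher_probs(weights, x):
    """Toy teacher: softmax over a single linear layer."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def cross_entropy(target, probs):
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, probs))

def synthesize_impression(weights, target, rng, steps=500, lr=0.5):
    """Optimize a noise input so the toy teacher's output approaches target."""
    x = [rng.gauss(0.0, 1.0) for _ in weights[0]]
    for _ in range(steps):
        probs = teacher_probs(weights, x)
        residual = [p - t for p, t in zip(probs, target)]
        # For a linear-softmax teacher, d(cross-entropy)/dx = W^T (probs - target).
        grad = [sum(weights[k][j] * residual[k] for k in range(len(weights)))
                for j in range(len(x))]
        x = [v - lr * g for v, g in zip(x, grad)]
    return x

rng = random.Random(0)
W = [[1.0, 0.2, 0.0], [0.8, 0.6, 0.1], [-0.5, 0.1, 1.0]]  # hypothetical weights
alpha = concentration(class_similarity(W)[0])   # parameters for class 0
target = sample_dirichlet(alpha, rng)           # sampled softmax vector
impression = synthesize_impression(W, target, rng)
```

Because every Dirichlet sample is non-negative and sums to one, each `target` is a valid softmax vector by construction, which is the property the Dirichlet modelling item relies on. Repeating the sampling and optimization for every class yields the surrogate set on which the student is trained.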

Numerical Results and Implications

ZSKD is shown to be robust: student models substantially outperform earlier data-free baselines and closely trail conventional distillation with the original data. On MNIST, for instance, the student reaches 98.77% accuracy with ZSKD using only generated Data Impressions, compared to 99.25% for distillation with the full training set, a gap of 0.48 percentage points. This narrow gap highlights the viability of data-free learning where training sets are large, proprietary, or confidential.

Theoretical and Practical Implications

The theoretical foundation laid by the paper opens multiple pathways for future research and application. By eliminating the dependency on original training data, ZSKD enables scenarios where data sharing is restricted due to privacy or proprietary constraints. It also suggests broader applications in fields with stringent data circulation policies, such as healthcare and biometrics.

Practically, the method can reduce the logistical overhead of deploying AI systems, notably in resource-constrained environments such as mobile or edge computing, because a compact student can be produced without storing or transferring the original dataset. Future research may focus on refining the synthesis process, improving the Dirichlet sampling and input optimization steps involved, or integrating additional network interpretability approaches to further enrich the quality of the synthesized Data Impressions.

The paper's advancements suggest a promising future for knowledge distillation in constrained data environments, setting a foundation for other researchers to explore optimizations and variations of the zero-data paradigm in AI development.
