Distilling the Knowledge in a Neural Network

(1503.02531)
Published Mar 9, 2015 in stat.ML, cs.LG, and cs.NE

Abstract

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

Overview

  • The paper details a method called 'distillation' to compress knowledge from an extensive model ensemble into a single, small neural network.

  • Soft targets, derived from a larger model's output probabilities, carry more information per training case than hard labels, helping small networks generalize well from fewer examples.

  • Experiments with MNIST and speech recognition show that distilled models match or exceed the performance of larger models or ensembles.

  • Training specialist models on subsets of classes, paired with distillation, improves performance while reducing computational demands.

  • The approach is scalable, with specialist models suggesting a new, more efficient way to architect 'mixture of experts' models.

Introduction

In "Distilling the Knowledge in a Neural Network," Hinton, Vinyals, and Dean explore an effective method to compress the knowledge of an extensive ensemble of models into a singular, concise neural network. They propose the utilization of a technique they term "distillation" to achieve this compression. The core idea revolves around transferring the generalization capabilities of cumbersome models, which might comprise of a collection of models or a single extensive regularized model, into a smaller, more deployable network.

Knowledge Distillation

The paper introduces the concept of a "soft target," derived from the output probabilities of the cumbersome model and used to guide the training of the smaller model. Unlike "hard targets," which are the definitive class labels, soft targets encode the full probability distribution over classes produced by the larger model. When the soft targets have high entropy, they convey much more information per training case than hard targets, helping the smaller network generalize well from fewer training examples and allowing a higher learning rate.
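
Concretely, the paper produces these soft targets with a softmax whose "temperature" T can be raised: the logits z_i are converted into probabilities q_i, where T = 1 gives the standard softmax and larger T gives a softer distribution over classes:

$$ q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$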

Employing the "distillation" technique, the authors raise the temperature of the softmax while training the smaller network, so that it matches the suitably softened probability distribution produced by the cumbersome model at the same high temperature; this smoothing provides richer guidance to the small network. Once trained, the small model reverts to a temperature of 1, sharpening its predictions for deployment. This approach proves quite effective, as evidenced by the experiments on the MNIST dataset and on speech recognition models.
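
To make the training objective concrete, here is a minimal PyTorch sketch of a distillation loss, not code from the paper: the function name, the weighting factor `alpha`, and the default temperature are illustrative assumptions. The student matches the teacher's softened distribution at temperature T while also fitting the hard labels, and the soft-target term is scaled by T² so its gradient magnitude stays comparable as T changes, as the paper recommends.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine a soft-target term (teacher at temperature T) with a hard-label term.

    The soft term is multiplied by T**2 so its gradients keep roughly the same
    scale as the hard-label cross-entropy when T is changed (per the paper's
    recommendation). `alpha` is an assumed mixing weight between the two terms.
    """
    # Softened teacher distribution and student log-probabilities at temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)

    # Cross-entropy between the softened distributions (KL divergence up to a constant).
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)

    # Standard cross-entropy against the ground-truth hard labels (temperature 1).
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

At inference time the student simply uses its own logits with temperature 1, as described above.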

Experimental Results

On MNIST, distillation produced impressive results, showing that a well-generalizing small model can be trained without an extensive, representative transfer set. The paper also shows that even when a class is entirely omitted from the transfer set, the distilled model still classifies that class remarkably well. For speech recognition, the experiments demonstrated that distilling an ensemble into a single DNN acoustic model retains most of the ensemble's performance improvement.

Large-Scale Application and Specialized Models

The authors extend their methodology to a very large image dataset, demonstrating that training specialist models focused on easily confused subsets of classes not only reduces overall computational expense but also improves the performance of the full system, particularly when paired with the distillation process. These specialist models can be trained rapidly and in parallel, underscoring the scalability of the approach.
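
The class subsets themselves come from the generalist model's confusions: the paper clusters the covariance matrix of the generalist's predictions (with an online k-means variant) so that classes that are often predicted together land in the same specialist. The sketch below illustrates that grouping step only; it uses scikit-learn's standard KMeans as a stand-in for the paper's clustering procedure, and the function and argument names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def specialist_class_subsets(generalist_probs, n_specialists=10, seed=0):
    """Group classes that the generalist confuses into specialist subsets.

    generalist_probs: array of shape (num_examples, num_classes) holding the
    generalist model's predicted probabilities on a held-out set.
    Returns a list of arrays, each containing the class indices assigned to
    one specialist.
    """
    # Covariance of the predictions: classes that tend to be predicted together
    # have strongly correlated rows/columns in this (num_classes, num_classes) matrix.
    cov = np.cov(generalist_probs, rowvar=False)

    # Cluster the rows of the (symmetric) covariance matrix; each row describes
    # how one class co-varies with every other class.
    km = KMeans(n_clusters=n_specialists, random_state=seed).fit(cov)

    return [np.where(km.labels_ == k)[0] for k in range(n_specialists)]
```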

The paper concludes by relating specialist models to the rationale behind "mixture of experts" models, while noting significant advantages: specialists can be trained in parallel, and selecting which specialists to consult at inference time is straightforward. The authors advocate training specialists with both soft and hard targets to prevent overfitting, an important consideration given each specialist's smaller effective training set.

In summary, this paper establishes the distillation technique as a powerful approach for transferring knowledge from cumbersome models to smaller models, maintaining performance while reducing deployment complexity and computational costs. The results presented promise significant improvements for deploying complex machine learning models across various domains, from image recognition to speech processing.
