
An Overview of Neural Network Compression

(2006.03669)
Published Jun 5, 2020 in cs.LG and stat.ML

Abstract

Overparameterized networks trained to convergence have shown impressive performance in domains such as computer vision and natural language processing. Pushing the state of the art on salient tasks within these domains corresponds to these models becoming larger and more difficult for machine learning practitioners to use, given the increasing memory and storage requirements, not to mention the larger carbon footprint. Thus, in recent years there has been a resurgence in model compression techniques, particularly for deep convolutional neural networks and self-attention based networks such as the Transformer. Hence, this paper provides a timely overview of both old and current compression techniques for deep neural networks, including pruning, quantization, tensor decomposition, knowledge distillation and combinations thereof. We assume a basic familiarity with deep learning architectures (for an introduction, see Goodfellow et al., 2016), namely Recurrent Neural Networks (RNNs; Rumelhart et al., 1985; Hochreiter & Schmidhuber, 1997), Convolutional Neural Networks (Fukushima, 1980; for an up-to-date overview see Khan et al., 2019) and self-attention based networks (Vaswani et al., 2017; for a general overview of attention models see Chaudhari et al., 2019, and for their use in natural language processing see Hu, 2019). Most of the papers discussed are proposed in the context of at least one of these DNN architectures.

Overview

  • The paper explores dynamic pruning methods, inspired by the Lottery Ticket Hypothesis, to optimize the efficiency of neural networks before and during training.

  • It examines various techniques such as SNIP, Magnitude-based Pruning, Deep Rewiring, Sparse Evolutionary Training, Dynamic Sparse Reparameterization, and SparseMomentum for dynamic compression.

  • The paper critiques common beliefs about pruning, suggesting that the architecture rather than the retained weights significantly impacts performance, and proposes viewing pruning as a method for architecture search.

Dynamic Compression: Sparsifying Neural Networks Efficiently

In deep learning, the challenge of optimizing the efficiency and deployment of large neural networks is ever-present. This paper explores the intriguing realm of dynamic pruning, an approach initially inspired by the Lottery Ticket Hypothesis (LTH). Let's break down the innovations and practical implications it brings to neural network pruning.

Dynamic Compression Before Training

A key takeaway from the paper is its treatment of prune-at-initialization methods, particularly those inspired by the Lottery Ticket Hypothesis (LTH). LTH suggests that within a large, trained network there exist smaller subnetworks, or "winning tickets", that can match the performance of the full network when trained in isolation from their original initialization.
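
To make the procedure concrete, here is a minimal NumPy sketch of the iterative magnitude pruning loop commonly used to find such tickets: train the current subnetwork, prune a fraction of the smallest surviving weights, and rewind the survivors to their original initialization before the next round. The `train_fn` callback and the parameter values are placeholders for illustration, not code from the paper.

```python
import numpy as np

def prune_smallest(weights, mask, fraction):
    """Zero out the smallest-|w| `fraction` of the currently surviving weights."""
    surviving = np.abs(weights[mask == 1])
    k = int(round(fraction * surviving.size))
    if k == 0:
        return mask
    threshold = np.sort(surviving)[k - 1]
    new_mask = mask.copy()
    new_mask[(np.abs(weights) <= threshold) & (mask == 1)] = 0.0
    return new_mask

def find_winning_ticket(init_weights, train_fn, rounds=3, prune_fraction=0.4):
    """Lottery-ticket style iterative magnitude pruning (simplified sketch).

    train_fn(weights) -> trained weights is a user-supplied training loop.
    The returned mask, applied to init_weights, gives the candidate ticket.
    """
    mask = np.ones_like(init_weights)
    for _ in range(rounds):
        trained = train_fn(init_weights * mask)               # train the current subnetwork
        mask = prune_smallest(trained, mask, prune_fraction)  # drop the smallest survivors
    return mask                                               # rewind: reuse init_weights * mask
```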

  • SNIP: Introduced by Lee et al., this approach scores the importance of individual connections at initialization on a single mini-batch. It creates a sparse mask based on this evaluation, ensuring that only the most critical connections are trained (a sketch of the idea follows).
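
As a rough illustration, the core of SNIP can be captured in a few lines: score every weight by |w · ∂L/∂w| on one mini-batch at initialization, then keep only the top fraction globally. The gradients are assumed to come from whatever framework you train with; this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def snip_mask(weights, grads, keep_ratio=0.1):
    """SNIP-style connection sensitivity (sketch): saliency = |w * dL/dw|
    computed on one mini-batch at initialization; keep the top `keep_ratio`."""
    scores = np.abs(weights * grads).ravel()
    scores = scores / scores.sum()                  # normalized saliencies
    k = max(1, int(round(keep_ratio * scores.size)))
    threshold = np.sort(scores)[-k]
    return (scores >= threshold).reshape(weights.shape).astype(weights.dtype)
```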

Dynamic Compression During Training

When it comes to training-time pruning, it's fascinating to see how various techniques dynamically adapt the network structure.

  • Magnitude-based Pruning: Zhu and Gupta suggest gradually increasing the target sparsity throughout training according to a fixed schedule, recalculating which connections to prune at each step (a sketch of such a schedule follows this list).
  • Deep Rewiring (DeepR): Bellec et al.'s method adapts by periodically pruning and regrowing the network connections, which can be computationally intensive but allows for high flexibility.
  • Sparse Evolutionary Training (SET): Mocanu et al.'s approach uses simple heuristics, pruning the smallest-magnitude weights and regrowing connections at random.
  • Dynamic Sparse Reparameterization (DSR): Mostafa et al. introduce a method that redistributes sparsity levels among layers based on loss gradients. It's a bit like reallocating resources to where they're needed most.
  • SparseMomentum: This approach brings a twist by utilizing the momentum magnitudes of layers to guide the prune-redistribute-regrow cycle.
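
For a concrete sense of how training-time sparsification is typically scheduled, here is a small sketch of a cubic sparsity ramp in the spirit of Zhu and Gupta's gradual magnitude pruning. The step bounds and target sparsity below are illustrative defaults, not values from the paper; at each scheduled step, the magnitude mask is recomputed to match the current target.

```python
def gradual_sparsity(step, start_step, end_step,
                     initial_sparsity=0.0, final_sparsity=0.9):
    """Cubic sparsity schedule (sketch): ramp from initial_sparsity to
    final_sparsity between start_step and end_step, then hold constant."""
    if step < start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3
```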

Dynamic Model Pruning

One standout method explored in the paper is Dynamic Pruning proposed by Lin et al.

Dynamic Pruning

Lin et al.'s technique addresses the need for efficient model compression that doesn't come with significant overhead. It dynamically allocates sparsity, and interestingly, weights that might have been pruned prematurely can be reactivated if they prove to be important later on. Their method showed state-of-the-art performance on datasets like CIFAR-10 and ImageNet, often surpassing other pruning approaches.

Key elements of their strategy include:

  • Stochastic Gradient: The gradient is evaluated on the pruned model but used to update a simultaneously maintained dense copy, allowing the network to recover from premature pruning decisions.
  • Error Compensation: By considering the error introduced by pruning, their method can adjust weights more effectively.

This approach not only boosts performance but also simplifies the often cumbersome process of retraining pruned models.
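
To illustrate the mechanism, here is a minimal sketch of one such training step: the mask is recomputed from the dense weights every step, the stochastic gradient is evaluated at the pruned point, and the update is applied to the dense copy so that prematurely pruned weights keep learning and can re-enter the mask later. The `grad_fn` callback and hyperparameters stand in for a real forward/backward pass and are assumptions of this sketch, not the authors' API.

```python
import numpy as np

def topk_mask(weights, keep_ratio):
    """Binary mask keeping the largest-magnitude `keep_ratio` fraction of weights."""
    k = max(1, int(round(keep_ratio * weights.size)))
    threshold = np.sort(np.abs(weights).ravel())[-k]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

def dynamic_pruning_step(dense_weights, grad_fn, lr=0.1, keep_ratio=0.1):
    """One step of dynamic pruning with feedback (sketch): evaluate the
    gradient on the pruned weights, apply the update to the dense weights."""
    mask = topk_mask(dense_weights, keep_ratio)    # recomputed every step
    sparse_weights = dense_weights * mask          # model used in the forward pass
    grad = grad_fn(sparse_weights)                 # stochastic gradient at the pruned point
    dense_weights = dense_weights - lr * grad      # update the maintained dense copy
    return dense_weights, mask
```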

Rethinking the Value of Network Pruning

The paper also provides a thought-provoking critique of common beliefs about pruning. Here are some interesting observations:

  • Training From Scratch: For many state-of-the-art structured pruning algorithms, fine-tuning a pruned model does not outperform training a similarly sized model from scratch with random initialization.
  • Architecture Wins: The pruned architecture itself seems to be the major contributor to performance, not the retained "important" weights from the larger model. This suggests pruning could be used as a form of architecture search, which is a shift in how we think about the purpose of pruning.

These insights hint that we might sometimes overestimate the necessity of starting with a large model when a smaller, well-architected one could be just as effective.

Implications and Future Directions

This paper underscores the dynamic nature of pruning and offers multiple strategies to enhance it. Here's what it means for practical applications and future research:

Practical Applications:

  • Efficient Deployment: These methods can make deploying deep neural networks on resource-constrained devices much more feasible.
  • Training Efficiency: Dynamic compression can save computational resources during training by focusing on the most critical parts of the network.

Theoretical Implications:

  • Architecture Search: Seeing pruning as a method for architecture search could open new avenues for automated and efficient neural network design.

Future Speculations:

  • More Adaptable Methods: Future research might focus on developing even more adaptable pruning mechanisms that can seamlessly adjust to changing network demands and tasks.
  • Hybrid Approaches: Combining dynamic pruning with other techniques like neural architecture search (NAS) could yield more powerful and efficient models.

In essence, this paper encourages rethinking how we approach network pruning and dynamic compression, potentially leading to more efficient and effective deep learning models. Keep an eye on this space—there's plenty more to uncover!
