Magic for the Age of Quantized DNNs

(2403.14999)
Published Mar 22, 2024 in cs.LG, cs.AI, cs.CV, and cs.NE

Abstract

Recently, the number of parameters in DNNs has increased explosively, as exemplified by Large Language Models (LLMs), making inference on small-scale computers more difficult. Model compression technology is therefore essential for integration into products. In this paper, we propose a method of quantization-aware training. We introduce a novel normalization, Layer-Batch Normalization, that is independent of the mini-batch size and does not require any additional computational cost during inference. We then quantize the weights using a scaled round-clip function with weight standardization. We also quantize the activation functions with the same function and apply surrogate gradients to train the model with both quantized weights and quantized activations. We call this method Magic for the Age of Quantized DNNs (MaQD). Experimental results show that our quantization method achieves minimal accuracy degradation.

Figure: Green areas show the normalization ranges of BN, LN, and LBN along the channel, batch, height, and width axes.

Overview

  • Introduces a method for efficient DNN deployment named Magic for the Age of Quantized DNNs (MaQD) utilizing quantization-aware training.

  • Proposes Layer-Batch Normalization (LBN), a novel normalization technique that is independent of the mini-batch size and adds no computational cost at inference.

  • MaQD methodology integrates LBN with weight standardization and surrogate gradients to train quantized DNNs with minimal accuracy loss.

  • Experimental results on various datasets and network architectures show MaQD's ability to maintain high inference accuracy with significant parameter reduction.

Magic for the Age of Quantized DNNs: Introducing Layer-Batch Normalization and MaQD

Introduction to Quantization-Aware Training

Deep Neural Networks (DNNs) have seen a rapid increase in the number of parameters, challenging their deployment on resource-constrained devices. This has made model compression techniques critical for practical applications. This paper introduces a method called Magic for the Age of Quantized DNNs (MaQD), which employs quantization-aware training. The novelty of this approach lies in the introduction of Layer-Batch Normalization (LBN), a normalization technique that is independent of the mini-batch size and demands no additional computation during inference.

Layer-Batch Normalization (LBN)

LBN is proposed as a novel normalization method that combines the benefits of existing normalization techniques while allowing efficient training with small mini-batch sizes. Unlike Batch Normalization (BN) and Layer Normalization (LN), LBN achieves mini-batch size independence, which is vital for reducing computational resource requirements. The experimental results in this paper demonstrate the advantage of LBN over BN and LN in various settings, including different mini-batch sizes; the sketch below illustrates the mini-batch dependence that LBN is designed to avoid.
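
A minimal sketch of the motivation (illustrative NumPy code, not the paper's implementation): Batch Normalization estimates one mean and variance per channel across the mini-batch, so the estimates become noisy when the batch is small, whereas per-sample statistics, as used in Layer Normalization, do not depend on how samples are batched. The exact axes over which LBN pools its statistics are defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=(256, 8, 4, 4))  # (N, C, H, W) activations

def bn_stats(x):
    # BatchNorm: one mean/var per channel, estimated over batch and spatial axes
    return x.mean(axis=(0, 2, 3)), x.var(axis=(0, 2, 3))

def ln_stats(x):
    # LayerNorm: one mean/var per sample, estimated over channel and spatial axes
    return x.mean(axis=(1, 2, 3)), x.var(axis=(1, 2, 3))

for batch_size in (256, 32, 4):
    mean, _ = bn_stats(x[:batch_size])
    # BN's per-channel mean estimates drift as the mini-batch shrinks...
    print(f"BN mean error (B={batch_size:3d}): {np.abs(mean - 1.0).mean():.3f}")

# ...while per-sample statistics are identical regardless of the batch they sit in
m_full, _ = ln_stats(x)
m_small, _ = ln_stats(x[:4])
print("per-sample stats batch-independent:", np.allclose(m_full[:4], m_small))
```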

Magic for the Age of Quantized DNNs (MaQD)

The MaQD methodology realizes quantization-aware training by employing scaled round-clip functions to quantize both weights and activations. By integrating LBN with weight standardization and training with surrogate gradients, MaQD trains quantized DNNs with minimal accuracy loss. This approach addresses the twin challenges of compression efficiency and inference accuracy in quantized deep learning models.
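
The snippet below is a minimal PyTorch sketch of a scaled round-clip quantizer with a straight-through (surrogate) gradient, under the assumption that inputs are clipped to [-1, 1] and mapped to a fixed number of uniform levels; the paper's exact scaling, clipping range, surrogate gradient, and per-channel weight standardization may differ.

```python
import torch

def scaled_round_clip(x, num_levels):
    """Map x to `num_levels` uniform levels in [-1, 1], with a straight-through
    gradient so training can proceed despite the non-differentiable rounding.
    (A sketch; the paper's exact scaling and clipping may differ.)"""
    m = (num_levels - 1) / 2.0            # e.g. 15 levels -> steps of 1/7
    x = torch.clamp(x, -1.0, 1.0)
    q = torch.round(x * m) / m            # snap to the nearest level
    return x + (q - x).detach()           # forward: q, backward: identity (STE)

# Hypothetical usage: quantize standardized weights inside a layer's forward pass
w = torch.randn(16, 8, requires_grad=True)
w_std = (w - w.mean()) / (w.std() + 1e-5)   # simplified (global) weight standardization
w_q = scaled_round_clip(w_std, num_levels=15)
w_q.sum().backward()                         # gradients reach w through the surrogate
print(w.grad.abs().sum() > 0)
```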

Experimental Outcomes

The experimental evaluation of MaQD presents compelling evidence for its effectiveness across different datasets (CIFAR-10, CIFAR-100) and network architectures (VGG, PreActResNet). The results show that MaQD maintains high inference accuracy with significant parameter reduction, highlighting the trade-off between compression efficiency and inference accuracy. Specifically, configurations with a moderate degree of quantization (e.g., $M_{\rm w} = 15$ and $M_{\rm a} = 8$) offer a favorable balance.
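
As a back-of-the-envelope check (assuming $M_{\rm w}$ and $M_{\rm a}$ denote the number of uniform quantization levels for weights and activations, and that each value is stored with the minimum whole number of bits), the implied bit widths and the storage reduction relative to 32-bit floats are:

```python
import math

def bits_needed(levels: int) -> int:
    # storing one of `levels` distinct values needs ceil(log2(levels)) bits
    return math.ceil(math.log2(levels))

M_w, M_a = 15, 8                 # quantization settings highlighted in the results
w_bits, a_bits = bits_needed(M_w), bits_needed(M_a)
print(f"weights: {w_bits}-bit, activations: {a_bits}-bit")
print(f"weight storage vs. FP32: {32 / w_bits:.0f}x smaller")
```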

Implications and Future Directions

The introduction of LBN and the MaQD framework presents several theoretical and practical implications for the field of DNNs. Theoretically, it challenges the traditional dependencies on large mini-batch sizes and complex normalization methods. Practically, it offers a pathway towards the efficient deployment of complex DNNs on resource-constrained devices. Future research directions could explore the application of MaQD to larger datasets, more complex network architectures, and integration with hardware-specific optimizations for further enhanced performance.

Conclusion

This study bridges the gap between the demand for computational efficiency and the need for high accuracy in quantized DNNs. By proposing LBN and embedding it within the MaQD framework for quantization-aware training, it sets a clear marker for future research on model compression and efficient DNN deployment on edge devices. The journey towards 'ubiquitous magic' in DNN deployment takes a promising step forward with the methodologies introduced in this paper.
