Low-Precision Mixed-Computation Models for Inference on Edge

(2312.02210)
Published Dec 3, 2023 in cs.LG and cs.AI

Abstract

This paper presents a mixed-computation neural network processing approach for edge applications that incorporates low-precision (low bit-width) Posit and low-precision fixed-point (FixP) number systems. This mixed-computation approach employs 4-bit Posit (Posit4), which has higher precision around zero, for representing weights with high sensitivity, while it uses 4-bit FixP (FixP4) for representing other weights. A heuristic for analyzing the importance and the quantization error of the weights is presented to assign the proper number system to different weights. Additionally, a gradient approximation for the Posit representation is introduced to improve the quality of weight updates in the backpropagation process. Due to the high energy consumption of fully Posit-based computations, neural network operations are carried out in FixP or Posit/FixP. An efficient hardware implementation of a MAC operation with a Posit first operand and a FixP second operand and accumulator is presented. The efficacy of the proposed low-precision mixed-computation approach is extensively assessed on vision and language models. The results show that, on average, the accuracy of the mixed-computation approach is about 1.5% higher than that of FixP at a cost of 0.19% energy overhead.

Overview

  • Introduces a mixed-computation approach using 4-bit Posit and 4-bit fixed-point numbers to improve DNN inference on edge devices.

  • Allocates Posit and fixed-point representations strategically based on layer sensitivity to quantization errors.

  • Employs a sensitivity analysis algorithm to decide the quantization method for each layer.

  • Develops a custom gradient approximation method for backpropagation suitable for Posit-quantized networks.

  • Demonstrates performance improvements and minimal energy overhead on various machine learning models.

In the world of machine learning, especially on edge devices, computational efficiency is crucial. Traditional methods for accelerating deep neural network (DNN) inference on such devices often involve quantization—the process of mapping continuous or high bit-width numbers down to low bit-width representations. A recent study innovates in this domain by introducing a mixed-computation framework that utilizes the strengths of two distinct numerical systems: low-precision Posit numbers and fixed-point numbers.
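
As a concrete (if simplified) illustration of what such a mapping looks like, the following PyTorch sketch performs a symmetric 4-bit fixed-point quantize-dequantize step; the per-tensor scale and the clamping to the range [-8, 7] are illustrative assumptions rather than the paper's exact FixP4 scheme.

```python
import torch

def fixp4_quantize(x: torch.Tensor) -> torch.Tensor:
    """Minimal symmetric 4-bit fixed-point quantize-dequantize (per-tensor scale)."""
    scale = x.abs().max().clamp_min(1e-8) / 7.0      # largest magnitude maps to +7
    q = torch.clamp(torch.round(x / scale), -8, 7)   # signed 4-bit integer grid
    return q * scale                                 # dequantize back for simulation

w = torch.randn(8)
print(fixp4_quantize(w))
```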

The mixed-computation approach presented in the paper hinges on the strategic allocation of 4-bit Posit (Posit4) and 4-bit fixed-point (FixP4) representations within a neural network. Posit4 is assigned to the weights of layers that are sensitive to quantization errors, where its higher precision around zero can significantly improve model accuracy. In contrast, FixP4 is used for the weights of layers that are less affected by quantization errors, benefiting from the hardware efficiency of the fixed-point system.

Posit numbers hold a notable advantage over traditional fixed-point and floating-point formats thanks to their tapered precision: they are most accurate near zero while still covering a wide dynamic range. Posits also exhibit gradual underflow and overflow, avoiding abrupt jumps to zero or infinity and thus preserving more information than other number systems.
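
To make this "denser near zero, wider range" property concrete, the sketch below decodes every 4-bit posit bit pattern into its real value; es = 1 (one exponent bit) is an assumption chosen for illustration, since the summary does not restate the paper's exact posit configuration.

```python
def decode_posit(bits: int, n: int = 4, es: int = 1) -> float:
    """Decode an n-bit posit (two's-complement encoding) into a float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")                    # NaR ("not a real")
    sign = -1.0 if bits & (1 << (n - 1)) else 1.0
    if sign < 0:
        bits = (-bits) & mask                  # two's complement for negatives
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]
    # Regime: run of identical bits, terminated by the opposite bit (or word end).
    r0, m = body[0], 1
    while m < len(body) and body[m] == r0:
        m += 1
    k = (m - 1) if r0 == 1 else -m
    rest = body[m + 1:]                        # bits after the terminating regime bit
    exp_bits = (rest + [0] * es)[:es]          # exponent, zero-padded if word ran out
    e = int("".join(map(str, exp_bits)), 2) if es else 0
    frac = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(rest[es:]))
    return sign * (1.0 + frac) * 2.0 ** (k * (1 << es) + e)

# Positive Posit4 (es=1) values cluster near zero yet reach 16:
print(sorted(decode_posit(p) for p in range(1, 8)))   # [0.0625, 0.25, 0.5, 1, 2, 4, 16]
```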

To decide which number system each layer's weights should be quantized with, the researchers developed a sensitivity analysis algorithm. It assesses both the quantization error of the weights and their impact on the network's overall output, and uses these measures to guide the allocation of number representations across the network's architecture.
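
A minimal sketch of how such a per-layer assignment could work is shown below. The gradient-magnitude importance proxy, the error-times-importance score, and the fixed Posit4 budget are illustrative assumptions, not the paper's exact heuristic.

```python
import torch

def fixp4_quantize(x: torch.Tensor) -> torch.Tensor:
    """Symmetric 4-bit fixed-point quantize-dequantize (same sketch as above)."""
    scale = x.abs().max().clamp_min(1e-8) / 7.0
    return torch.clamp(torch.round(x / scale), -8, 7) * scale

def sensitivity(weight: torch.Tensor, grad: torch.Tensor) -> float:
    """Hypothetical score: FixP4 quantization error weighted by gradient magnitude."""
    err = (weight - fixp4_quantize(weight)).abs()
    return float((grad.abs() * err).sum())

def assign_formats(layers: dict, posit_budget: float = 0.3) -> dict:
    """Give Posit4 to the most sensitive fraction of layers, FixP4 to the rest."""
    scores = {name: sensitivity(w, g) for name, (w, g) in layers.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_posit = round(posit_budget * len(ranked))
    return {name: ("Posit4" if i < n_posit else "FixP4")
            for i, name in enumerate(ranked)}

# Toy usage: two layers given as (weight, gradient) pairs.
layers = {"conv1": (torch.randn(16, 3), torch.randn(16, 3)),
          "fc":    (torch.randn(10, 16), torch.randn(10, 16))}
print(assign_formats(layers))
```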

The study also introduces a custom gradient approximation method for backpropagation, suited to the Posit quantizer because of its non-uniform quantization levels; this allows weight updates during training to be more accurate. The hardware implementation of the Posit/FixP computations was considered as well, resulting in an efficient design for the multiply-accumulate (MAC) operations that are indispensable for DNN inference.
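
As an illustrative stand-in for that gradient approximation, the sketch below wires a posit-level quantizer into training with a straight-through estimator that passes gradients only inside the representable range; the paper's actual approximation is tailored to the posit's non-uniform levels, so this is a plausible baseline rather than the authors' method.

```python
import torch

class PositQuantSTE(torch.autograd.Function):
    """Round weights to a fixed set of posit-representable levels in the forward
    pass and approximate the gradient in the backward pass (STE-style sketch)."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
        # Snap each weight to the nearest representable posit value.
        idx = torch.argmin((w.unsqueeze(-1) - levels).abs(), dim=-1)
        ctx.save_for_backward(w, levels)
        return levels[idx]

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        w, levels = ctx.saved_tensors
        # Pass the gradient straight through where w falls inside the
        # representable range; the paper's approximation additionally shapes it
        # to reflect the posit's non-uniform step sizes (assumption).
        inside = (w >= levels.min()) & (w <= levels.max())
        return grad_out * inside.to(grad_out.dtype), None

# Toy usage with the Posit4 (es=1) levels from the earlier decoding sketch.
levels = torch.tensor([-16., -4., -2., -1., -0.5, -0.25, -0.0625,
                       0., 0.0625, 0.25, 0.5, 1., 2., 4., 16.])
w = torch.randn(5, requires_grad=True)
PositQuantSTE.apply(w, levels).sum().backward()
print(w.grad)
```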

Evaluation of this mixed-computation method across various vision and language models demonstrated a consistent accuracy improvement over models quantized exclusively with fixed-point. The accuracy gains averaged about 1.5%, with an energy overhead of just 0.19%. Applied to widely used models such as ResNet, VGG, MobileNet, BERT, and GPT, the method showed particularly compelling advantages.

The study shows that this mixed-computation approach adds little energy cost, since the extra consumption of the MAC unit is small in the context of the overall system. The use of Posit numbers, with their enhanced precision near zero and broad dynamic range, alongside the widely used FixP numbers points to a promising path for optimizing machine learning models on edge devices. This is especially relevant as demand for on-device AI grows, with privacy concerns and real-time processing needs driving more intelligent computation to the edge.

In summary, the research presents an innovative method that adeptly balances computational efficiency with model accuracy. As machine learning applications continue to pervade everyday devices and necessitate local, real-time processing, such advancements in low-precision computation models could be pivotal in enabling smarter edge-based AI without straining device resources.
