Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 71 tok/s
Gemini 2.5 Pro 52 tok/s Pro
GPT-5 Medium 18 tok/s Pro
GPT-5 High 15 tok/s Pro
GPT-4o 101 tok/s Pro
Kimi K2 196 tok/s Pro
GPT OSS 120B 467 tok/s Pro
Claude Sonnet 4 37 tok/s Pro
2000 character limit reached

Rethinking floating point for deep learning (1811.01721v1)

Published 1 Nov 2018 in cs.NA and cs.LG

Abstract: Reducing hardware overhead of neural networks for faster or lower power inference and training is an active area of research. Uniform quantization using integer multiply-add has been thoroughly investigated, which requires learning many quantization parameters, fine-tuning training or other prerequisites. Little effort is made to improve floating point relative to this baseline; it remains energy inefficient, and word size reduction yields drastic loss in needed dynamic range. We improve floating point to be more energy efficient than equivalent bit width integer hardware on a 28 nm ASIC process while retaining accuracy in 8 bits with a novel hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from Gustafson's posit format. With no network retraining, and drop-in replacement of all math and float32 parameters via round-to-nearest-even only, this open-sourced 8-bit log float is within 0.9% top-1 and 0.2% top-5 accuracy of the original float32 ResNet-50 CNN model on ImageNet. Unlike int8 quantization, it is still a general purpose floating point arithmetic, interpretable out-of-the-box. Our 8/38-bit log float multiply-add is synthesized and power profiled at 28 nm at 0.96x the power and 1.12x the area of 8/32-bit integer multiply-add. In 16 bits, our log float multiply-add is 0.59x the power and 0.68x the area of IEEE 754 float16 fused multiply-add, maintaining the same signficand precision and dynamic range, proving useful for training ASICs as well.

Citations (133)
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-Up Questions

We haven't generated follow-up questions for this paper yet.

Authors (1)