Online normalizer calculation for softmax (1805.02867v2)

Published 8 May 2018 in cs.PF, cs.AI, and cs.CL

Abstract: The Softmax function is ubiquitous in machine learning, multiple previous works suggested faster alternatives for it. In this paper we propose a way to compute classical Softmax with fewer memory accesses and hypothesize that this reduction in memory accesses should improve Softmax performance on actual hardware. The benchmarks confirm this hypothesis: Softmax accelerates by up to 1.3x and Softmax+TopK combined and fused by up to 5x.

Citations (57)

Summary

  • The paper proposes an algorithm that computes the Softmax normalizer with fewer memory accesses, achieving up to a 1.3x speed-up.
  • It uses a single-pass online method to compute the maximum and the normalizer simultaneously, with numerical stability established by induction.
  • The approach enables further optimizations, such as fusing Softmax with TopK, yielding improvements of up to 5x in real-world applications.

Online Normalizer Calculation for Softmax: A Performance Enhancement Analysis

The paper "Online Normalizer Calculation for Softmax" by Maxim Milakov and Natalia Gimelshein addresses a critical performance bottleneck in neural network LLMs and multinomial logistic regression: the computation of the Softmax function. Despite various alternatives proposed, including Differentiated Softmax and SVD-Softmax, many existing methods still require the execution of the classical Softmax function, often resulting in inefficient computation due to repeated memory accesses.

Key Contributions

The crux of the research is an algorithm that computes the normalizer for the Softmax function with fewer memory accesses. The authors hypothesize that this reduction improves performance on actual hardware, and their benchmarks confirm it. The proposed "Online Softmax" method reduces the memory accesses needed from four to three per vector element, achieving up to a 1.3x speed-up for Softmax alone compared to traditional implementations.
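
To make the memory-access count concrete, the following is a minimal Python sketch of the conventional safe (numerically stable) Softmax; the paper benchmarks CUDA kernels, so this is only illustrative. Each pass over the input corresponds to one memory access per element: two read-only passes plus a final pass that reads and writes, for four accesses per element.

```python
import math

def safe_softmax(x):
    """Conventional numerically stable ("safe") softmax.

    Pass 1 reads x to find the maximum, pass 2 reads x to accumulate the
    normalizer, and pass 3 reads x and writes the output: four memory
    accesses per vector element in total.
    """
    m = max(x)                                    # pass 1: read
    d = sum(math.exp(x_j - m) for x_j in x)       # pass 2: read
    return [math.exp(x_j - m) / d for x_j in x]   # pass 3: read + write
```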

The innovation hinges on an online algorithm that computes both the maximum value and the normalizer in a single pass, reminiscent of existing numerically stable online algorithms for variance calculation. The correctness and numerical stability of the method are established by induction.
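
The update below is a minimal Python sketch of that single-pass recurrence (illustrative only; the paper's implementation is a GPU kernel). Whenever the running maximum grows, the partial normalizer is rescaled by exp(m_old - m_new) <= 1 so that every exponent stays non-positive, which is what preserves numerical stability.

```python
import math

def online_softmax(x):
    """Softmax with the normalizer computed online.

    Pass 1 reads each element once while jointly updating the running
    maximum m and the running normalizer d (kept relative to m).
    Pass 2 reads each element and writes the output: three memory
    accesses per element instead of four.
    """
    m = float("-inf")  # running maximum
    d = 0.0            # running normalizer, relative to the current maximum
    for x_j in x:                                  # pass 1: read
        m_new = max(m, x_j)
        d = d * math.exp(m - m_new) + math.exp(x_j - m_new)
        m = m_new
    return [math.exp(x_j - m) / d for x_j in x]    # pass 2: read + write
```

On any input this agrees with the safe version above up to floating-point rounding, since both ultimately compute exp(x_j - max(x)) divided by the sum of exp(x_i - max(x)).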

Implications of the Research

The implications of this research are multifaceted, impacting both theoretical exploration and practical applications:

  • Theoretical Advancements: The paper contributes an efficient algorithmic solution that maintains numerical stability, making it suitable for deep learning frameworks that prioritize accuracy.
  • Practical Implementations: The reduction in memory accesses translates into measurable performance gains in high-performance computing environments. Benchmarks on NVIDIA's Tesla V100 show up to a 1.3x acceleration of the Softmax operation alone. For applications in which Softmax is followed by a TopK operation, fusing the two yields up to a 5x improvement (see the sketch after this list), illustrating how eliminating redundant memory traffic pays off in practice.
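
As a rough illustration of why the fusion pays off, a single pass can maintain the running maximum, the online normalizer, and a small set of top-k candidates at the same time, so the full Softmax vector is never written to memory. This is a sketch of the idea only: the paper's benchmark uses a fused GPU kernel, and the heap-based selection here is an assumption made for clarity, not the paper's implementation.

```python
import heapq
import math

def fused_softmax_topk(x, k):
    """Sketch of fusing Softmax with TopK in one pass over the input."""
    m = float("-inf")   # running maximum
    d = 0.0             # running normalizer, relative to m
    heap = []           # min-heap of (logit, index): current top-k candidates
    for j, x_j in enumerate(x):
        m_new = max(m, x_j)
        d = d * math.exp(m - m_new) + math.exp(x_j - m_new)
        m = m_new
        if len(heap) < k:
            heapq.heappush(heap, (x_j, j))
        elif x_j > heap[0][0]:
            heapq.heapreplace(heap, (x_j, j))
    # Softmax is monotone in the logits, so the k largest logits are also the
    # k largest probabilities; only those k values are normalized and returned.
    return sorted(((j, math.exp(x_j - m) / d) for x_j, j in heap),
                  key=lambda item: item[1], reverse=True)
```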

Future Directions

The potential applications of this method extend beyond standard Softmax. The approach is orthogonal to other optimization techniques such as Hierarchical Softmax or SVD-Softmax, suggesting that the techniques can be combined. Moreover, while this work focuses on GPU benchmarks, exploring performance on other architectures, such as vectorized CPU implementations, remains an open avenue.

The paper also hints at further optimization possibilities by fusing Softmax with preceding computational layers, which could eliminate memory round-trips entirely. However, such optimizations would require overcoming challenges associated with deep pipeline integration.

Conclusion

Milakov and Gimelshein's work exemplifies a strategic optimization of deep learning components by addressing a fundamental performance constraint in Softmax computations. The results hold promise for improving computational efficiency across various AI applications, emphasizing the relevance of memory access patterns in optimizing neural network performance.
