From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Published 5 Feb 2016 in cs.CL, cs.LG, and stat.ML | (1602.02068v2)

Abstract: We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, we show how its Jacobian can be efficiently computed, enabling its use in a network trained with backpropagation. Then, we propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss. We reveal an unexpected connection between this new loss and the Huber classification loss. We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve a similar performance as the traditional softmax, but with a selective, more compact, attention focus.

Authors (2)

Citations (666)

Summary

The paper introduces sparsemax, an alternative to softmax that computes the Euclidean projection of its input onto the probability simplex and can therefore assign exactly zero probability to some outputs.
It shows how to compute the Jacobian of sparsemax efficiently, which makes the function usable in neural networks trained with backpropagation.
Empirical results are promising in multi-label classification; in attention-based natural language inference, sparsemax performs similarly to softmax while producing a more selective, compact attention focus.

Sparsemax: A Sparse Model of Attention and Multi-Label Classification

The paper presents a new activation function, "sparsemax," as an alternative to the conventional softmax function. Unlike softmax, sparsemax can output sparse probabilities, which suits cases where only a subset of the candidate labels or inputs warrants attention. The function is developed primarily for multi-label classification and neural attention mechanisms.
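For reference, the projection that defines sparsemax can be written out as follows. The notation here (input vector z of dimension K, threshold function τ) is chosen for this summary and may differ from the symbols used in the paper.

```latex
% Sparsemax as the Euclidean projection of z onto the probability simplex
\operatorname{sparsemax}(\mathbf{z})
  = \operatorname*{argmin}_{\mathbf{p} \in \Delta^{K-1}} \lVert \mathbf{p} - \mathbf{z} \rVert^2,
\qquad
\Delta^{K-1} = \{ \mathbf{p} \in \mathbb{R}^K : \mathbf{1}^\top \mathbf{p} = 1,\ \mathbf{p} \ge \mathbf{0} \}

% Closed form: shift by a threshold and clip at zero
\operatorname{sparsemax}_i(\mathbf{z}) = [\, z_i - \tau(\mathbf{z}) \,]_+,
\qquad
\text{where } \tau(\mathbf{z}) \text{ is the unique value satisfying } \sum_{j=1}^{K} [\, z_j - \tau(\mathbf{z}) \,]_+ = 1

% Softmax, for contrast: every coordinate is strictly positive
\operatorname{softmax}_i(\mathbf{z}) = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}
```

Any coordinate with z_i at or below the threshold receives exactly zero probability, which is where the sparsity comes from.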

Core Contributions

Sparsemax Definition and Properties: Sparsemax projects real-valued vectors onto a probability simplex, potentially yielding sparse output distributions. The authors derive mathematical properties, illustrating that sparsemax retains many desirable traits of softmax while enabling sparsity.
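Gradient and Differentiability: The authors show that the Jacobian of sparsemax can be computed efficiently, which allows the function to be used with gradient-based training such as backpropagation. Because the Jacobian depends only on the support of the output (the coordinates with nonzero probability), a Jacobian-vector product costs time proportional to the support size rather than to the full output dimension.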
Sparsemax Loss Function: The authors introduce the sparsemax loss, the sparsemax analogue of the logistic loss. The loss is smooth and convex, and the authors show that it is connected to the Huber classification loss. Its gradient, the sparsemax output minus the target distribution, can itself be sparse, and the loss applies to both multi-class and multi-label settings.
Empirical Evaluations: Sparsemax is evaluated on multi-label classification datasets, where the authors report promising results, with gains over softmax reported mainly on datasets with larger label spaces. In attention-based networks for natural language inference, sparsemax attention performs similarly to softmax attention while concentrating on a smaller, more selective set of inputs.
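The first three contributions above can be illustrated with a minimal pure-Python sketch. It uses the sort-based evaluation of the threshold, the support-based Jacobian-vector product, and the single-label form of the sparsemax loss. The function names (`sparsemax_threshold`, `sparsemax`, `sparsemax_jvp`, `sparsemax_loss`) are illustrative choices for this summary, not identifiers from the paper or from any library, and plain lists are used instead of tensors for readability.

```python
import math


def softmax(z):
    """Standard softmax, for comparison; every output is strictly positive."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax_threshold(z):
    """Return (tau, k): the threshold tau and the support size k for input z."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau, k = 0.0, 0
    for j, v in enumerate(z_sorted, start=1):
        cumsum += v
        # The j-th largest value stays in the support while
        # 1 + j * z_(j) exceeds the sum of the j largest values.
        if 1.0 + j * v > cumsum:
            k = j
            tau = (cumsum - 1.0) / j
    return tau, k


def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    tau, _ = sparsemax_threshold(z)
    return [max(v - tau, 0.0) for v in z]


def sparsemax_jvp(p, v):
    """Jacobian-vector product J(z) v, given the output p = sparsemax(z).

    Only the support of p is visited, so the cost scales with the number
    of nonzero outputs rather than with len(p).
    """
    support = [i for i, pi in enumerate(p) if pi > 0.0]
    v_hat = sum(v[i] for i in support) / len(support)
    out = [0.0] * len(p)
    for i in support:
        out[i] = v[i] - v_hat
    return out


def sparsemax_loss(z, k):
    """Sparsemax loss for scores z and a single gold label index k."""
    tau, _ = sparsemax_threshold(z)
    support_term = sum(v * v - tau * tau for v in z if v > tau)
    return -z[k] + 0.5 * support_term + 0.5


scores = [1.0, 0.6, -1.0]
p = sparsemax(scores)                    # approximately [0.7, 0.3, 0.0]
q = softmax(scores)                      # all three entries are positive
jv = sparsemax_jvp(p, [1.0, 2.0, 3.0])   # approximately [-0.5, 0.5, 0.0]
```

In this sketch the third score falls below the threshold, so sparsemax assigns it exactly zero probability while softmax still gives it some mass. With two classes, `sparsemax_loss([2.0, 0.0], 0)` evaluates to zero because the gold score leads by at least 1, which is the margin behaviour behind the link to the Huber classification loss.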

Implications and Future Directions

Sparsemax's ability to provide more interpretable models by producing sparse outputs is particularly significant for applications where model transparency is essential. This function has the potential to be integrated into various architectures requiring selective focus, such as attention mechanisms in memory networks or situations demanding hierarchical attention.

The paper hints that sparsemax is less GPU-friendly than softmax because evaluating it requires a sort operation. Future work might optimize this step to improve computational efficiency.
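One way to sidestep the sort, sketched here purely as an illustration, is to find the threshold by bisection. This is a swapped-in technique and is not taken from the paper. It relies only on the fact that the clipped mass, the sum of max(z_i - tau, 0), is continuous and non-increasing in tau, equals at least 1 at tau = max(z) - 1, and equals 0 at tau = max(z). The name `sparsemax_bisect` and the iteration count are assumptions made for this example, and no claim is made about actual GPU speed.

```python
def sparsemax_bisect(z, n_iter=50):
    """Approximate sparsemax by bisecting on the threshold tau (no sort).

    Hypothetical helper for illustration; the result is accurate up to
    the bisection tolerance rather than exact.
    """
    hi = max(z)        # clipped mass is 0 at this threshold
    lo = hi - 1.0      # clipped mass is at least 1 at this threshold
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = sum(max(v - tau, 0.0) for v in z)
        if mass >= 1.0:
            lo = tau   # threshold too low: too much mass survives
        else:
            hi = tau   # threshold too high: too little mass survives
    tau = 0.5 * (lo + hi)
    return [max(v - tau, 0.0) for v in z]


p = sparsemax_bisect([1.0, 0.6, -1.0])   # approximately [0.7, 0.3, 0.0]
```

Each iteration is a single elementwise pass followed by a sum, so the search interval halves at a fixed cost per step, at the price of an approximate rather than exact threshold.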

Moreover, sparsemax could benefit applications in reinforcement learning and probabilistic modeling where interpretability and model sparsity are sought after, fostering developments in AI models that leverage sparse representation.

Overall, sparsemax represents a meaningful contribution to enhancing the flexibility and interpretability of activation mechanisms in complex machine learning models, offering an interesting alternative that balances the advantages of both soft and hard attention strategies.
