
Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers

(2407.07848)
Published Jul 10, 2024 in cs.LG and cs.AI

Abstract

Previous work has demonstrated that MLPs within ReLU Transformers exhibit high levels of sparsity, with many of their activations equal to zero for any given token. We build on that work to more deeply explore how token-level sparsity evolves over the course of training, and how it connects to broader sparsity patterns over the course of a sequence or batch, demonstrating that the different layers within small transformers exhibit distinctly layer-specific patterns on both of these fronts. In particular, we demonstrate that the first and last layer of the network have distinctive and in many ways inverted relationships to sparsity, and explore implications for the structure of feature representations being learned at different depths of the model. We additionally explore the phenomenon of ReLU dimensions "turning off", and show evidence suggesting that "neuron death" is being primarily driven by the dynamics of training, rather than simply occurring randomly or accidentally as a result of outliers.

Figure: fraction of used hidden units per batch for 6-, 10-, and 14-layer models with a hidden dimension of 16384.

Overview

  • The paper explores activation sparsity in ReLU-based Transformer models, uncovering unique sparsity patterns at different layers, with notable differences between the first and final layers.

  • Three sparsity metrics are evaluated: per-token, per-sequence, and per-batch hidden unit use, revealing an anticorrelation between per-token and per-batch sparsity and layer-dependent sparsity dynamics.

  • Neuron death is examined, revealing its dependence on training regimes and its potential for optimization, challenging the view of neuron death as random and suggesting practical avenues for model pruning.


The paper "Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers" by Cody Wild and Jesper Anderson from Google Research explore the intricacies of activation sparsity in the Multi-Layer Perceptrons (MLPs) within ReLU-based Transformer models. Building upon the existing knowledge that ReLU activations inherently induce sparsity, the authors aim to explore how this token-level sparsity evolves during training and how it reflects broader sparsity patterns across sequences and batches.

Key Findings and Numerical Results

The authors identify several crucial insights about how sparsity evolves and behaves differently across various layers of a small Transformer model:

Layer-Specific Sparsity Patterns:

  • The study reveals that different layers of the Transformer model exhibit distinct sparsity behaviors, with the first and final layers having opposing patterns.
  • At convergence, the first layer uses 13.3% of its available hidden units per batch, while the final layer uses 95.6%.

Per-Token and Per-Batch Sparsity:

  • There is a notable anticorrelation between per-token and per-batch sparsity. Layers using the fewest hidden units per token are often those activating the most dimensions over a sequence or batch.
  • The initial layer activates 4.1% of its hidden units per token but 13.3% over a batch, whereas the final layer rises from 3.0% per token to 95.6% per batch (see the sketch below).
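
This anticorrelation is easy to see with a toy example: a layer in which every token reuses the same small set of units is sparse both per token and per batch, whereas a layer in which each token activates a few units scattered across the hidden dimension stays sparse per token yet covers nearly every unit over a batch. A minimal Python sketch on synthetic data (illustrative sizes, not the paper's measurements):

    import numpy as np

    rng = np.random.default_rng(0)
    n_tokens, d_ff, k = 4096, 16384, 64   # k = active units per token (illustrative)

    # Regime A ("first-layer-like"): every token activates the same k units.
    shared = np.zeros((n_tokens, d_ff), dtype=bool)
    shared[:, :k] = True

    # Regime B ("last-layer-like"): each token activates k randomly chosen units.
    scattered = np.zeros((n_tokens, d_ff), dtype=bool)
    for row in scattered:
        row[rng.choice(d_ff, size=k, replace=False)] = True

    for name, active in [("shared", shared), ("scattered", scattered)]:
        per_token = active.mean(axis=-1).mean()   # ~0.4% in both regimes
        per_batch = active.any(axis=0).mean()     # ~0.4% vs. ~100%
        print(f"{name}: per-token {per_token:.1%}, per-batch {per_batch:.1%}")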

Neuron Death Dynamics:

  • Neuron death, in which hidden units fall into a persistently inactive state, varies across layers. A considerable number of hidden units in the first layer are turned off during training, while higher layers gradually turn on more neurons.
  • Interestingly, neuron death occurs only in specific training regimes, suggesting an interaction with the learning dynamics rather than random outliers. Approximately 5% of hidden units remain inactive from initialization onward, implying potential gains from better initialization schemes (a tracking sketch follows this list).
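
One way to track such "dead" units during training is to accumulate, per layer, which hidden units ever produce a non-zero post-ReLU activation over a window of batches; units that never fire count as dead. The Python sketch below is illustrative only, and its names are not taken from the paper:

    import numpy as np

    class DeadUnitTracker:
        """Tracks which hidden units of one MLP layer ever fire over a window of batches."""

        def __init__(self, d_ff: int):
            self.ever_active = np.zeros(d_ff, dtype=bool)

        def update(self, post_relu_acts: np.ndarray) -> None:
            # post_relu_acts: (batch, seq_len, d_ff) activations taken after the ReLU.
            self.ever_active |= (post_relu_acts > 0).any(axis=(0, 1))

        def dead_fraction(self) -> float:
            # Fraction of units that have never fired in the window observed so far.
            return float(1.0 - self.ever_active.mean())

    # Usage: call update() on activations logged every N steps and plot
    # dead_fraction() per layer to see when and where units turn off.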

Methodology

Metrics

Three core sparsity metrics are defined:

  • Per-Token Hidden Unit Use: the fraction of hidden units with a non-zero post-ReLU activation, averaged over tokens.
  • Per-Sequence Hidden Unit Use: the fraction of hidden units with a non-zero activation for at least one token in a sequence.
  • Per-Batch Hidden Unit Use: the fraction of hidden units with a non-zero activation for at least one token in a batch.

Additionally, percentile metrics quantify how frequently each hidden unit is used within a sequence, reported at percentiles such as the 50th and 90th.
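
These quantities are straightforward to compute from a layer's post-ReLU activations. A minimal Python sketch (not the authors' code), assuming `acts` has shape (batch, seq_len, d_ff); the percentile metric follows one plausible reading of the description above:

    import numpy as np

    def sparsity_metrics(acts: np.ndarray):
        """acts: post-ReLU activations of one layer, shape (batch, seq_len, d_ff)."""
        active = acts > 0                                  # which units fired for each token
        per_token = active.mean(axis=-1).mean()            # average fraction of units used per token
        per_seq = active.any(axis=1).mean(axis=-1).mean()  # fraction used by any token in a sequence
        per_batch = active.any(axis=(0, 1)).mean()         # fraction used by any token in the batch
        return per_token, per_seq, per_batch

    def percentile_use(acts: np.ndarray, percentiles=(50, 90)):
        """Per-unit activation frequency within each sequence, summarized at given percentiles."""
        freq = (acts > 0).mean(axis=1)                     # (batch, d_ff): how often each unit fires
        return {p: float(np.percentile(freq, p, axis=-1).mean()) for p in percentiles}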

Model Architecture

The primary experiments are conducted on a 6-layer decoder-only Transformer with an MLP hidden dimension of 32768 and ReLU activations in the MLPs. Training uses the C4 dataset with a standard setup: LeCun Normal initialization, the AdamW optimizer, and a cosine-decay learning-rate schedule. Ablations vary model depth, hidden dimension, and learning rate.
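
For concreteness, the optimizer and initialization choices described above might look like the following JAX/Optax sketch; the model width, learning rate, and step count are placeholders rather than values reported in the paper:

    import jax
    import optax

    d_model, d_ff = 1024, 32768   # d_ff matches the paper's hidden dimension; d_model is assumed

    # LeCun Normal initialization of an MLP input projection.
    init_fn = jax.nn.initializers.lecun_normal()
    w_in = init_fn(jax.random.PRNGKey(0), (d_model, d_ff))

    # AdamW with a cosine-decay learning-rate schedule (hyperparameters are illustrative).
    schedule = optax.cosine_decay_schedule(init_value=1e-3, decay_steps=100_000)
    optimizer = optax.adamw(learning_rate=schedule, weight_decay=1e-2)
    opt_state = optimizer.init({"w_in": w_in})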

Implications and Future Directions

The research presents significant implications for understanding capacity utilization in Transformer models. The dramatic layer-dependent differences in sparsity behavior suggest that models learn fundamentally different types of features at varying depths. The sporadic nature of neuron activation in higher layers compared to the consistent activation in lower layers indicates a shift from dense, continuous feature spaces to more sparse, binary-like representations.

Practical Implications

If confirmed by further study, the observation that a significant fraction of neurons can be turned off early in training without accuracy loss provides a practical avenue for model pruning and efficiency improvements. This could reduce computational costs and potentially accelerate training times.
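
Mechanically, pruning such units is simple: a hidden unit that never fires contributes nothing after the ReLU, so its column in the MLP input projection and its row in the output projection can be dropped without changing outputs on the data seen so far. A hedged Python sketch, assuming a standard (d_model, d_ff) / (d_ff, d_model) weight layout rather than any layout specified in the paper:

    import numpy as np

    def prune_dead_units(w_in, b_in, w_out, ever_active):
        """Drop hidden units that never fired.

        w_in:  (d_model, d_ff) input projection    b_in:  (d_ff,) bias
        w_out: (d_ff, d_model) output projection   ever_active: (d_ff,) boolean mask
        Returns pruned copies with d_ff reduced to the number of live units.
        """
        keep = np.flatnonzero(ever_active)
        return w_in[:, keep], b_in[keep], w_out[keep, :]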

Theoretical Implications

The findings challenge the conventional view of neuron death as an accidental side-effect of training. Instead, they suggest it is an emergent property of model learning dynamics. This opens new avenues for research into initialization schemes and training regimes that might optimize the utilization of model capacity.

Future Developments in AI

Speculating on future developments, one could envision research aimed at better characterizing the emergent sparsity patterns to develop more efficient training algorithms and model architectures. This could lead to the design of Transformer models that inherently exploit these sparsity patterns for improved performance and efficiency.

In conclusion, this paper provides a deep dive into the sparsity patterns in ReLU-based Transformers, offering both practical insights for model optimization and theoretical contributions to our understanding of neural network training dynamics. As models grow in complexity and size, such nuanced studies will be crucial in guiding the development of more efficient and effective AI systems.
