Abstract

A recent empirical observation (Li et al., 2022b) of activation sparsity in MLP blocks offers an opportunity to drastically reduce computation costs for free. Although the phenomenon has been attributed to training dynamics, existing theoretical explanations of activation sparsity are restricted to shallow networks, small training steps, and special training regimes, despite its emergence in deep models standardly trained for a large number of steps. To fill these gaps, we propose the notion of gradient sparsity as one source of activation sparsity, together with a theoretical explanation that views sparsity as a necessary step toward adversarial robustness with respect to hidden features and parameters, which is approximately the flatness of minima for well-learned models. The theory applies to standardly trained LayerNorm-ed MLPs, and further to Transformers or other architectures trained with weight noise. Eliminating sources of flatness other than sparsity, we discover the phenomenon that the ratio between the largest and smallest non-zero singular values of weight matrices is small. When discussing the emergence of this spectral concentration, we use random matrix theory (RMT) as a powerful tool to analyze stochastic gradient noise. Validation experiments are conducted to verify our gradient-sparsity-based explanation. We propose two plug-and-play modules, applicable to both training and finetuning, that promote sparsity. Experiments on ImageNet-1k and C4 demonstrate 50% sparsity improvements, indicating further potential cost reduction in both training and inference.
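
For concreteness, here is a minimal sketch (our illustration, not code from the paper) of the spectral-concentration quantity mentioned above: the ratio between the largest and the smallest non-zero singular values of a weight matrix. The matrix size and the zero threshold are assumptions.

```python
import torch

# Minimal sketch (illustration only): compute the ratio between the largest
# and the smallest *non-zero* singular values of a weight matrix.
W = torch.randn(512, 2048) / 2048 ** 0.5        # stand-in weight matrix
s = torch.linalg.svdvals(W)                     # singular values, descending
nonzero = s[s > 1e-6 * s[0]]                    # drop numerically-zero values
ratio = (nonzero[0] / nonzero[-1]).item()
print(f"largest / smallest non-zero singular value: {ratio:.2f}")
```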

Figure: virtual perturbations injected into shallow layers incur only a minimal loss increase, illustrating the adversarial robustness learned by deeper layers.

Overview

  • This paper presents a novel perspective linking gradient sparsity with activation sparsity in deep learning, proposing that gradient sparsity significantly contributes to models' implicit adversarial robustness (IAR) and is encouraged by the inherent training bias towards flat minima.

  • It is demonstrated, through extensive experiments with architectural modifications such as Zeroth Biases and J-Squared ReLU on ImageNet-1K and C4, that activation sparsity can be significantly improved without hurting model performance.

  • The findings establish a concrete connection between gradient sparsity and implicit adversarial robustness, offering new insights into the design of cost-effective and robust neural networks.

  • Future research directions include exploring the generalizability of these findings across different architectures and tasks, and further developing architectural modifications and training routines for improved efficiency and robustness.

Theoretical Insights on Sparsity in Deep Learning: Activation, Gradient, and Implications for Robustness

Gradient Sparsity as a Source of Activation Sparsity

Recent research has revisited multi-layer perceptron (MLP) blocks, probing into activation sparsity—where only a small fraction of neurons are active during inference. Notably, an empirical study has unveiled significant activation sparsity across various architectures and tasks without any explicit regularization, suggesting avenues for cost-efficient inference through neuron pruning. Despite previous attempts to understand this phenomenon through the lens of training dynamics, existing explanations fall short when extended to deep networks or large training steps under standard protocols.
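
To make the quantity concrete, the following is a minimal sketch (our illustration, with assumed layer sizes; `MLPBlock` and its `last_sparsity` attribute are hypothetical names, not from the paper) that measures the fraction of post-ReLU hidden activations that are exactly zero in a LayerNorm-ed MLP block:

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """Generic LayerNorm-ed MLP block that records its activation sparsity."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        h = self.act(self.fc1(self.norm(x)))
        # fraction of hidden neurons that are exactly zero for this batch
        self.last_sparsity = (h == 0).float().mean().item()
        return self.fc2(h)

block = MLPBlock()
x = torch.randn(8, 128, 512)          # (batch, tokens, d_model)
with torch.no_grad():
    _ = block(x)
print(f"fraction of inactive neurons: {block.last_sparsity:.2%}")
```

With random weights and random inputs this fraction typically sits near 50%; the empirical observation above is that standard training pushes it much higher, which is what makes neuron pruning at inference time attractive.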

In response, we propose a novel perspective that identifies gradient sparsity as a primary contributor to activation sparsity. Our theoretical framework establishes a link between gradient sparsity, implicit adversarial robustness (IAR), and the flatness of minima. The argument hinges on the observation that standard training practices, characterized by stochastic gradients, inherently favor models that navigate towards flatter minima. These models demonstrate robustness to perturbations in hidden features and parameters, a property we argue is facilitated by sparse gradients. Specifically, we prove that gradient sparsity significantly contributes to the model's implicit adversarial robustness and is thus encouraged by the inherent bias towards flat minima during training. This leads to a natural emergence of activation sparsity in deep layers, particularly in networks featuring LayerNorm and ReLU activations.
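
As a rough illustration of the robustness notion involved (our own sketch, with arbitrary layer sizes, random data, and an assumed perturbation scale; not the paper's experimental protocol), one can inject a small perturbation into the hidden features produced by the shallow layers and check how little the loss increases, echoing the virtual-perturbation picture in the figure above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),   # "shallow" part whose output we perturb
    nn.Linear(64, 64), nn.ReLU(),   # "deep" part that should absorb the noise
    nn.Linear(64, 10),
)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))

shallow, deep = model[:2], model[2:]

with torch.no_grad():
    h = shallow(x)
    clean_loss = criterion(deep(h), y)
    eps = 0.05 * h.std()                                  # small perturbation scale
    noisy_loss = criterion(deep(h + eps * torch.randn_like(h)), y)

# a flat / implicitly robust model shows only a small increase here
print(f"loss increase under perturbation: {(noisy_loss - clean_loss).item():.4f}")
```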

Empirical Validation and Architectural Modifications

Our theoretical analyses are corroborated by extensive experiments conducted on widely recognized datasets such as ImageNet-1K and C4. In these experiments, we introduce two novel architectural modifications, namely Zeroth Biases and J-Squared ReLU, that are explicitly designed to enhance sparsity. These modifications are informed by our theoretical findings and are shown to significantly improve activation sparsity without compromising model performance. For instance, models trained with these modifications on ImageNet-1K and C4 exhibit roughly 50% improvements in sparsity metrics, demonstrating the practical applicability and effectiveness of our theoretical insights.
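
Both modules are intended to be plug-and-play, i.e. dropped into an existing architecture before training or finetuning. The sketch below shows only the generic swap pattern (our illustration); `SquaredReLU` is the standard relu(x)^2 used as a stand-in and is not the paper's J-Squared ReLU or Zeroth Biases, whose exact definitions are given in the paper.

```python
import torch
import torch.nn as nn

class SquaredReLU(nn.Module):
    """Standard squared ReLU, used here only as an illustrative stand-in."""
    def forward(self, x):
        return torch.relu(x) ** 2

def swap_relu(module: nn.Module) -> None:
    """Recursively replace every nn.ReLU submodule with SquaredReLU."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, SquaredReLU())
        else:
            swap_relu(child)

net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
swap_relu(net)
print(net)   # the ReLU is now a SquaredReLU; the model can then be finetuned
```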

Beyond Activation Sparsity: Broader Implications

The implications of our work extend beyond the specific context of activation sparsity. By establishing a concrete link between gradient sparsity and implicit adversarial robustness, we provide a new dimension for understanding model behavior and optimization in deep learning. This connection invites further investigation into the design of more cost-effective and robust neural architectures. Additionally, our findings shed light on the interplay between model architecture, training dynamics, and the resulting sparsity patterns, offering a richer framework for exploring these relationships.

Future Directions and Conclusion

Our work opens several avenues for future research. One immediate direction involves exploring the generalizability of our findings across a broader spectrum of architectures and tasks. Additionally, further theoretical work is needed to tighten the connections between gradient sparsity, flatness of the loss landscape, and model robustness. Lastly, the development of novel architectural modifications and training routines that leverage our insights to achieve even greater efficiency and robustness represents an exciting area of exploration.

In conclusion, this paper advances our understanding of activation sparsity in deep learning by highlighting the role of gradient sparsity. Through rigorous theoretical analyses and empirical validation, we establish gradient sparsity as a key mechanism through which models achieve both activation sparsity and adversarial robustness. These findings not only demystify the origins of activation sparsity but also open new horizons for designing efficient and robust neural networks.
