The Devil in Linear Transformer

Published 19 Oct 2022 in cs.CL and cs.LG | (2210.10340v1)

Abstract: Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performances on various tasks and corpus. In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation adversely impact the convergence of linear transformer models; 2) attention dilution which trivially distributes attention scores over long sequences while neglecting neighbouring structures. To address these issues, we first identify that the scaling of attention matrices is the devil in unbounded gradients, which turns out unnecessary in linear attention as we show theoretically and empirically. To this end, we propose a new linear attention that replaces the scaling operation with a normalization to stabilize gradients. For the issue of attention dilution, we leverage a diagonal attention to confine attention to only neighbouring tokens in early layers. Benefiting from the stable gradients and improved attention, our new linear transformer model, transNormer, demonstrates superior performance on text classification and language modeling tasks, as well as on the challenging Long-Range Arena benchmark, surpassing vanilla transformer and existing linear variants by a clear margin while being significantly more space-time efficient. The code is available at https://github.com/OpenNLPLab/Transnormer .

Summary

The paper identifies that conventional scaling in kernel-based linear attention leads to unbounded gradients, undermining training stability.
It shows that attention dilution, most visible in early layers, spreads attention across the whole sequence and weakens local context capture, which degrades performance.
The paper introduces TransNormer with NormAttention and DiagAttention, which stabilize gradients and preserve local semantic relationships.

Introduction

The paper "The Devil in Linear Transformer" (2210.10340) provides a critical examination of kernel-based linear transformers, identifying two primary technical shortcomings—unbounded gradients and attention dilution—and proposes architectural modifications in the TransNormer model to address these issues. The analysis is situated in the context of reducing the quadratic complexity associated with vanilla transformers while maintaining competitive performance on tasks ranging from language modeling to long-sequence benchmarks.

Technical Analysis of Identified Issues

Unbounded Gradients

The study shows that the scaling step used in kernel-based linear attention, which divides each row of similarity scores by its sum, leads to unbounded gradients. The derivative of the attention output with respect to the token-wise similarities is inversely proportional to that row sum, so it can grow without bound when the sum becomes small, and training becomes unstable. By examining the theoretical gradient expressions, the paper argues that the scaling factor, often justified from a signal-propagation perspective in vanilla transformers, is unnecessary in linear attention. The instability is characterized quantitatively by a higher relative standard deviation of gradient norms during the backward pass, which the authors link to convergence problems in practice.
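The effect of the scaling step can be illustrated numerically. The following is a minimal sketch, not the paper's derivation: it takes one row of positive similarity scores, scales it by its sum, and evaluates the analytic Jacobian of the scaled weights with respect to the raw scores. The function names and the example scores are hypothetical.

```python
import numpy as np

def scaled_scores(s):
    """Scale one row of positive similarity scores so that it sums to 1."""
    return s / s.sum()

def scaled_scores_jacobian(s):
    """Analytic Jacobian J[j, k] = d p_j / d s_k = (delta_jk - p_j) / sum(s)."""
    p = scaled_scores(s)
    return (np.eye(len(s)) - p[:, None]) / s.sum()

base = np.array([0.2, 0.5, 0.3])  # hypothetical similarity scores for one query
for scale in (1.0, 1e-2, 1e-4):
    jac = scaled_scores_jacobian(base * scale)
    # The scaled weights are identical at every scale, but the Jacobian
    # norm grows like 1 / scale as the row sum approaches zero.
    print(f"row sum = {scale:.0e}, Jacobian norm = {np.linalg.norm(jac):.3e}")
```

Shrinking the scores leaves the attention weights unchanged while the Jacobian norm grows in inverse proportion to the row sum, which is the kind of unbounded behaviour the section describes.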

Attention Dilution

A pivotal observation in the paper is that standard linear attention mechanisms tend to "dilute" attention across the entire sequence rather than focusing on locally relevant tokens, particularly in the early layers. Empirical attention visualizations and locally accumulated attention metrics reveal that such a spread hinders the model from effectively capturing local contextual dependencies. This contrasts sharply with vanilla transformers where the inherent softmax weighting results in more concentrated attentional distributions over neighboring tokens. The authors quantify this discrepancy, demonstrating that attention dilution severely undermines the model's ability to preserve local information, leading to degraded performance on tasks that rely on fine-grained local semantics.
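A locally accumulated attention measure of this kind can be sketched as follows. This is an illustrative assumption about how such a metric might be computed, not the paper's exact definition: for each query position it sums the attention weights placed on keys within a fixed distance. The function name `local_attention_mass` and the `radius` parameter are hypothetical.

```python
import numpy as np

def local_attention_mass(attn, radius):
    """For each query i, sum the attention weights on keys j with |i - j| <= radius."""
    n = attn.shape[0]
    idx = np.arange(n)
    near = np.abs(idx[:, None] - idx[None, :]) <= radius
    return (attn * near).sum(axis=1)

n, radius = 100, 5
uniform = np.full((n, n), 1.0 / n)  # fully diluted: every key gets the same weight
focused = np.eye(n)                 # fully local: each token attends only to itself

# For a token in the middle of the sequence, diluted attention leaves only
# (2 * radius + 1) / n of the mass nearby, while focused attention keeps all of it.
print(local_attention_mass(uniform, radius)[n // 2])
print(local_attention_mass(focused, radius)[n // 2])
```

The longer the sequence, the smaller the share of a uniformly spread distribution that falls inside any fixed neighbourhood, which is why dilution becomes more damaging on long inputs.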

Architectural Innovations in TransNormer

NormAttention: Stabilizing Gradients

As a remedy for the gradient instability, the paper introduces NormAttention. Instead of the conventional scaling operation, the method applies a normalization step (e.g., LayerNorm or RMSNorm) to the attention output. This bounds the gradients while preserving the linear attention structure. Theoretical derivations in the paper show that the gradient of the normalized attention is bounded by an expression involving the loss gradient, norms of the value matrix, and the RMSNorm epsilon parameter. Empirical results show a significant reduction in gradient variance, in particular a lower relative standard deviation of gradients in early layers, which contributes to more stable convergence during training.
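The idea can be sketched in a few lines of NumPy. This is a minimal non-causal sketch under stated assumptions, not the authors' implementation: the `1 + elu` feature map, the gain-free RMSNorm, and all function names are choices made here for illustration.

```python
import numpy as np

def feature_map(x):
    """1 + elu(x): a positive kernel feature map (an assumed choice for this sketch)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def rms_norm(x, eps=1e-6):
    """RMSNorm over the feature dimension, without a learnable gain."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def norm_attention(q, k, v, eps=1e-6):
    """Linear attention whose output is normalized instead of divided by row sums."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                   # (d, d_v): computed once, cost linear in n
    return rms_norm(q @ kv, eps)   # no division by the sum of similarities

rng = np.random.default_rng(0)
n, d = 16, 8                       # hypothetical sequence length and head dimension
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = norm_attention(q, k, v)
print(out.shape)
```

Computing `k.T @ v` before multiplying by `q` is what keeps the cost linear in sequence length; the normalization then takes over the role that row-sum scaling played.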

DiagAttention: Constraining Local Context

To address attention dilution, the paper proposes using DiagAttention in the early layers of the model. DiagAttention employs a block-wise diagonal attention mask that restricts each token to attend only to tokens in its own local block. This localized mechanism counteracts the tendency of attention scores to spread uniformly across long sequences, so that local semantic relationships are adequately modeled. Because each block has a fixed size, the overall complexity remains linear in sequence length despite the strict local constraint. Using DiagAttention in early layers also improves performance on tasks sensitive to local context, as evidenced by gains on text classification benchmarks.
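Block-wise diagonal attention can be sketched as follows. This is a minimal illustration under assumptions made here: non-overlapping blocks, softmax attention inside each block, a sequence length that is a multiple of the block size, and hypothetical names throughout.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diag_attention(q, k, v, block_size):
    """Softmax attention restricted to non-overlapping blocks of block_size tokens."""
    n, d = q.shape
    assert n % block_size == 0, "this sketch assumes n is a multiple of block_size"
    b = n // block_size
    qb = q.reshape(b, block_size, d)
    kb = k.reshape(b, block_size, d)
    vb = v.reshape(b, block_size, -1)
    # One (block_size x block_size) score matrix per block, so the total
    # cost is O(n * block_size * d) rather than O(n^2 * d).
    scores = qb @ kb.transpose(0, 2, 1) / np.sqrt(d)
    return (softmax(scores) @ vb).reshape(n, -1)

rng = np.random.default_rng(0)
n, d, block_size = 12, 4, 3        # hypothetical sizes
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = diag_attention(q, k, v, block_size)
print(out.shape)
```

The result is the same as applying a block-diagonal mask to full attention, but the off-diagonal scores are never computed, which is where the linear cost comes from.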

Empirical Results

The TransNormer model, combining NormAttention and DiagAttention, is evaluated against vanilla transformers and other state-of-the-art linear transformer variants. Key performance metrics across multiple settings include:

Text Classification and Language Modeling: The TransNormer demonstrates improved accuracy and perplexity scores, surpassing both vanilla and existing linear models. The bounded gradient mechanism enables deeper network architectures without succumbing to training instabilities.
Long-Range Arena (LRA) Benchmark: On challenging long-sequence tasks, the model achieves superior performance while maintaining linear space-time complexity, marking a substantial improvement over established baselines.
Gradient Analysis: A comparative study of gradient norms reveals that the normalized attention mechanism yields tighter gradient distributions, which translates to faster and more robust convergence during training.
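The relative standard deviation used in this kind of gradient comparison is the standard deviation divided by the magnitude of the mean. A minimal sketch follows; the function name is hypothetical and the input values are made-up numbers for illustration only, not results from the paper.

```python
import numpy as np

def relative_std(values):
    """Relative standard deviation: population std divided by |mean|."""
    values = np.asarray(values, dtype=float)
    return values.std() / np.abs(values.mean())

# Made-up gradient norms, purely to show how the statistic behaves.
steady = [1.0, 1.1, 0.9, 1.0]
spiky = [0.1, 3.0, 0.2, 0.7]
print(relative_std(steady), relative_std(spiky))
```

Because the statistic is scale-free, it allows gradient spread to be compared across layers or models whose gradient magnitudes differ.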

These quantitative improvements underscore the efficacy of the proposed architectural enhancements, as the TransNormer clearly mitigates the detrimental effects of unbounded gradients and attention dilution.

Conclusion

In summary, "The Devil in Linear Transformer" delivers a technically rigorous analysis of the limitations of kernel-based linear transformer models. By substantiating, both theoretically and empirically, that the scaling operation and attention dilution each contribute to performance degradation, the paper justifies the architectural changes in the TransNormer model. NormAttention stabilizes gradient propagation while DiagAttention preserves local context, yielding a model that outperforms the vanilla transformer and existing linear variants on the reported benchmarks while retaining the linear space-time efficiency needed to scale to long sequences. This work is particularly relevant for applications that process lengthy inputs, where quadratic attention becomes computationally prohibitive.
