
Abstract

Transformers have become foundational architectures for both natural language and computer vision tasks. However, their high computational cost makes them challenging to deploy on resource-constrained devices. This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules. LayerNorm is commonly used in transformer architectures but is not computationally friendly because statistics must be calculated during inference. However, replacing LayerNorm with the more efficient BatchNorm in transformers often leads to inferior performance and training collapse. To address this problem, we propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm during training. Moreover, we propose a simplified linear attention (SLA) module that is simple yet effective in achieving strong performance. Extensive experiments on image classification as well as object detection demonstrate the effectiveness of our proposed method. For example, our SLAB-Swin obtains $83.6\%$ top-1 accuracy on ImageNet-1K with $16.2$ms latency, which is $2.4$ms less than that of Flatten-Swin with $0.1\%$ higher accuracy. We also evaluated our method on language modeling tasks and obtained comparable performance with lower latency. Code is publicly available at https://github.com/xinghaochen/SLAB and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SLAB.

Overview

  • The paper introduces Progressive Re-parameterized BatchNorm (PRepBN) to improve the efficiency of transformers by gradually transitioning from LayerNorm to BatchNorm, enhancing training stability and inference speed.

  • Simplified Linear Attention (SLA) reduces the quadratic complexity of traditional attention mechanisms by using ReLU as a kernel function and depth-wise convolution for local feature enhancement.

  • These innovations enable transformers to achieve high accuracy with reduced latency, making them more deployable on resource-constrained devices and paving the way for further advancements in normalization and attention mechanisms.

Understanding SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Batch Normalization

Introduction

Transformers have been game changers in both NLP and computer vision. However, their significant computational demands make it difficult to use them on resource-constrained devices. The paper we're looking at today tackles this issue by optimizing two computation-heavy components of transformers: the normalization layers and the attention modules.

Key Innovations

1. Progressive Re-parameterized BatchNorm (PRepBN)

Why it Matters: Layer Normalization (LayerNorm) is standard in transformers, but it isn't computationally friendly because its statistics must be computed on the fly at inference time. BatchNorm, whose statistics are fixed after training and can be folded into neighboring linear layers, is much cheaper, but naively using it in transformers typically degrades performance or even causes training to collapse. PRepBN is designed to address these limitations.
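
To make the inference-cost argument concrete, here is a minimal PyTorch sketch of the standard trick of folding a BatchNorm layer (with frozen statistics) into a preceding linear layer. The function name and shapes are illustrative, and this is generic re-parameterization arithmetic rather than code from the paper; LayerNorm admits no such folding because its mean and variance depend on each individual input.

```python
import torch
import torch.nn as nn

def fold_bn_into_linear(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    """Fold y = BN(Wx + b) into a single linear layer (inference only)."""
    std = torch.sqrt(bn.running_var + bn.eps)      # fixed after training
    scale = bn.weight / std                        # per-feature gamma / std
    fused = nn.Linear(linear.in_features, linear.out_features)
    fused.weight.data = linear.weight * scale[:, None]
    fused.bias.data = (linear.bias - bn.running_mean) * scale + bn.bias
    return fused

# Sanity check: the fused layer matches Linear -> BatchNorm in eval mode.
linear, bn = nn.Linear(64, 64), nn.BatchNorm1d(64).eval()
x = torch.randn(8, 64)
fused = fold_bn_into_linear(linear, bn)
assert torch.allclose(fused(x), bn(linear(x)), atol=1e-5)
```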

How it Works:

  • Progressive Strategy: This method gradually transitions from LayerNorm to BatchNorm during training. Initially, the model relies on LayerNorm, which provides stability, and over time shifts to BatchNorm, which is faster during inference.
  • Re-parameterized BatchNorm: To further stabilize training, PRepBN adds a learnable parameter that modulates an identity branch alongside the BatchNorm output; after training, this extra branch can be re-parameterized away, leaving a plain BatchNorm (see the sketch after this list).
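
As a concrete illustration, the following PyTorch sketch captures one reasonable reading of the two ideas: `RepBN` adds a learnable scale `eta` on an identity branch next to a BatchNorm, and `ProgressiveNorm` blends LayerNorm and RepBN with a coefficient `lam` that decays linearly from 1 to 0 over training. The class names, the decay schedule, and the step bookkeeping are assumptions made for illustration; the authors' implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class RepBN(nn.Module):
    """BatchNorm plus a learnable identity branch: y = BN(x) + eta * x."""
    def __init__(self, dim: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.eta = nn.Parameter(torch.zeros(1))

    def forward(self, x):                  # x: (batch, tokens, dim)
        x = x.transpose(1, 2)              # BatchNorm1d expects (batch, dim, tokens)
        x = self.bn(x) + self.eta * x
        return x.transpose(1, 2)

class ProgressiveNorm(nn.Module):
    """Blend LayerNorm and RepBN; lam decays linearly from 1 to 0 during training."""
    def __init__(self, dim: int, total_steps: int):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.repbn = RepBN(dim)
        self.total_steps = total_steps
        self.register_buffer("step", torch.zeros(1))

    def forward(self, x):
        if self.training:
            lam = (1.0 - self.step / self.total_steps).clamp(min=0.0)
            self.step += 1
            return lam * self.ln(x) + (1.0 - lam) * self.repbn(x)
        return self.repbn(x)               # only the BatchNorm path is left at inference
```

At inference only the RepBN path remains, and because its running statistics are frozen it can be folded into neighboring linear projections just like an ordinary BatchNorm.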

Results Highlight: PRepBN is shown to be effective for both image classification and object detection. For instance, SLAB-Swin achieves 83.6% top-1 accuracy on ImageNet-1K with a latency of 16.2 ms, which is 2.4 ms faster than Flatten-Swin while also being 0.1% more accurate.

2. Simplified Linear Attention (SLA)

Why it Matters: The traditional attention mechanism in transformers is computationally expensive because its cost grows quadratically with the number of tokens. Linear attention replaces the softmax with a kernel function so the matrix products can be reordered, bringing the cost down to linear in the sequence length.

How it Works:

  • Simplification: SLA uses ReLU as the kernel function and incorporates a depth-wise convolution for local feature enhancement. This approach is simpler and more efficient.
  • Decoupling: Because the softmax is gone, the attention can be computed as Q(K^T V) instead of (Q K^T)V, which reduces the complexity from quadratic to linear in the number of tokens while maintaining performance (see the sketch below).
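
The sketch below puts these pieces together in PyTorch: ReLU-activated queries and keys, the reordered K^T V product, and a depth-wise convolution on the value path for local feature enhancement. It is a hedged illustration of the general recipe rather than the authors' exact module; the head layout, the normalization term, and the kernel size of the depth-wise convolution are assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedLinearAttention(nn.Module):
    """Linear attention with a ReLU kernel plus a depth-wise conv branch (sketch)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Depth-wise convolution restores local detail on the value path.
        self.dwc = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                                    # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # ReLU kernel keeps the attention weights non-negative without a softmax.
        q = torch.relu(q).reshape(B, N, self.heads, self.head_dim).transpose(1, 2)
        k = torch.relu(k).reshape(B, N, self.heads, self.head_dim).transpose(1, 2)
        vh = v.reshape(B, N, self.heads, self.head_dim).transpose(1, 2)
        # Decoupling: compute K^T V first, so the cost is linear in N, not quadratic.
        kv = k.transpose(-2, -1) @ vh                        # (B, H, d, d)
        denom = q @ k.sum(dim=2).unsqueeze(-1) + 1e-6        # (B, H, N, 1) normalizer
        out = (q @ kv) / denom                               # (B, H, N, d)
        out = out.transpose(1, 2).reshape(B, N, -1)
        out = out + self.dwc(v.transpose(1, 2)).transpose(1, 2)  # local enhancement
        return self.proj(out)

# Example: 196 tokens (a 14x14 feature map) with embedding dim 384.
attn = SimplifiedLinearAttention(dim=384, heads=8)
y = attn(torch.randn(2, 196, 384))                           # -> (2, 196, 384)
```

Whether to keep or simplify the normalization term is a design choice in this family of attention modules; the official repository has the definitive implementation.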

Results Highlight: Across the evaluated benchmarks, the SLAB transformer equipped with SLA delivers noticeably lower latency while maintaining accuracy comparable to existing models.

Broader Implications

Theoretical Implications

  • Normalization Strategy: The success of PRepBN could pave the way for more advanced normalization techniques that offer a balance between computational efficiency and model stability.
  • Attention Mechanisms: SLA's effectiveness suggests that research into simpler, linear attention mechanisms could be a fertile ground for further innovations.

Practical Implications

  • Scalability: These optimizations mean that powerful transformers can be deployed on less powerful hardware.
  • Efficiency Gains: Industries can use these techniques to reduce operational costs related to computational resources, making sophisticated models more accessible.

Future Directions

One can speculate on several intriguing avenues for future research and practical implementation:

  1. Adaptation to Various Domains: While the paper mainly focuses on vision and language models, future research could adapt these techniques to other fields such as reinforcement learning or time-series analysis.
  2. Hybrid Models: Combining PRepBN and SLA with other efficiency techniques could yield even more scalable transformers.
  3. Further Optimization: There is always room for more fine-tuning in the progressive transition strategy and the linear attention mechanism to achieve better performance.

Conclusion

The improvements proposed in this paper, namely the Progressive Re-parameterized BatchNorm and Simplified Linear Attention, demonstrate tangible advancements in making transformers more efficient without sacrificing performance. By carefully rethinking normalization layers and attention modules, the researchers have paved the way for more accessible and scalable transformer architectures. The discussed techniques are not just incremental improvements but foundational steps that make efficient transformers feasible in real-world, resource-constrained environments. This is indeed an exciting step forward for the deployment of advanced AI models in broader applications.
