DeepViT: Towards Deeper Vision Transformer

Published 22 Mar 2021 in cs.CV | (2103.11886v4)

Abstract: Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting the expected performance gain. Based on the above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modification to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet. Code is publicly available at https://github.com/zhoudaquan/dvit_repo.


Authors (8)

Citations (472)


Summary

The paper introduces Re-attention to counteract attention collapse in deeper Vision Transformers.
The methodology enhances inter-head diversity, achieving a 1.6% Top-1 accuracy improvement on ImageNet.
These insights facilitate scaling ViT architectures effectively for advanced computer vision applications.

An Analytical Overview of DeepViT: Towards Deeper Vision Transformers

The paper "DeepViT: Towards Deeper Vision Transformer" investigates how Vision Transformers (ViTs) scale with depth and identifies a notable obstacle to making them deeper. Unlike Convolutional Neural Networks (CNNs), ViTs do not consistently improve as layers are added, which the authors attribute to a phenomenon they term "attention collapse."

Key Observations

The researchers found that ViT performance saturates once network depth exceeds a certain threshold. They attribute this to attention collapse: self-attention maps lose diversity in deeper layers, becoming increasingly similar and, past a certain layer, nearly identical. As a result, the deeper blocks contribute few new representations and performance stagnates.
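One way to make the notion of attention collapse concrete is to compare attention maps across layers. The sketch below is an illustration only: it uses a head-averaged cosine similarity between flattened attention maps, which is an assumed metric and not necessarily the one used in the paper. The function names (`cosine_similarity`, `layer_similarity`) and the synthetic data are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity between two attention maps, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def layer_similarity(attn_maps):
    """Compare attention maps layer by layer.

    attn_maps: array of shape (layers, heads, tokens, tokens).
    Returns a (layers, layers) matrix of head-averaged cosine similarities.
    """
    n_layers, n_heads = attn_maps.shape[:2]
    sim = np.zeros((n_layers, n_layers))
    for p in range(n_layers):
        for q in range(n_layers):
            sim[p, q] = np.mean([
                cosine_similarity(attn_maps[p, h], attn_maps[q, h])
                for h in range(n_heads)
            ])
    return sim

# Synthetic attention maps: 4 layers, 2 heads, 5 tokens (assumed sizes).
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(4, 2, 5, 5)))
attn[3] = attn[2]  # simulate collapse: the last layer repeats the previous one

sim = layer_similarity(attn)
print(np.round(sim, 3))
```

In this toy setup the entry for layers 2 and 3 is close to 1, while entries for independently generated layers are lower. On a real model, values near 1 between adjacent deep layers would be the signature of the collapse described above.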

Methodological Contribution

To counteract attention collapse, the authors propose a technique named Re-attention. The mechanism regenerates the attention maps within a transformer block by exploiting interactions between its attention heads: a learnable transformation matrix mixes the maps across heads, increasing their diversity at negligible computation and memory cost.
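The head-mixing step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `re_attention`, the tensor shapes, and the initialization of `theta` are hypothetical, `theta` is a fixed array here whereas it would be learned in training, and any normalization the paper applies to the mixed maps is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Mix per-head attention maps with a head-by-head matrix.

    q, k, v: arrays of shape (heads, tokens, dim).
    theta:   array of shape (heads, heads); learnable in a real model.
    Returns the attended values and the mixed attention maps.
    """
    d = q.shape[-1]
    # Standard scaled dot-product attention maps, one per head: (H, T, T).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    # New head g is a weighted sum of the original heads h: sum_h theta[h, g] * attn[h].
    mixed = np.einsum("hg,htu->gtu", theta, attn)
    return mixed @ v, mixed

# Assumed toy sizes: 4 heads, 6 tokens, 8 dimensions per head.
rng = np.random.default_rng(0)
H, T, D = 4, 6, 8
q, k, v = (rng.normal(size=(H, T, D)) for _ in range(3))
theta = np.eye(H) + 0.1 * rng.normal(size=(H, H))  # near-identity start (assumption)

out, mixed = re_attention(q, k, v, theta)
print(out.shape)  # (4, 6, 8)
```

With `theta` set to the identity matrix, the function reduces to ordinary multi-head attention, which is why the change can be described as a minor modification to an existing ViT block. The extra cost is one heads-by-heads matrix per block.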

Empirical Results

The proposed Re-attention demonstrates notable improvements in classification accuracy on the ImageNet-1k dataset. For instance, applying Re-attention to a deep ViT model with 32 blocks resulted in a 1.6% improvement in Top-1 accuracy. Importantly, these gains are achieved without reliance on pre-training with extra large-scale datasets. This empirical success underscores Re-attention’s efficacy in facilitating deeper ViTs while maintaining computational practicality.

Comparison and Implications

The paper draws parallels between the depth scalability challenges of ViTs and the early limitations seen with CNNs, wherein deeper architectures initially failed to provide expected performance gains. However, contemporary advancements in CNNs have leveraged architectural modifications to overcome these issues. The proposed Re-attention plays a similar role for ViTs, showing promise in mitigating constraints by enhancing intra-layer diversity through minimal modifications.

Theoretical and Practical Implications

Theoretically, the study strengthens the understanding of how self-attention mechanisms operate within deep networks, potentially guiding further exploration into training strategies that avoid attention redundancy. Practically, the ability to scale ViTs effectively opens avenues for deploying these models in demanding tasks where richer feature extraction is critical.

Future Directions

Future research might further optimize Re-attention or design new transformer blocks built around it. Combining it with other transformer adaptations could yield additional improvements. The insights on managing attention collapse may also inform efficient training of transformer-based architectures beyond vision tasks.

Conclusion

The DeepViT framework illuminates a pathway towards effectively utilizing deeper ViT architectures, addressing intrinsic challenges through Re-attention. This contribution not only advances the current understanding of transformer scalability but also sets a foundation for future innovations in deep learning methodologies and applications.
