MABViT -- Modified Attention Block Enhances Vision Transformers (2312.01324v2)

Published 3 Dec 2023 in cs.CV and cs.LG

Abstract: Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly in LLMs. Additionally, using a parallel configuration within each Transformer block, rather than the conventional serialized arrangement, has been shown to accelerate the training of LLMs without significantly harming performance. However, when the MLP and attention blocks are run in parallel for image classification, we observe a noticeable decline in performance. We propose a novel transformer variant that integrates non-linearity within the attention block to address this problem. We apply a GLU-based activation function to the Value tensor, and this technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset while using fewer parameters. It also outperforms the B/16 variant while using only half the parameters. Furthermore, we provide results with a GELU activation function variant to confirm our claims. Lastly, we show that the MABViT variants exhibit greater potential than the standard architecture when used in deep transformers.
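
The modification described in the abstract, applying a GLU-style gated non-linearity to the Value tensor inside the attention block, can be sketched as follows. This is a minimal illustrative sketch based only on the abstract, not the authors' implementation: the module name `GatedValueAttention`, the GELU gate, the head configuration, and the dimensions are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedValueAttention(nn.Module):
    """Multi-head self-attention with a GLU-style gated Value projection (sketch)."""

    def __init__(self, dim: int, num_heads: int = 6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        # GLU-style Value: a linear branch modulated by a non-linear gate.
        self.v_proj = nn.Linear(dim, dim)
        self.v_gate = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q = self.q_proj(x)
        k = self.k_proj(x)
        # Gated (GLU-based) Value tensor; GELU stands in for the paper's
        # GLU/GELU activation variants (an assumption, not the exact recipe).
        v = self.v_proj(x) * F.gelu(self.v_gate(x))

        # Reshape to (batch, heads, tokens, head_dim) and attend as usual.
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)


# Usage: token embeddings shaped (batch, tokens, embed_dim), e.g. ViT-S/16 sizes.
x = torch.randn(2, 197, 384)
y = GatedValueAttention(dim=384, num_heads=6)(x)
print(y.shape)  # torch.Size([2, 197, 384])
```

The gate here uses GELU, matching the GELU variant mentioned in the abstract; a SwiGLU-style gate would simply swap `F.gelu` for `F.silu`. Note that the gated Value adds one extra linear projection relative to standard attention.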
