
Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers (2403.10030v3)

Published 15 Mar 2024 in cs.CV

Abstract: The Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. To make ViTs more efficient, recent works reduce the quadratic cost of the self-attention layer by pruning or fusing redundant tokens. However, these works face a speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose Multi-criteria Token Fusion (MCTF), which gradually fuses tokens based on multiple criteria (e.g., similarity, informativeness, and the size of fused tokens). Further, we utilize one-step-ahead attention, an improved approach to capturing the informativeness of tokens. By training the model equipped with MCTF using a token reduction consistency, we achieve the best speed-accuracy trade-off in image classification (ImageNet-1K). Experimental results show that MCTF consistently surpasses previous reduction methods both with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving performance over the base model (+0.5% and +0.3%, respectively). We also demonstrate the applicability of MCTF to various Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least a 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF.
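
To make the idea concrete, below is a minimal PyTorch sketch of multi-criteria token fusion, assuming a simple weighted combination of the three criteria named in the abstract (similarity, informativeness, and size of fused tokens) followed by a greedy pairing step. The function name, the trade-off weight tau, and the exact way the criteria are combined are illustrative assumptions, not the paper's formulation; the authors' official implementation is at https://github.com/mlvlab/MCTF.

import torch
import torch.nn.functional as F

def multi_criteria_fusion(tokens, attn, sizes, r, tau=1.0):
    """Minimal sketch of multi-criteria token fusion (illustrative only, NOT the
    official MCTF code; see https://github.com/mlvlab/MCTF).

    tokens: (B, N, D) token embeddings
    attn:   (B, N)    per-token informativeness, e.g. mean attention received
    sizes:  (B, N)    number of original patches already merged into each token
    r:      number of token pairs to fuse in this layer
    tau:    assumed trade-off weight between the criteria
    """
    B, N, _ = tokens.shape

    # Criterion 1: pairwise cosine similarity between tokens.
    feat = F.normalize(tokens, dim=-1)
    sim = feat @ feat.transpose(1, 2)                      # (B, N, N)

    # Criterion 2: prefer fusing uninformative tokens (low attention).
    info = attn.unsqueeze(1) + attn.unsqueeze(2)           # (B, N, N)

    # Criterion 3: prefer fusing tokens that are still small, so no single
    # token absorbs too many patches.
    size = (sizes.unsqueeze(1) + sizes.unsqueeze(2)).float()

    # Combine the criteria into one fusion score; mask the diagonal so a
    # token is never paired with itself.
    score = sim - tau * info - tau * size
    eye = torch.eye(N, dtype=torch.bool, device=tokens.device)
    score = score.masked_fill(eye, float("-inf"))

    out_tokens = tokens.clone()
    out_sizes = sizes.clone().float()
    keep = torch.ones(B, N, dtype=torch.bool, device=tokens.device)

    # Greedily fuse the r highest-scoring disjoint pairs per image; each fusion
    # is a size-weighted average, so repeated merging keeps a running mean of patches.
    for b in range(B):
        used, fused = set(), 0
        for idx in score[b].flatten().argsort(descending=True).tolist():
            if fused >= r:
                break
            i, j = divmod(idx, N)
            if i in used or j in used:
                continue
            w_i, w_j = out_sizes[b, i], out_sizes[b, j]
            out_tokens[b, i] = (w_i * out_tokens[b, i] + w_j * out_tokens[b, j]) / (w_i + w_j)
            out_sizes[b, i] = w_i + w_j
            keep[b, j] = False
            used.update((i, j))
            fused += 1
    return out_tokens, out_sizes, keep  # drop tokens where keep is False

Dropping the tokens marked False in keep after each block yields the shortened token sequence for the next layer; the one-step-ahead attention and token reduction consistency training described in the paper are not modeled in this sketch.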
