Morphing Tokens Draw Strong Masked Image Models (2401.00254v4)

Published 30 Dec 2023 in cs.CV

Abstract: Masked image modeling (MIM) has emerged as a promising approach for pre-training Vision Transformers (ViTs). MIM methods predict masked tokens token-wise to recover target signals that are tokenized from images or generated by pre-trained models such as vision-language models. While using tokenizers or pre-trained models is viable, they often provide spatially inconsistent supervision even for neighboring tokens, hindering models from learning discriminative representations. Our pilot study identifies this spatial inconsistency in supervisory signals and suggests that addressing it can improve representation learning. Building on this insight, we introduce Dynamic Token Morphing (DTM), a novel method that dynamically aggregates tokens while preserving context to generate contextualized targets, thereby likely reducing spatial inconsistency. DTM is compatible with various SSL frameworks; we showcase significantly improved MIM results while introducing barely any extra training cost. Our method facilitates MIM training by using more spatially consistent targets, yielding improved training trends as evidenced by lower losses. Experiments on ImageNet-1K and ADE20K demonstrate DTM's superiority over complex state-of-the-art MIM methods. Furthermore, transfer-learning evaluations on downstream tasks such as iNaturalist, along with extensive empirical studies, support DTM's effectiveness.
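To make the target-aggregation idea concrete, below is a minimal, hypothetical PyTorch sketch of generating contextualized MIM targets by averaging each teacher token with its most similar tokens. The grouping rule (top-k cosine similarity), the function names, and the parameter `k` are all illustrative assumptions for this sketch; the paper's actual Dynamic Token Morphing defines its own matching and aggregation scheme.

```python
import torch
import torch.nn.functional as F

def morph_targets(targets: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Hypothetical sketch of contextualized-target generation.

    targets: (B, N, D) per-token supervision from a frozen teacher
             (e.g., a tokenizer or a vision-language model).
    Each token's target is replaced by the mean of its k most similar
    target tokens (including itself), smoothing spatially inconsistent
    supervision. This simple rule is an assumption, not the paper's
    exact algorithm.
    """
    B, N, D = targets.shape
    k = min(k, N)
    x = F.normalize(targets, dim=-1)
    sim = x @ x.transpose(1, 2)                   # (B, N, N) cosine similarities
    idx = sim.topk(k, dim=-1).indices             # (B, N, k); self is always included
    neighbors = torch.gather(
        targets.unsqueeze(1).expand(B, N, N, D),  # broadcast the token table per query
        2,
        idx.unsqueeze(-1).expand(B, N, k, D),
    )                                             # (B, N, k, D)
    return neighbors.mean(dim=2)                  # (B, N, D) contextualized targets

def mim_loss(pred: torch.Tensor, teacher_feats: torch.Tensor,
             mask: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Usage sketch: regress masked-position predictions onto morphed targets.

    pred: (B, N, D) student outputs; mask: (B, N) bool, True = masked.
    """
    with torch.no_grad():                         # targets carry no gradient
        tgt = morph_targets(teacher_feats, k)
    return F.mse_loss(pred[mask], tgt[mask])
```

Because the morphing runs only on the frozen teacher's outputs under `no_grad`, a scheme like this adds essentially no trainable parameters and little overhead, which is consistent with the abstract's claim of barely any extra training cost.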
