Rethinking Patch Dependence for Masked Autoencoders (2401.14391v2)
Abstract: In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
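To make the architectural idea in the abstract concrete, below is a minimal sketch of a cross-attention-only decoder block in the spirit of CrossMAE. This is not the authors' released implementation: the module and parameter names (`CrossAttentionDecoderBlock`, `embed_dim`, `num_heads`, `mlp_ratio`) are illustrative assumptions. The key property it demonstrates is that mask-token queries attend only to encoder outputs for visible patches and never to each other, so each masked patch is reconstructed independently.

```python
# Minimal sketch (not the authors' code) of a cross-attention-only decoder
# block, as described in the CrossMAE abstract. All names here are
# illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionDecoderBlock(nn.Module):
    """Mask-token queries cross-attend to visible-token keys/values.

    There is no self-attention among mask tokens, so each masked patch is
    read out independently from the encoder's output.
    """

    def __init__(self, embed_dim: int = 512, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm_q = nn.LayerNorm(embed_dim)
        self.norm_kv = nn.LayerNorm(embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_ratio * embed_dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * embed_dim, embed_dim),
        )

    def forward(self, mask_queries: torch.Tensor, visible_tokens: torch.Tensor) -> torch.Tensor:
        # Queries: mask tokens (with positional information); keys/values:
        # encoder outputs for the visible patches.
        q = self.norm_q(mask_queries)
        kv = self.norm_kv(visible_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv, need_weights=False)
        x = mask_queries + attn_out
        x = x + self.mlp(self.norm_mlp(x))
        return x


if __name__ == "__main__":
    B, N_vis, N_mask, D = 2, 49, 25, 512  # decode only a subset of masked patches
    block = CrossAttentionDecoderBlock(embed_dim=D)
    visible = torch.randn(B, N_vis, D)   # encoder outputs for visible patches
    queries = torch.randn(B, N_mask, D)  # mask tokens + positional embeddings
    out = block(queries, visible)        # (B, N_mask, D), queries never interact
    print(out.shape)
```

Because the queries do not interact, the decoder can reconstruct only a small subset of masked patches per image (as the abstract states), which is where the reduction in decoder computation comes from.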