Rethinking Patch Dependence for Masked Autoencoders (2401.14391v2)
Abstract: In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io
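To make the decoder design concrete, the sketch below illustrates the cross-attention-only readout described in the abstract. It is a minimal, assumption-laden illustration rather than the released CrossMAE code: the class and parameter names (CrossAttentionDecoderBlock, num_heads, the example tensor sizes) are hypothetical, and a full model would stack several such blocks and use the positional embeddings of the sampled masked patches as queries.

```python
# Minimal sketch, assuming a PyTorch-style setup; names and sizes are illustrative,
# not the authors' released implementation.
import torch
import torch.nn as nn


class CrossAttentionDecoderBlock(nn.Module):
    """Decoder block with cross-attention only: mask-token queries attend to
    visible (encoder-output) tokens; there is no self-attention among mask
    tokens, so each masked patch is reconstructed independently."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, mask_queries: torch.Tensor, visible_tokens: torch.Tensor) -> torch.Tensor:
        # Queries: embeddings for the (sampled) masked patches.
        # Keys/values: encoder outputs for the visible patches.
        q = self.norm_q(mask_queries)
        kv = self.norm_kv(visible_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv)
        x = mask_queries + attn_out
        x = x + self.mlp(self.norm_mlp(x))
        return x


# Usage sketch: decode only a small subset of masked patches (sizes are made up).
B, N_vis, N_pred, D, patch_pixels = 2, 49, 16, 512, 16 * 16 * 3
visible_tokens = torch.randn(B, N_vis, D)   # encoder outputs for visible patches
mask_queries = torch.randn(B, N_pred, D)    # queries for the sampled masked patches
block = CrossAttentionDecoderBlock(D)
pred = nn.Linear(D, patch_pixels)(block(mask_queries, visible_tokens))  # per-patch pixel predictions
```

Because the mask queries never attend to one another in this sketch, each masked patch's reconstruction depends only on the encoder's visible-token representation, which is the property the abstract argues makes mask-token self-attention unnecessary.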