Rethinking Patch Dependence for Masked Autoencoders

(arXiv:2401.14391)
Published Jan 25, 2024 in cs.CV

Abstract

In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask patches is not essential for learning good representations. To this end, we propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). CrossMAE's decoder leverages only cross-attention between masked and visible tokens, with no degradation in downstream performance. This design also enables decoding only a small subset of mask tokens, boosting efficiency. Furthermore, each decoder block can now leverage different encoder features, resulting in improved representation learning. CrossMAE matches MAE in performance with 2.5 to 3.7$\times$ less decoding compute. It also surpasses MAE on ImageNet classification and COCO instance segmentation under the same compute. Code and models: https://crossmae.github.io

CrossMAE: a masked autoencoder variant that replaces self-attention among masked patches with cross-attention to visible tokens, improving decoding efficiency.

Overview

  • The paper introduces CrossMAE, a new framework for masked autoencoders in vision tasks that focuses on decoding efficiency and representation quality.

  • CrossMAE questions the role of self-attention among masked patches, suggesting that cross-attention to visible tokens may be sufficient.

  • By leveraging only cross-attention, CrossMAE saves computational resources and shows no decrease in performance on downstream tasks.

  • Empirical tests show CrossMAE matches or exceeds existing MAE models in performance while reducing decoding computation by 2.5 to 3.7 times.

  • The findings suggest potential for scaling and efficiency improvements in pre-trained vision models, encouraging a re-evaluation of self-attention's role.

Introduction

Masked Autoencoders (MAE) have gained prominence in self-supervised learning for computer vision, offering efficient pre-training of large-scale models. The MAE decoder traditionally applies multi-headed self-attention over the full token set, so both visible and masked tokens exchange information. However, the empirical evidence presented in this paper questions the necessity of that design, specifically the role of self-attention among the masked patches. CrossMAE critically analyzes and restructures the decoding mechanism of MAE, with a focus on efficiency and representation quality.

Analysis of Mask Tokens

The core inquiry starts with the attention mechanism applied in the decoder of MAE, distinguishing between self-attention among masked tokens and cross-attention where masked tokens attend to visible ones. CrossMAE's preliminary results indicate that masked tokens disproportionately attend to visible tokens rather than other masked tokens, suggesting that the self-attention among masked patches may not contribute significantly to the quality of the learned representation. This observation raises important questions: is self-attention among mask tokens necessary for effective representation learning, and can decoders be designed to reconstruct only a partial set of masked patches without diminishing performance?
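
The attention-mass analysis behind this observation can be illustrated with a small diagnostic. The sketch below, assuming access to a single decoder self-attention weight matrix `attn` and a boolean `is_masked` indicator (both names are illustrative, not the paper's code), splits the attention of mask-token queries into the share spent on visible keys versus the share spent on other mask keys; the paper's actual analysis presumably aggregates over heads, layers, and images.

```python
import torch

def mask_token_attention_split(attn: torch.Tensor, is_masked: torch.Tensor):
    """Split the attention mass of mask-token queries into the share placed
    on visible keys vs. the share placed on other mask keys.

    attn:      (num_tokens, num_tokens) row-stochastic attention weights
               from one decoder self-attention head (rows = queries).
    is_masked: (num_tokens,) boolean, True where the token is a mask token.
    """
    mask_queries = attn[is_masked]                         # rows for mask-token queries
    to_visible = mask_queries[:, ~is_masked].sum(dim=-1)   # mass spent on visible keys
    to_masked = mask_queries[:, is_masked].sum(dim=-1)     # mass spent on other mask keys
    return to_visible.mean().item(), to_masked.mean().item()

# Toy example: 196 patches, ~75% masked, random softmax-normalized attention.
num_tokens = 196
attn = torch.softmax(torch.randn(num_tokens, num_tokens), dim=-1)
is_masked = torch.rand(num_tokens) < 0.75
vis_share, mask_share = mask_token_attention_split(attn, is_masked)
print(f"to visible: {vis_share:.3f}, to masked: {mask_share:.3f}")
```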

CrossMAE Design

CrossMAE proposes a framework in which masked patches are reconstructed using only cross-attention to visible tokens, removing the need for masked tokens to attend to one another. This change substantially reduces the computational cost of decoding, and empirically no decrease in downstream task performance is observed. Because each masked token is decoded independently, CrossMAE also enables partial reconstruction: only a subset of masked patches needs to be reconstructed during pre-training. Moreover, each decoder block can draw on features from a different encoder block, in contrast to MAE's use of only the final encoder feature map for every decoder block. Together, these design choices yield substantial gains in decoding efficiency and suggest an enhanced capacity for representation learning.
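
To make the design concrete, here is a minimal sketch of a cross-attention decoder block in PyTorch. It is an illustration under common ViT conventions (pre-norm residual blocks, `nn.MultiheadAttention`), not the paper's implementation; the dimensions, names, and the choice of which encoder layer feeds each block are assumptions. The key property is that mask-token queries attend only to visible-token features, so any subset of masked patches can be decoded.

```python
import torch
import torch.nn as nn

class CrossAttentionDecoderBlock(nn.Module):
    """One decoder block in which mask-token queries attend only to
    visible-token features (cross-attention); mask tokens never attend
    to each other, so each masked patch can be decoded independently."""

    def __init__(self, dim: int = 512, num_heads: int = 16, mlp_ratio: int = 4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, mask_queries: torch.Tensor, visible_feats: torch.Tensor):
        # mask_queries:  (B, num_decoded_masks, dim)  positional queries for the
        #                subset of masked patches chosen for reconstruction
        # visible_feats: (B, num_visible, dim)        encoder features; each block
        #                may receive features from a different encoder layer
        q = self.norm_q(mask_queries)
        kv = self.norm_kv(visible_feats)
        attn_out, _ = self.cross_attn(q, kv, kv, need_weights=False)
        x = mask_queries + attn_out
        x = x + self.mlp(self.norm_mlp(x))
        return x

# Partial reconstruction: decode only 25% of the masked patches.
B, n_masked, n_visible, dim = 2, 147, 49, 512
block = CrossAttentionDecoderBlock(dim)
queries = torch.randn(B, n_masked, dim)[:, : n_masked // 4]   # subset of mask queries
visible = torch.randn(B, n_visible, dim)                      # visible-token features
out = block(queries, visible)                                 # shape (2, 36, 512)
```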

Empirical Validation and Efficiency Gains

The CrossMAE framework is validated empirically: it matches or surpasses MAE in downstream performance while reducing decoding computation by a factor of 2.5 to 3.7. On ImageNet classification and COCO instance segmentation, CrossMAE outperforms MAE under the same compute budget. Models trained with CrossMAE also scale favorably, suggesting that the disentangled decoder design leaves room for further efficiency and scalability gains in representation learning.
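
As a rough, back-of-the-envelope illustration of where the savings come from (not a reproduction of the reported 2.5 to 3.7 factor, which measures full decoder compute under the paper's settings), consider a ViT-B/16 at 224x224 with 75% masking: a self-attention decoder touches all 196 tokens, while a cross-attention decoder that reconstructs only a quarter of the masked patches processes far fewer query-key pairs and far fewer tokens in its MLPs. The patch counts below are standard for this configuration; the 25% prediction ratio is an assumed example.

```python
# Back-of-the-envelope token counts for ViT-B/16 at 224x224 (196 patches, 75% masking).
# Illustrative only: attention query-key pairs per layer, not total decoder FLOPs.

num_patches = 196
num_visible = num_patches // 4          # 49 visible tokens
num_masked = num_patches - num_visible  # 147 mask tokens

# MAE decoder: self-attention over all tokens (visible + masked).
mae_pairs = num_patches ** 2            # 38,416 query-key pairs per layer

# CrossMAE decoder: a subset of mask queries cross-attends to visible keys only.
decoded = num_masked // 4               # e.g. reconstruct 25% of masked patches
crossmae_pairs = decoded * num_visible  # 36 * 49 = 1,764 pairs per layer

print(f"attention pairs per layer: MAE={mae_pairs}, CrossMAE={crossmae_pairs}")
print(f"{mae_pairs / crossmae_pairs:.1f}x fewer attention pairs")
print(f"tokens through decoder MLPs: MAE={num_patches}, CrossMAE={decoded}")
```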

Conclusion

The CrossMAE study presents compelling evidence that the decoding process of masked autoencoders can be made more computationally efficient without compromising the representational ability of the model. The findings encourage revisiting the role of self-attention in vision pre-training. Because the decoder no longer scales with the full set of masked tokens, the design also lends itself to longer input sequences, broadening the range of tasks and datasets that can be processed effectively. CrossMAE thus makes efficient pre-training of visual representations not only desirable but practically achievable.
