Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking (2303.05475v1)

Published 9 Mar 2023 in cs.CV

Abstract: Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training. However, MAE solely reconstructs the low-level RGB signals after the decoder and lacks supervision upon high-level semantics for the encoder, thus suffering from sub-optimal learned representations and long pre-training epochs. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with encoded features from pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE employs a mimic loss over the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss in MAE to predict RGB pixel values for the 75% masked tokens after the decoder. As MR-MAE applies high-level and low-level targets to different token partitions, the learning conflicts between them are naturally overcome, contributing to superior visual representations for various downstream tasks. On ImageNet-1K, an MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base by +0.3%. Code and pre-trained models will be released at https://github.com/Alpha-VL/ConvMAE.

Summary

  • The paper introduces the MR-MAE framework, which applies high-level feature mimicking before pixel reconstruction to enhance semantic guidance.
  • It employs a dual target approach by combining mimic loss for visible tokens and reconstruction loss for masked tokens to resolve training conflicts.
  • Results demonstrate significant performance improvements on ImageNet-1K and COCO, achieving faster convergence and superior accuracy compared to baseline models.

Enhancing Masked Autoencoders with Feature Mimicking: MR-MAE

The paper presents MR-MAE, a framework that enhances Masked Autoencoders (MAE) by applying high-level feature mimicking before pixel reconstruction. Masked Autoencoders have gained prominence for large-scale vision representation pre-training, yet they lack high-level semantic guidance for the encoder during pre-training. MR-MAE addresses this limitation by mimicking features from pre-trained models such as CLIP and DINO, enabling joint learning of high-level semantics and low-level textures without conflict.

Key Contributions

  1. Mimic Before Reconstruct: MR-MAE introduces a straightforward yet impactful strategy: a mimic loss is applied to the visible tokens output by the encoder, aligning their representations with high-level features from CLIP or DINO. Unlike the traditional MAE, which relies solely on pixel-level reconstruction, this injects semantic guidance from the start of pre-training.
  2. Dual-Target Approach: By applying a reconstruction loss to masked tokens and a mimic loss to visible ones, MR-MAE sidesteps the conflicts that arise when high-level and low-level training objectives are imposed on the same tokens (see the sketch after this list).
  3. Significant Performance Improvements: On ImageNet-1K, MR-MAE (base) reaches 85.8% top-1 fine-tuning accuracy after only 400 pre-training epochs, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base by +0.3%.
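
To make the dual-target setup concrete, here is a minimal sketch of the joint objective. The `encoder`, `decoder`, `teacher` (a frozen CLIP or DINO backbone), and `proj` (a head mapping encoder features to the teacher's dimension) modules are assumptions for illustration; the actual MR-MAE implementation may differ in details such as loss type and normalization:

```python
import torch
import torch.nn.functional as F

def mr_mae_loss(encoder, decoder, teacher, proj, images, patches,
                visible_idx, masked_idx):
    # Encode only the ~25% visible tokens, as in standard MAE.
    enc_tokens = encoder(images, visible_idx)           # (B, N_vis, D)

    # High-level target: mimic frozen teacher (CLIP/DINO) features
    # at the visible positions, taken directly from the encoder output.
    with torch.no_grad():
        teacher_feats = teacher(images)                 # (B, N, D_t)
    mimic_loss = F.mse_loss(proj(enc_tokens),
                            teacher_feats[:, visible_idx])

    # Low-level target: reconstruct raw RGB patches of the ~75%
    # masked tokens after the decoder, as in the original MAE.
    pred_pixels = decoder(enc_tokens, masked_idx)       # (B, N_mask, P)
    recon_loss = F.mse_loss(pred_pixels, patches[:, masked_idx])

    # The two targets act on disjoint token partitions,
    # so they do not interfere with each other.
    return mimic_loss + recon_loss
```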

Experimental Validation

The efficacy of MR-MAE is evidenced through extensive experiments in image classification and object detection:

  • ImageNet-1K Fine-tuning: With only 400 pre-training epochs, MR-MAE surpasses the accuracy of models that typically require 1600 epochs, demonstrating faster convergence and greater pre-training efficiency.
  • COCO Object Detection: Used as the backbone for Mask R-CNN, MR-MAE reaches 53.4 box AP with only 25 fine-tuning epochs, indicating strong transferability of the learned representations.

Technical Insights

MR-MAE incorporates several design refinements:

  • Focused Mimicking and Multi-layer Fusion: Rather than mimicking all visible tokens, the model selects the most salient ones for the mimic loss and fuses features from multiple encoder layers, reinforcing the encoder's ability to capture high-level semantics (a sketch follows this list).
  • Incorporation of Multi-scale Architectures: By adopting masked convolution stages, MR-MAE captures hierarchical representations, further strengthening performance on downstream tasks (also sketched below).
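
As a rough illustration of focused mimicking, the sketch below restricts the mimic loss to the top-k most salient visible tokens. The `saliency` scores (e.g., derived from the teacher's attention maps) and the `keep_ratio` are assumptions for illustration, not the paper's exact selection rule:

```python
import torch
import torch.nn.functional as F

def focused_mimic_loss(student_feats, teacher_feats, saliency, keep_ratio=0.5):
    # student_feats, teacher_feats: (B, N_vis, D), already projected to a
    # common dimension; saliency: (B, N_vis), higher = more salient.
    num_keep = max(1, int(student_feats.shape[1] * keep_ratio))
    # Rank visible tokens by saliency and keep the top-k per image.
    top_idx = saliency.topk(num_keep, dim=1).indices            # (B, k)
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, student_feats.shape[-1])
    selected_student = torch.gather(student_feats, 1, gather_idx)
    selected_teacher = torch.gather(teacher_feats, 1, gather_idx)
    # Apply the mimic loss only on the selected salient tokens.
    return F.mse_loss(selected_student, selected_teacher)
```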
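
For the masked convolution stages (inherited from ConvMAE, the codebase MR-MAE builds on), a minimal sketch of the core idea: zero out masked positions around each convolution so no information leaks from masked patches into visible ones. The block structure and layer choices here are assumptions, not the paper's exact architecture:

```python
import torch.nn as nn

class MaskedConvBlock(nn.Module):
    """Illustrative masked convolution stage: features at masked
    positions are zeroed before and after the convolution."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x, mask):
        # x: (B, C, H, W) feature map; mask: (B, 1, H, W), 1 = visible.
        x = x * mask                          # hide masked positions
        x = self.act(self.norm(self.conv(x)))
        return x * mask                       # re-mask after the convolution
```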

Implications and Future Directions

The approach exemplifies how high-level feature integration can significantly elevate the performance of generative pre-training models. MR-MAE’s framework paves the way for efficient scaling, reduced pre-training times, and the potential for even broader applications in nuanced visual tasks. It invites future exploration into more sophisticated integration of diverse high-level features, potentially leveraging multiple pre-trained models to offer a richer semantic landscape.

Conclusion

MR-MAE represents a meaningful advancement in the domain of vision transformers, effectively merging low-level and high-level training targets to derive a more comprehensive understanding of visual data. By harnessing pre-existing high-level information encoded in models like CLIP and DINO, MR-MAE sets a precedent for future innovations in feature distillation and representation learning.
