Improving Pixel-based MIM by Reducing Wasted Modeling Capability (2308.00261v1)

Published 1 Aug 2023 in cs.CV

Abstract: There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation. Code and models are available at https://github.com/open-mmlab/mmpretrain.

Summary

The paper demonstrates a novel multi-level feature fusion strategy that incorporates shallow-layer features to reduce high-frequency bias in pixel-based MIM.
It reports gains of 1.2% in fine-tuning, 2.8% in linear probing, and 2.6% in semantic segmentation when applied to a smaller model (ViT-S).
The approach flattens the loss landscape and balances feature learning, narrowing the gap between pixel-based methods and tokenizer-based frameworks.

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

This paper addresses the limitations of pixel-based Masked Image Modeling (MIM), a self-supervised learning (SSL) approach in computer vision. Pixel-based MIM, while computationally efficient, tends to focus excessively on high-frequency details due to its objective of reconstructing raw pixel values. The authors propose a novel method to mitigate this issue by incorporating multi-level feature fusion, enabling models to utilize low-level features from shallow layers to enhance pixel-based reconstruction tasks.
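To make the "reconstructing raw pixel values" objective concrete, here is a minimal sketch of an MAE-style loss: mean squared error computed only over masked patches. The function name `masked_pixel_loss` and the array shapes are illustrative assumptions; this summary does not spell out the exact loss, so treat it as a generic pixel-regression objective rather than the authors' code.

```python
import numpy as np

def masked_pixel_loss(pred, target, mask):
    """Mean squared error over masked patches only (assumed MAE-style objective).

    pred, target: arrays of shape (num_patches, patch_dim) holding pixel values.
    mask: array of shape (num_patches,), 1.0 for masked patches, 0.0 for visible ones.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)      # MSE inside each patch
    return float((per_patch * mask).sum() / mask.sum())   # average over masked patches

# Toy example: 4 patches of 3 pixel values each.
target = np.zeros((4, 3))
pred = np.zeros((4, 3))
pred[0] += 1.0  # error on a visible patch: ignored by the loss
pred[2] += 2.0  # error on a masked patch: counted by the loss
mask = np.array([0.0, 1.0, 1.0, 0.0])

loss = masked_pixel_loss(pred, target, mask)
print(loss)
```

Because every pixel in a masked patch contributes to the error, fine texture and edges are penalized just as much as coarse structure, which is one way to read the high-frequency bias described above.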

Methodology

The authors categorize existing MIM approaches into pixel-based and tokenizer-based frameworks. The former has a lower computational cost but is biased toward features that capture high-frequency components. Building on this observation, the paper introduces a multi-level feature fusion strategy that brings shallow-layer features into the pixel reconstruction task. Applied to the base method, MAE, with a standard Vision Transformer (ViT), this improves convergence and downstream performance.
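The fusion idea can be sketched as follows. This summary does not specify the fusion operator, so the sketch assumes one plausible form: each selected layer's token features pass through a per-layer linear projection and are then combined with softmax-normalized learnable weights. The names `fuse_layers`, `proj_mats`, and `logits` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def fuse_layers(layer_feats, proj_mats, logits):
    """Fuse token features from several encoder layers into one array.

    layer_feats: list of (num_tokens, dim) arrays, one per selected layer.
    proj_mats:   list of (dim, dim) projection matrices, one per layer (assumed).
    logits:      (num_layers,) scalars, softmax-normalized into fusion weights (assumed).
    """
    weights = softmax(np.asarray(logits, dtype=float))
    projected = [f @ p for f, p in zip(layer_feats, proj_mats)]
    return sum(w * f for w, f in zip(weights, projected))

# Toy example: 3 layers, 5 tokens, 8 channels.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(5, 8)) for _ in range(3)]
projs = [np.eye(8) for _ in range(3)]  # identity projections for illustration
fused = fuse_layers(feats, projs, logits=[0.0, 0.0, 0.0])
print(fused.shape)
```

With equal logits and identity projections the result is the plain average of the layer features. In a trained model the logits and projections would be learned, and the fused features would be passed to the decoder in place of the last-layer output alone.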

Experimental findings reveal that these modifications yield considerable performance gains, particularly in smaller architectures like ViT-S. Notable improvements were observed in fine-tuning (1.2%), linear probing (2.8%), and semantic segmentation (2.6%), showcasing the method's efficacy in various downstream tasks.

Key Contributions and Experiments

The paper's core contributions include:

Empirical Analysis: Showing through empirical studies that pixel-based MIM methods focus disproportionately on high-frequency components, which motivates the corrective strategy.
Fusion Strategy Implementation: Introducing a multi-level feature fusion technique that feeds shallow-layer features into pixel reconstruction, so that less of the model's capacity is spent on low-level detail.
Extensive Evaluation: Validating the method's effectiveness through comparisons with existing MIM strategies and testing robustness on out-of-distribution (OOD) datasets such as ImageNet-C and ImageNet-R.
Optimization Insights: Highlighting how the proposed solution flattens the loss landscape and modifies the frequency distribution in latent feature representations, resulting in more balanced and robust feature learning.
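One simple way to quantify a high-frequency bias is to measure what share of a signal's spectral energy lies above a frequency cutoff. The sketch below is a generic diagnostic built on NumPy's FFT, not the paper's analysis procedure; the function name `high_freq_energy_ratio` and the cutoff of 0.25 cycles per pixel are assumptions chosen for illustration.

```python
import numpy as np

def high_freq_energy_ratio(img, cutoff=0.25):
    """Fraction of spectral energy above a radial frequency cutoff.

    img:    2D array (an image channel or a feature map).
    cutoff: radial frequency threshold in cycles per pixel (assumed value).
    """
    spectrum = np.abs(np.fft.fft2(img)) ** 2       # power spectrum
    fy = np.fft.fftfreq(img.shape[0])[:, None]     # vertical frequencies
    fx = np.fft.fftfreq(img.shape[1])[None, :]     # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)            # radial frequency of each bin
    return float(spectrum[radius > cutoff].sum() / spectrum.sum())

x = np.arange(16)
# Smooth pattern: one sine period across the width (frequency 1/16).
smooth = np.sin(2 * np.pi * x / 16)[None, :] * np.ones((16, 1))
# Checkerboard: alternates every pixel, the highest representable frequency.
checker = np.where((x[:, None] + x[None, :]) % 2 == 0, 1.0, -1.0)

smooth_ratio = high_freq_energy_ratio(smooth)
checker_ratio = high_freq_energy_ratio(checker)
print(smooth_ratio, checker_ratio)
```

Applied to feature maps from models trained with and without fusion, a ratio like this would give a rough indication of how the frequency distribution shifts.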

Implications and Future Directions

Reducing wasted modeling capacity through multi-level feature fusion not only enhances pixel-based MIM's performance but also narrows the gap between pixel-based approaches and those that rely on pre-trained tokenizers. This has practical significance: pixel-based pipelines are simpler and cheaper to run, so closing the gap could lower computational demands while improving model robustness.

Theoretically, this work extends the understanding of feature-level integration in SSL, positioning it as a fundamental aspect of improving pixel-based methodologies. It encourages further exploration into architectural adjustments that can capitalize on readily available image features, thus broadening the scope and application of MIM frameworks.

Future research might refine how beneficial features are selected across layers, or carry these insights over to other MIM models and architectures. Such work could extend the applicability of SSL to more diverse and complex visual tasks while keeping pre-training accessible and efficient.