What Do Self-Supervised Vision Transformers Learn?

Published 1 May 2023 in cs.CV, cs.AI, and cs.LG | (2305.00729v1)

Abstract: We present a comparative study on how and why contrastive learning (CL) and masked image modeling (MIM) differ in their representations and in their performance of downstream tasks. In particular, we demonstrate that self-supervised Vision Transformers (ViTs) have the following properties: (1) CL trains self-attentions to capture longer-range global patterns than MIM, such as the shape of an object, especially in the later layers of the ViT architecture. This CL property helps ViTs linearly separate images in their representation spaces. However, it also makes the self-attentions collapse into homogeneity for all query tokens and heads. Such homogeneity of self-attention reduces the diversity of representations, worsening scalability and dense prediction performance. (2) CL utilizes the low-frequency signals of the representations, but MIM utilizes high-frequencies. Since low- and high-frequency information respectively represent shapes and textures, CL is more shape-oriented and MIM more texture-oriented. (3) CL plays a crucial role in the later layers, while MIM mainly focuses on the early layers. Upon these analyses, we find that CL and MIM can complement each other and observe that even the simplest harmonization can help leverage the advantages of both methods. The code is available at https://github.com/naver-ai/cl-vs-mim.

Summary

The paper reveals that CL and MIM learn distinct representations, with CL capturing global shapes and MIM emphasizing local textures.
The study finds that CL mainly shapes the later ViT layers and favors linear probing, while MIM mainly shapes the early layers and favors fine-tuning and dense prediction.
Empirical evaluations demonstrate that combining CL and MIM can boost performance across diverse computer vision tasks.

Insights into Self-Supervised Vision Transformers: A Comparative Analysis of CL and MIM

This paper presents a detailed comparative analysis of two prominent self-supervised learning methods for Vision Transformers (ViTs): Contrastive Learning (CL) and Masked Image Modeling (MIM). The authors aim to explain how the two methods learn differently and why they perform differently on downstream tasks. The study examines three aspects in turn: the behavior of self-attention, the way representations are transformed, and which layers matter most for each method.

Methodological Distinctions

One of the core findings is that CL and MIM diverge fundamentally in the representations they learn. CL trains self-attention to capture longer-range, global patterns such as the shape of an object, especially in the later layers. This global perspective helps ViTs linearly separate images in representation space. However, it comes at a cost: the attention maps collapse into nearly the same pattern for all query tokens and heads. This homogeneity reduces the diversity of token representations, which worsens scalability and hurts tasks that require dense prediction.
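The two attention properties described above, range and homogeneity, can be quantified in several ways. The following is a minimal sketch, not the paper's own measurement code: `mean_attention_distance` and `query_homogeneity` are hypothetical helper names, the patch grid is assumed to be square with no class token, and cosine similarity between attention rows is an assumed stand-in for whatever similarity measure the authors use.

```python
import numpy as np

def mean_attention_distance(attn, grid):
    """Average spatial distance (in patch units) that attention spans.

    attn: array of shape (heads, tokens, tokens), each row summing to 1,
          where tokens == grid * grid (assumption: no class token).
    """
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([ys, xs], axis=1).astype(float)
    # Pairwise Euclidean distance between patch positions.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Expected distance per query, averaged over queries and heads.
    return float((attn * dist).sum(axis=-1).mean())

def query_homogeneity(attn):
    """Mean cosine similarity between attention rows of different queries.

    Values near 1 mean every query attends to almost the same keys.
    """
    rows = attn / np.linalg.norm(attn, axis=-1, keepdims=True)
    sim = rows @ rows.transpose(0, 2, 1)  # (heads, tokens, tokens)
    n = attn.shape[-1]
    off_diag = sim.sum(axis=(-1, -2)) - np.trace(sim, axis1=-2, axis2=-1)
    return float((off_diag / (n * (n - 1))).mean())
```

Under these assumptions, an attention map in which every query attends uniformly to all patches scores high on both measures, while a map in which each patch attends only to itself scores zero on both. The text above would predict that later CL layers sit closer to the first case than MIM layers do.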

In contrast, MIM learns more localized patterns by reconstructing masked input patches. It relies on the high-frequency components of the representations, which correlate with textures, whereas CL relies on the low-frequency components, which correlate with shapes. CL is therefore more shape-oriented and MIM more texture-oriented. In downstream performance, the study reports that MIM is stronger in fine-tuning and dense prediction, while CL is stronger in linear probing.
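To make the low- versus high-frequency distinction concrete, one can look at the 2D Fourier spectrum of a feature map and ask how much of its amplitude lies above a chosen frequency. The sketch below is illustrative only: `high_frequency_ratio` is a hypothetical helper, the `cutoff` of half the per-axis Nyquist frequency is an arbitrary assumption, and the paper's own Fourier analysis may be set up differently.

```python
import numpy as np

def high_frequency_ratio(feature_map, cutoff=0.5):
    """Share of spectral amplitude above `cutoff` (1.0 = per-axis Nyquist).

    feature_map: 2D array of shape (H, W), e.g. one channel of patch tokens
                 reshaped to the patch grid (assumption).
    """
    amp = np.abs(np.fft.fftshift(np.fft.fft2(feature_map)))
    h, w = feature_map.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    # fftfreq returns cycles per sample in [-0.5, 0.5); rescale so 1.0 is Nyquist.
    radius = np.sqrt(fy ** 2 + fx ** 2) / 0.5
    total = amp.sum()
    if total == 0:
        return 0.0  # an all-zero map has no spectral content
    return float(amp[radius > cutoff].sum() / total)
```

A smoothly varying map (shape-like structure) gives a ratio near zero, while a rapidly alternating pattern (texture-like structure) gives a ratio near one. Comparing this ratio between CL and MIM features would be one rough way to probe the shape/texture claim.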

Key Findings and Architectural Implications

The experiments indicate that CL and MIM act mainly at different depths of the ViT. CL plays its most important role in the later layers, where global features and whole-object structure matter more. MIM, conversely, concentrates on the early layers, where it captures low-level textures and local patterns. Because the two methods emphasize different layers, they are candidates for combination rather than strict alternatives, as the study goes on to demonstrate.

The study also explores hybrid training that combines the CL and MIM objectives. Even a simple linear combination of these seemingly opposing objectives yielded improved performance over either method alone. This suggests a promising avenue for models that leverage the strengths of both global and local feature learning.
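A linear combination of two objectives can be written in a few lines. The sketch below assumes a convex weighting with a single coefficient `lam`; the function name, the parameter name, and the choice of a convex form are illustrative assumptions, and the summary does not state which weights the authors used.

```python
def combined_loss(loss_mim, loss_cl, lam):
    """Convex combination of an MIM loss and a CL loss (assumed form).

    lam = 0 recovers pure MIM training, lam = 1 recovers pure CL training.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * loss_mim + lam * loss_cl
```

In a training loop, `loss_mim` and `loss_cl` would be the per-batch values of the two objectives computed on the same backbone, and `lam` would be a hyperparameter to tune.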

Numerical and Empirical Assessments

The paper supports its analysis with empirical results that show substantial differences in performance across tasks and model sizes. CL achieves superior linear probing accuracy, particularly with small models, while MIM excels in fine-tuning, in scaling to large models, and in dense prediction tasks. These observations are further corroborated by evaluations on standard benchmarks such as ImageNet and across various dataset configurations.
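Linear probing, the evaluation in which CL is reported to do best, trains only a linear classifier on top of frozen features. The sketch below shows the idea with a closed-form ridge-regression classifier in NumPy. This is a simplification chosen for brevity: `linear_probe` and its `ridge` parameter are hypothetical, and standard linear probing protocols typically train a softmax classifier with gradient descent instead.

```python
import numpy as np

def linear_probe(train_x, train_y, test_x, num_classes, ridge=1e-3):
    """Fit a linear classifier on frozen features and predict test labels.

    train_x, test_x: arrays of shape (n, d) holding features from a frozen
                     backbone (assumption: already extracted).
    train_y:         integer labels of shape (n,) in [0, num_classes).
    """
    # Append a constant column so the classifier has a bias term.
    x = np.hstack([train_x, np.ones((len(train_x), 1))])
    targets = np.eye(num_classes)[train_y]  # one-hot regression targets
    # Ridge-regularized least squares, solved in closed form.
    w = np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ targets)
    xt = np.hstack([test_x, np.ones((len(test_x), 1))])
    return (xt @ w).argmax(axis=1)
```

Because the backbone is never updated, probe accuracy reflects how linearly separable the frozen representation already is, which is the property the study attributes to CL. Fine-tuning, by contrast, updates the backbone weights as well.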

Future Prospects

The insights gained suggest several directions for future research. One is the development of self-supervised learning methods that integrate the strengths of CL and MIM differently at different layers of a ViT. Another is the adaptation of these findings to multi-stage ViTs and other more complex architectures. Additionally, strengthening individual properties of CL and MIM to enhance shape or texture recognition could lead to further performance gains.

Conclusion

In conclusion, this comparative study sheds light on the inherently different yet complementary learning frameworks of CL and MIM in self-supervised ViTs. The findings have significant implications for both theoretical understanding and practical applications in the field of computer vision, suggesting that the integration of these methods could lead to more versatile and capable vision models.
