
Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? (2201.05119v2)

Published 13 Jan 2022 in cs.CV, cs.LG, and stat.ML

Abstract: Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from ReLIC [Mitrovic et al., 2021], we include additional inductive biases into self-supervised learning. We propose a new self-supervised representation learning method, ReLICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views to avoid learning spurious correlations and obtain more informative representations. ReLICv2 achieves $77.1\%$ top-$1$ accuracy on ImageNet under linear evaluation on a ResNet50, thus improving the previous state-of-the-art by absolute $+1.5\%$; on larger ResNet models, ReLICv2 achieves up to $80.6\%$ outperforming previous self-supervised approaches with margins up to $+2.3\%$. Most notably, ReLICv2 is the first unsupervised representation learning method to consistently outperform the supervised baseline in a like-for-like comparison over a range of ResNet architectures. Using ReLICv2, we also learn more robust and transferable representations that generalize better out-of-distribution than previous work, both on image classification and semantic segmentation. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.

Citations (75)

Summary

  • The paper introduces ReLICv2, a self-supervised framework built on ReLIC that achieves up to 80.6% top-1 accuracy on ImageNet with ResNets.
  • It employs saliency masking and multiple image views of varying sizes to enhance robustness and avoid spurious correlations.
  • The study demonstrates that self-supervised learning can consistently exceed supervised baselines, opening avenues for scalable vision applications.

Evaluation of Self-Supervised ResNets on ImageNet

In the field of computer vision, representation learning without reliance on labeled data is increasingly important, especially for tasks where labels are scarce or expensive. This paper introduces ReLICv2, a self-supervised technique building on ReLIC [Mitrovic et al., 2021], applied to residual networks (ResNets) for learning image representations on the ImageNet dataset. The fundamental question at the heart of this research is whether self-supervised methods can surpass supervised learning performance on ImageNet.

Methodology: Saliency Masking and Multi-View Learning

The method builds on prior theoretical insights from ReLIC, introducing an explicit invariance loss combined with a contrastive objective across varied data views. This strategy helps avoid learning spurious correlations and improves the informativeness of representations. Saliency masking forms a core part of the data augmentation process, which helps focus learning on foreground features, thereby enhancing robustness to background changes (Figure 1).

Figure 1: ReLICv2 employs saliency masking and views of varying sizes to enforce invariance and avoid spurious correlations.
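To make the objective concrete, the following is a minimal numpy sketch of a ReLIC-style loss: a contrastive (InfoNCE-like) term over paired view embeddings plus an explicit invariance penalty that pushes the similarity distributions produced from the two views to agree. The function name, the symmetrised-KL form of the invariance term, and the `alpha` weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relic_style_loss(z1, z2, temperature=0.1, alpha=1.0):
    """Sketch of a ReLIC-style objective: a contrastive term plus an
    explicit invariance penalty (KL between the two views' similarity
    distributions). Illustrative only; not the paper's exact code."""
    # L2-normalise the embeddings of the two augmented views.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits12 = z1 @ z2.T / temperature  # (N, N) cross-view similarities
    logits21 = z2 @ z1.T / temperature
    n = z1.shape[0]
    p12 = softmax(logits12)
    p21 = softmax(logits21)
    # Contrastive term: matching indices are the positive pairs.
    contrastive = -np.mean(np.log(p12[np.arange(n), np.arange(n)] + 1e-12))
    # Invariance term: similarity distributions from either view should agree.
    kl = np.sum(p12 * (np.log(p12 + 1e-12) - np.log(p21 + 1e-12)), axis=1)
    return contrastive + alpha * np.mean(kl)
```

When the two views are encoded identically, the invariance term vanishes and only the contrastive term remains, which is the intended behaviour of an invariance penalty.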

The method operates on multiple image views of different sizes, in contrast to previous approaches that use a single standardized view size. By leveraging smaller views alongside large ones, it learns features that remain robust even under partial occlusion, a common real-world issue.
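The multi-size view scheme can be sketched as a simple multi-crop generator: a few large crops plus several smaller ones taken from the same image. The crop sizes and counts below are placeholder values for illustration, not the paper's actual augmentation settings.

```python
import numpy as np

def make_views(image, n_large=2, n_small=4, large=160, small=96, rng=None):
    """Illustrative multi-crop view generation: a few large random crops
    plus several smaller ones, mimicking the varied view sizes described
    above. Sizes and counts are assumptions, not the paper's settings."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    views = []
    for size, count in ((large, n_large), (small, n_small)):
        for _ in range(count):
            # Sample the top-left corner of a size x size crop.
            y = rng.integers(0, h - size + 1)
            x = rng.integers(0, w - size + 1)
            views.append(image[y:y + size, x:x + size])
    return views
```

The small crops see only fragments of the object, so representations that match them to the large crops must survive partial occlusion.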

Performance on ImageNet

Remarkably, ReLICv2 achieves up to 80.6% top-1 accuracy on ImageNet under linear evaluation with large ResNet models, and 77.1% on a ResNet50, an absolute +1.5% improvement over the previous state of the art. It is the first unsupervised representation learning method to consistently outperform the supervised baseline in a like-for-like comparison across a range of ResNet architectures (Figure 2).

Figure 2: Transfer performance of ReLICv2 significantly exceeds supervised baselines, showcasing the method's robustness and adaptability.
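The accuracies above are measured with the standard linear-evaluation protocol: the encoder is frozen and only a linear classifier is trained on its features. A minimal sketch of that protocol, using a closed-form least-squares fit in place of the SGD-trained logistic regression used in practice:

```python
import numpy as np

def linear_probe_accuracy(train_feats, train_labels,
                          test_feats, test_labels, n_classes):
    """Minimal sketch of linear evaluation: the encoder is frozen and only
    a linear classifier is fit on its features. Here the fit is a
    least-squares regression onto one-hot targets; real evaluations train
    a logistic-regression head with SGD, so this is illustrative only."""
    onehot = np.eye(n_classes)[train_labels]
    # Fit a linear map from frozen features to one-hot class targets.
    W, *_ = np.linalg.lstsq(train_feats, onehot, rcond=None)
    preds = np.argmax(test_feats @ W, axis=1)
    return float(np.mean(preds == test_labels))
```

Because the classifier is linear, the score directly reflects how linearly separable the frozen representations are, which is what makes it a fair like-for-like comparison across pretraining methods.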

Implications and Future Directions

From a theoretical standpoint, ReLICv2 demonstrates that self-supervised approaches can exceed conventional supervised learning results. Practically, this opens avenues for deploying self-supervised models in performance-critical vision tasks without the cost of acquiring vast labeled datasets.

Additionally, the transferability and robustness of the learned representations suggest a potential shift in how foundational models are trained—transforming tasks such as semantic segmentation and out-of-distribution generalization (Figure 3).

Figure 3: Unsupervised saliency masks effectively isolate semantically relevant portions of images, enhancing focus on critical features.
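Applying such a mask during augmentation is mechanically simple: keep the pixels the mask marks as foreground and replace the background with a fill value. The function below is an illustrative sketch; it assumes a binary (H, W) mask from any unsupervised saliency method, and the paper may combine masking with augmentation differently.

```python
import numpy as np

def apply_saliency_mask(image, mask, fill=0.0):
    """Sketch of saliency masking: keep foreground pixels and replace the
    background with a fill value, so augmented views focus on the object
    rather than its background. `mask` is assumed to be a binary (H, W)
    array produced by an unsupervised saliency method."""
    # Broadcast the (H, W) mask over the channel dimension.
    mask3 = mask[..., None].astype(image.dtype)
    return image * mask3 + fill * (1.0 - mask3)
```

Training on masked views discourages the encoder from latching onto background cues, which is one way spurious correlations are avoided.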

Conclusion

The paper exemplifies critical advances in self-supervised learning: ReLICv2 paves the way for representation learning capable of outperforming traditional supervised training. Future research will likely refine these techniques further, possibly integrating them with emerging architectures such as Vision Transformers, broadening the applicability and performance of self-supervised systems across diverse domains.
