A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends (2301.05712v4)

Published 13 Jan 2023 in cs.LG

Abstract: Deep supervised learning algorithms typically require a large volume of labeled data to achieve satisfactory performance. However, the process of collecting and labeling such data can be expensive and time-consuming. Self-supervised learning (SSL), a subset of unsupervised learning, aims to learn discriminative features from unlabeled data without relying on human-annotated labels. SSL has garnered significant attention recently, leading to the development of numerous related algorithms. However, there is a dearth of comprehensive studies that elucidate the connections and evolution of different SSL variants. This paper presents a review of diverse SSL methods, encompassing algorithmic aspects, application domains, three key trends, and open research questions. Firstly, we provide a detailed introduction to the motivations behind most SSL algorithms and compare their commonalities and differences. Secondly, we explore representative applications of SSL in domains such as image processing, computer vision, and natural language processing. Lastly, we discuss the three primary trends observed in SSL research and highlight the open questions that remain. A curated collection of valuable resources can be accessed at https://github.com/guijiejie/SSL.

Citations (51)

Summary

  • The paper presents a comprehensive taxonomy of SSL methods, categorizing context-based, contrastive, and generative approaches.
  • It demonstrates that contrastive methods yield high linear probe performance while masked image modeling excels in transfer tasks.
  • The survey identifies open challenges including theoretical frameworks, automatic pretext optimization, and unified multimodal SSL.

Introduction: Motivation and Context of Self-supervised Learning

Self-supervised learning (SSL) is presented as a robust alternative to supervised learning, motivated by the enormous cost and impracticality of obtaining large labeled datasets in real-world scenarios. SSL leverages pretext tasks—auxiliary objectives constructed from unlabeled data—to enable representation learning without manual annotations. This framework exploits the structural properties and relationships inherent in the data to produce pseudo-labels for pre-training, with the learned representations subsequently fine-tuned for downstream tasks.

Figure 1: The overall framework of SSL.
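
The two-stage workflow in Figure 1 (pretext pre-training on pseudo-labels, then downstream fine-tuning) can be summarized in a short sketch. The snippet below is a minimal, illustrative PyTorch example, not code from the survey; the tiny encoder, the flip-prediction pretext, and all hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Toy placeholders: any backbone and any pretext task can be slotted in here.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
pretext_head = nn.Linear(256, 2)      # head for the pretext pseudo-labels
downstream_head = nn.Linear(256, 10)  # head for a 10-class downstream task

def make_pretext_batch(images):
    """Derive inputs and pseudo-labels from the data itself: a toy flip-prediction
    pretext where the label is 1 if the image was horizontally flipped."""
    flip = torch.rand(images.size(0)) < 0.5
    flipped = torch.where(flip.view(-1, 1, 1, 1), images.flip(dims=(3,)), images)
    return flipped, flip.long()

# Stage 1: self-supervised pre-training on unlabeled data.
opt = torch.optim.Adam(list(encoder.parameters()) + list(pretext_head.parameters()), lr=1e-3)
for _ in range(10):                              # toy loop over unlabeled batches
    x, pseudo_y = make_pretext_batch(torch.randn(8, 3, 32, 32))
    loss = nn.functional.cross_entropy(pretext_head(encoder(x)), pseudo_y)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tune the encoder and a new head on the labeled downstream task.
opt = torch.optim.Adam(list(encoder.parameters()) + list(downstream_head.parameters()), lr=1e-4)
images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(downstream_head(encoder(images)), labels)
opt.zero_grad(); loss.backward(); opt.step()
```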

The rapid growth in SSL research output underscores the field's centrality to progress in deep learning. The number of related publications has accelerated sharply, reflecting the community's prioritization of SSL-based techniques for fundamental advances in both computer vision (CV) and natural language processing (NLP).

Figure 2: Number of SSL publications per year indicates rapid growth and continued interest in the field.

Taxonomy and Key Algorithms in Self-supervised Learning

The survey categorizes SSL approaches based on the design of pretext tasks and the mathematical properties of their learning signals. The principal methodologies are as follows:

  • Context-based Methods: These models exploit local or global structural correlations (e.g., spatial, color, or patch relationships) within data.
  • Contrastive Learning (CL): These methods learn invariant representations by maximizing agreement between differently augmented views of the same instance, typically via a contrastive objective such as the InfoNCE loss, with approaches grounded in both negative and positive sample mining (a minimal InfoNCE sketch follows Figure 3 below).
  • Generative Methods (e.g., Masked Image Modeling, MIM): These reconstruct occluded or masked parts of the input, motivated by pretext tasks like inpainting or denoising, with the masking serving as the self-supervised signal.
  • Contrastive Generative Methods: These hybridize contrastive and generative principles, aiming to benefit from the strengths of both paradigms, particularly for scaling and data-efficient learning.

    Figure 3: Schematic showing the distinctions among supervised, unsupervised, and self-supervised learning.
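
As a concrete illustration of the contrastive objective referenced in the list above, the snippet below is a minimal NT-Xent/InfoNCE sketch in PyTorch. The temperature value, batch layout, and embedding dimensions are illustrative assumptions rather than settings prescribed by the survey.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """NT-Xent / InfoNCE over a batch of paired views (minimal sketch).

    z1, z2: (N, D) embeddings of two augmented views of the same N instances.
    Each sample's positive is its counterpart in the other view; the remaining
    2N - 2 embeddings in the batch act as negatives.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, D), unit norm
    sim = z @ z.t() / temperature                           # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))                   # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                    # -log softmax of the positives

# Toy usage with random embeddings standing in for encoder outputs.
loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))
```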

Context-based Pretext Tasks

Canonical context-based tasks include geometric prediction (rotation), spatial arrangement (jigsaw puzzles), and channel reconstruction (colorization). Each task enforces unique constraints on the learned representations, promoting invariance to specific data transformations or dependencies.

Figure 4: Examples of context-based pretext tasks: rotation (geometric context), jigsaw (spatial context), and colorization (channel context).
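
To make the pseudo-label construction concrete, the sketch below builds a rotation-prediction batch in the spirit of the rotation pretext task; it is an illustrative implementation, not code from the survey or the original RotNet work.

```python
import torch

def rotation_pretext_batch(images):
    """Build a 4-way rotation-prediction batch from unlabeled images.

    images: (N, C, H, W). Each image is rotated by 0/90/180/270 degrees and the
    rotation index serves as the pseudo-label, so no human annotation is needed.
    """
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    inputs = torch.cat(rotated, dim=0)                          # (4N, C, H, W)
    labels = torch.arange(4).repeat_interleave(images.size(0))  # (4N,) pseudo-labels
    return inputs, labels

# Toy usage: 8 unlabeled 3x32x32 images yield 32 rotated inputs plus pseudo-labels.
x, y = rotation_pretext_batch(torch.randn(8, 3, 32, 32))
```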

Contrastive Learning (CL) Approaches

Contrastive learning, epitomized by MoCo and SimCLR, relies on instance discrimination through augmentations, maintaining a queue or large batch for negative samples. Recent methods, including BYOL and SimSiam, demonstrate that strong representation learning is possible even in the absence of explicit negative pairs, by leveraging projection heads, stop-gradient tricks, and architectural symmetries.

Figure 5: Taxonomy of CL methods: standard negative mining (left), self-distillation (center), and feature decorrelation (right).

Figure 6: Key Siamese architectures in CL: encoder cooperation enables diverse negative/positive strategies.
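
The negative-free methods mentioned above hinge on a predictor head and a stop-gradient. The snippet below is a minimal SimSiam-style loss sketch; the toy encoder and predictor are placeholder modules, not the original architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))   # toy backbone + projector
predictor = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 256))

def simsiam_loss(x1, x2):
    """Negative-free symmetric loss: predict one view's embedding from the other.

    The stop-gradient (detach) on the target branch is what is reported to prevent
    the trivial collapsed solution in BYOL/SimSiam-style training.
    """
    z1, z2 = encoder(x1), encoder(x2)
    p1, p2 = predictor(z1), predictor(z2)
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()  # stop-gradient on z
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)

# Toy usage: two augmented views of the same batch (random tensors stand in here).
loss = simsiam_loss(torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32))
```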

Masked Image Modeling (MIM) and Generative Algorithms

Recent advances in MIM, such as BEiT, MAE, CAE, and SimMIM, represent a paradigm shift. These methods mask a subset of the input and train the model to reconstruct either raw pixels or tokenized patches, with variants differing in target definition and architectural coupling between encoder and decoder. BEiT leverages tokenization inspired by NLP, whereas MAE directly regresses pixel values without a tokenizer.

Figure 7: Comparing the pipelines of CL and MIM methods—highlighting information flow and reconstruction targets.
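
The core MIM recipe (random patch masking plus a pixel-regression loss computed only on masked patches, in the spirit of MAE) is sketched below. The patchification, mask ratio, zero-filling of masked patches, and the tiny linear "model" are illustrative stand-ins rather than the actual BEiT/MAE/SimMIM architectures.

```python
import torch
import torch.nn as nn

def patchify(images, p=8):
    """Split (N, C, H, W) images into flattened patches of shape (N, L, p*p*C)."""
    n, c, h, w = images.shape
    x = images.reshape(n, c, h // p, p, w // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(n, (h // p) * (w // p), p * p * c)

def mim_loss(images, model, mask_ratio=0.75, p=8):
    """MAE-style objective: regress the raw pixels of randomly masked patches."""
    target = patchify(images, p)                     # (N, L, D) pixel targets
    n, l, _ = target.shape
    mask = torch.rand(n, l) < mask_ratio             # True = masked patch
    corrupted = target.masked_fill(mask.unsqueeze(-1), 0.0)  # simplistic masking
    pred = model(corrupted)                          # (N, L, D) reconstruction
    per_patch = ((pred - target) ** 2).mean(dim=-1)  # MSE per patch
    return (per_patch * mask).sum() / mask.sum()     # loss only on masked patches

# Toy usage with a linear layer standing in for an encoder-decoder transformer.
model = nn.Linear(8 * 8 * 3, 8 * 8 * 3)
loss = mim_loss(torch.randn(4, 3, 32, 32), model)
```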

Representative Pretext Tasks

Figure 8: Notable SSL pretext tasks spanning context-based, contrastive, and generative setups.

Integration with Other Learning Paradigms

The survey underscores the modularity and utility of SSL for enhancing and complementing other learning paradigms:

  • GANs: Embedding self-supervision within GANs (e.g., SS-GAN) facilitates improved discrimination and generation.
  • Semi-supervised Learning: SSL objectives supplement labeled-data losses for regularization (as in S⁴L), bolstering performance in low-label regimes (see the sketch after this list).
  • Multi-modal and Multi-view Learning: SSL is a natural fit for multi-sensory settings, promoting cross-modal consistency via synchronized pretext tasks.
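
The semi-supervised combination can be expressed as a single weighted objective: a supervised loss on the labeled batch plus a self-supervised auxiliary loss on the unlabeled batch, in the spirit of S⁴L. The sketch below reuses a rotation pretext for the auxiliary term; the toy encoder, heads, and the weighting factor are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
class_head = nn.Linear(256, 10)   # supervised classification head
rot_head = nn.Linear(256, 4)      # self-supervised rotation head

def s4l_style_loss(labeled_x, labels, unlabeled_x, weight=1.0):
    """Weighted sum of a supervised loss and an SSL auxiliary loss (sketch)."""
    sup = F.cross_entropy(class_head(encoder(labeled_x)), labels)
    # Rotation pretext on the unlabeled batch: pseudo-labels come for free.
    rotated = torch.cat([torch.rot90(unlabeled_x, k, dims=(2, 3)) for k in range(4)])
    rot_labels = torch.arange(4).repeat_interleave(unlabeled_x.size(0))
    ssl = F.cross_entropy(rot_head(encoder(rotated)), rot_labels)
    return sup + weight * ssl

# Toy usage: a small labeled batch plus a larger unlabeled batch.
loss = s4l_style_loss(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,)),
                      torch.randn(16, 3, 32, 32))
```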

Applications Across Modalities

  • Computer Vision: State-of-the-art SSL approaches yield transferable features for classification, detection, segmentation, and re-identification tasks; SSL for video is enhanced by modeling temporal order, speed, and multi-modal synchronization.
  • Natural Language Processing: SSL underpins foundational models such as BERT and GPT; masked token prediction, autoregression, and other pretext tasks drive rapid advances in bidirectional and generative language modeling.
  • Medical Imaging & Remote Sensing: SSL addresses the critical bottleneck of sparse labels by leveraging domain-specific pretext tasks, shown to yield robust representations in segmentation, detection, and change tracking.

Comparative Performance Landscape

The survey presents strong empirical evidence that:

  • CL-based SSL methods generally attain superior linear probe performance (for classification), owing to well-clustered latent features (a minimal linear-probe sketch follows this list).
  • MIM approaches, when fine-tuned, surpass contrastive methods on transfer tasks, notably in object detection and segmentation. This robustness is credited to MIM's reduced propensity to overfit and effective use of architectural inductive biases.
  • CL methods are resource-intensive, often requiring momentum encoders, memory queues, and large-batch negative mining, which limits scalability relative to the inherent parallelism of MIM-style training.
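
Linear probing, the evaluation protocol behind the first point above, freezes the pretrained encoder and trains only a linear classifier on top of its features. The sketch below illustrates the protocol with placeholder modules and a toy labeled batch; it is not tied to any specific method from the survey.

```python
import torch
import torch.nn as nn

# Placeholder standing in for a self-supervised pretrained backbone.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
for param in encoder.parameters():
    param.requires_grad = False       # freeze the encoder: only the probe is trained
encoder.eval()

probe = nn.Linear(256, 10)            # the linear classifier being evaluated
opt = torch.optim.SGD(probe.parameters(), lr=0.1)

# Toy labeled batch standing in for the downstream classification dataset.
images, labels = torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))
with torch.no_grad():
    feats = encoder(images)           # frozen features, computed once
for _ in range(5):
    loss = nn.functional.cross_entropy(probe(feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```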

Theoretical Advances

The survey notes that theory lags behind the empirical proliferation of SSL. Both the avoidance of representational collapse in non-contrastive methods (BYOL, SimSiam) and the comparative superiority of MIM over CL in certain settings merit rigorous theoretical elucidation.

Automatic Pretext Optimization

Automating the selection and composition of pretext tasks, tailored to maximize transfer performance on downstream applications, remains an open problem. The field warrants further study of data-driven and meta-learning-based approaches to task design.

Towards Unified Multimodal SSL

A compelling research trajectory is the construction of unified SSL paradigms that span modalities, architectures, and data types—particularly via transformer backbones—with the potential to yield cross-domain foundation models scalable to vision, language, and beyond.

Scaling Laws and Data Utilization

While SSL can in principle exploit essentially unlimited unlabeled data, identifying regimes and architectures that consistently benefit from data scaling is nontrivial. The mismatch in data efficiency and scaling behavior between generative and contrastive approaches remains to be systematized and theoretically explained.

Failure Modes and Guidance for Practitioners

The field increasingly recognizes that more unlabeled data is not universally beneficial, especially in semi-supervised settings or under distributional mismatch. Developing diagnostics and recommendations for algorithm selection, based on problem statistics and failure mode analysis, constitutes an actionable research agenda.

Conclusion

This survey establishes a rigorous taxonomy and critical synthesis of self-supervised learning algorithms, with detailed coverage of architectural strategies, integration with other machine learning paradigms, and the breadth of real-world applications. Key numerical findings include the superior linear probe results of contrastive approaches and the robust fine-tuning performance of masked image modeling methods, challenging prior assumptions about task-agnostic transferability. The work highlights open theoretical questions, the necessity for automatic and multimodal pretext engineering, and the nuanced relationship between data scaling and SSL performance. Self-supervised learning is poised to continue serving both as a theoretical frontier and a practical engine for progress across modalities and domains in AI.
