Exploring Simple Siamese Representation Learning

Published 20 Nov 2020 in cs.CV and cs.LG | (2011.10566v1)

Abstract: Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our "SimSiam" method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.

Summary

The paper demonstrates that simple weight-sharing Siamese networks can learn effective image representations, using a stop-gradient operation to prevent collapse.
It employs a two-branch architecture with a shared encoder and a prediction MLP, achieving competitive ImageNet performance (67.7% to 68.1% linear evaluation accuracy).
The study shows that negative sample pairs, large batches, and momentum encoders are not required for effective unsupervised visual learning.

Exploring Simple Siamese Representation Learning

The paper "Exploring Simple Siamese Representation Learning" by Xinlei Chen and Kaiming He from Facebook AI Research investigates the efficacy of Siamese network architectures for unsupervised visual representation learning without employing commonly used strategies such as negative sample pairs, large batch sizes, and momentum encoders. Their approach, dubbed SimSiam, demonstrates that simple weight-sharing neural networks can produce meaningful image representations efficiently.

Context and Motivation

Siamese networks underpin several recent advances in unsupervised visual representation learning. Contrastive methods (e.g., SimCLR) avoid collapsing solutions by repelling negative sample pairs, and therefore require large batch sizes to supply enough negatives. BYOL instead relies on a momentum encoder to prevent collapse, and SwAV uses online clustering. The primary motivation of this paper is to strip these methods down and demonstrate that competitive performance can be achieved with a much simpler approach.

Methodology

SimSiam combines three components to learn without collapse:

Siamese Architecture: The network has two identical branches sharing the same weights, which process two augmented views of an input image. Each branch consists of an encoder network, followed by a projection Multi-Layer Perceptron (MLP).
Prediction MLP: One branch additionally applies a prediction MLP to the encoder's output before its similarity with the other branch's output is measured.
Stop-Gradient Operation: The crucial component in SimSiam is the stop-gradient operation applied to the output of the branch without the prediction MLP. That output is treated as a constant during backpropagation, which the experiments show is essential for preventing collapse.

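The loss is the negative cosine similarity between the prediction MLP's output for one view and the encoder output for the other view, symmetrized over the two views. Because of the stop-gradient, the encoder output used as the target is treated as a constant, so gradients flow only through the branch that includes the prediction MLP. The authors hypothesize that this asymmetric design avoids trivial constant solutions by implicitly introducing an alternating optimization similar to Expectation-Maximization (EM) algorithms.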
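As a rough illustration of the loss described above, the following sketch computes the symmetrized negative cosine similarity, L = 1/2 D(p1, stopgrad(z2)) + 1/2 D(p2, stopgrad(z1)), on plain Python lists. It is not the authors' implementation: the function names (`normalize`, `neg_cosine`, `simsiam_loss`, `grad_wrt_p`) are hypothetical, and because there is no autograd here, the stop-gradient is mimicked by differentiating with respect to p only and treating z as a constant.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def neg_cosine(p, z):
    """D(p, z): negative cosine similarity between predictor output p and target z."""
    p_hat, z_hat = normalize(p), normalize(z)
    return -sum(a * b for a, b in zip(p_hat, z_hat))


def simsiam_loss(p1, z1, p2, z2):
    """Symmetrized loss: 0.5 * D(p1, z2) + 0.5 * D(p2, z1).

    In an autograd framework z1 and z2 would be wrapped in a stop-gradient
    (detach) call. Here they are plain numbers, so they are constants anyway.
    """
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)


def grad_wrt_p(p, z):
    """Gradient of D(p, z) with respect to p only, with z held constant.

    This is the single gradient path that the stop-gradient leaves open.
    """
    p_norm = math.sqrt(sum(x * x for x in p))
    p_hat, z_hat = normalize(p), normalize(z)
    cos = sum(a * b for a, b in zip(p_hat, z_hat))
    return [-(zh - cos * ph) / p_norm for ph, zh in zip(p_hat, z_hat)]


if __name__ == "__main__":
    # Toy 2-D vectors (made up for illustration, not taken from the paper).
    p1, z1 = [1.0, 0.0], [1.0, 0.0]
    p2, z2 = [0.0, 1.0], [1.0, 0.0]
    print(simsiam_loss(p1, z1, p2, z2))
    print(grad_wrt_p([3.0, 4.0], [1.0, 0.0]))
```

The loss depends only on the directions of the vectors, so the gradient with respect to p is always orthogonal to p: an update rotates the prediction toward the target without rewarding a change in its length.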

Empirical Results

The experiments on the ImageNet dataset show that SimSiam is competitive in unsupervised representation learning. It reached 67.7% validation accuracy under linear evaluation on ImageNet, on par with or ahead of several established methods under comparable training conditions.

Key numerical results include:

With a batch size of 512, SimSiam achieved 68.1% linear evaluation accuracy after 100 epochs of training.
Comparisons with SimCLR, MoCo v2, BYOL, and SwAV indicated that SimSiam is competitive while using simpler mechanisms. The paper reports it as the most accurate of these methods at 100 epochs of pre-training, though it gains less from longer training.
It was robust to the choice of batch size, performing well even with relatively small batches (e.g., 128).

Hypothesis and Analysis

The authors hypothesize that the effectiveness of SimSiam comes from an implicit optimization problem that resembles an EM algorithm. The stop-gradient operation divides the learning problem into two subproblems that are solved alternately: updating the network parameters and computing the image representations. Under this hypothesis, the alternating updates are what keep the representations from collapsing.
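The hypothesis can be written out as a short worked formulation. The notation below follows the paper as best recalled here and should be read as a sketch: F_θ is the network with parameters θ, T is a random augmentation, x is an image, and η_x is an auxiliary variable holding the representation of x.

```latex
% Objective over network parameters \theta and per-image representations \eta
\mathcal{L}(\theta, \eta) = \mathbb{E}_{x, \mathcal{T}}
  \Big[ \big\| \mathcal{F}_{\theta}(\mathcal{T}(x)) - \eta_x \big\|_2^2 \Big]

% Alternating optimization: solve one subproblem while the other set is fixed
\theta^{t} \leftarrow \arg\min_{\theta} \mathcal{L}(\theta, \eta^{t-1})
\qquad
\eta^{t} \leftarrow \arg\min_{\eta} \mathcal{L}(\theta^{t}, \eta)

% The \eta subproblem has a closed-form solution per image
\eta^{t}_x \leftarrow \mathbb{E}_{\mathcal{T}}
  \big[ \mathcal{F}_{\theta^{t}}(\mathcal{T}(x)) \big]

% One-sample approximation with a single augmentation \mathcal{T}'
\eta^{t}_x \leftarrow \mathcal{F}_{\theta^{t}}(\mathcal{T}'(x))
```

In the θ subproblem, η is held fixed, which is where the stop-gradient appears: no gradient flows into the target representation. Approximating the expectation with a single augmentation and taking one gradient step per alternation gives an update that resembles SimSiam's training step.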

The prediction MLP was also found to be necessary: without it, the model failed to learn useful representations. Additionally, collapse was avoided across the batch normalization configurations the authors tested, which suggests that batch normalization is not what prevents collapse and further emphasizes the role of the stop-gradient mechanism.

Implications and Future Directions

The success of SimSiam highlights the potential to simplify representation learning without compromising performance. It suggests that the Siamese architecture itself may be a key reason these methods work, and invites a reevaluation of the complexity added by mechanisms such as large batches and momentum encoders in previous studies.

Future research could focus on further theoretical understanding of why stop-gradient works so effectively in preventing collapse. Additionally, exploring more comprehensive comparisons with other forms of alternating algorithms and different modes of representation learning can deepen the insight into unsupervised learning frameworks.

SimSiam's simplicity and robustness may pave the way for lighter, more computationally efficient models for unsupervised visual learning, and make the method a natural reference point for future work in this domain.
