SimCSE: Simple Contrastive Learning of Sentence Embeddings (2104.08821v4)

Published 18 Apr 2021 in cs.CL and cs.LG

Abstract: This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.


Summary

The paper introduces SimCSE, a contrastive learning framework that improves sentence embeddings over previous methods on standard STS tasks.
The unsupervised model uses dropout as its only augmentation to create positive pairs, reaching an average Spearman's correlation of 76.3% on STS tasks with BERT base.
The supervised variant uses NLI "entailment" pairs as positives and "contradiction" pairs as hard negatives, reaching an average Spearman's correlation of 81.6% and improving alignment of positive pairs.

An Overview of SimCSE: Simple Contrastive Learning of Sentence Embeddings

The paper, SimCSE: Simple Contrastive Learning of Sentence Embeddings, introduces a simple contrastive learning framework that advances state-of-the-art sentence embeddings. The authors, Tianyu Gao, Xingcheng Yao, and Danqi Chen, present two variants of SimCSE: an unsupervised model and a supervised model that uses Natural Language Inference (NLI) datasets.

Unsupervised SimCSE

Unsupervised SimCSE trains the encoder to predict the input sentence itself under a contrastive objective. Standard dropout is the only source of noise: each sentence is passed through the encoder twice with different dropout masks, the two resulting embeddings are treated as a positive pair, and the other sentences in the mini-batch serve as negatives. The authors find that dropout acts as minimal data augmentation and that removing it leads to representation collapse. The contrastive objective keeps the sentence embedding space uniform, while the dropout-based positive pairs preserve alignment.
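The dropout-based objective can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_encoder` is a hypothetical stand-in for a pre-trained Transformer encoder (dropout followed by a linear map), and the temperature of 0.05 and dropout rate of 0.1 are assumed values that do not appear in the text above.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def toy_encoder(x, weights, rng, dropout_p=0.1):
    # Hypothetical encoder: inverted dropout on the input, then a linear map.
    keep = rng.random(x.shape) >= dropout_p
    return ((x * keep) / (1.0 - dropout_p)) @ weights

def unsup_simcse_loss(h1, h2, temperature=0.05):
    # Row i of `sim` holds similarities between view 1 of sentence i and
    # view 2 of every sentence in the batch; the positive is on the diagonal.
    sim = l2_normalize(h1) @ l2_normalize(h2).T / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 32))      # stand-in for 8 input sentences
weights = rng.normal(size=(32, 16))   # shared encoder parameters
h1 = toy_encoder(batch, weights, rng)  # first pass, first dropout mask
h2 = toy_encoder(batch, weights, rng)  # second pass, different dropout mask
loss = unsup_simcse_loss(h1, h2)
```

The only difference between `h1` and `h2` is the dropout mask, which is what makes the two views of a sentence a positive pair; every other row in the batch acts as an in-batch negative.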

Unsupervised SimCSE needs no labeled data, yet it achieves an average Spearman's correlation of 76.3% on standard STS tasks using BERT base, a 4.2-point improvement over the previous best unsupervised results. This makes it an attractive method for producing high-quality sentence embeddings from large unlabeled corpora.
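For readers unfamiliar with the metric, STS-style evaluation can be sketched as follows: score each sentence pair by the cosine similarity of its two embeddings, then report Spearman's rank correlation between those scores and the human ratings. The helper names and the toy data below are illustrative assumptions, and the rank helper assumes no tied values.

```python
import numpy as np

def cosine_scores(emb_a, emb_b):
    # Cosine similarity between corresponding rows of two embedding matrices.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def ranks(values):
    # Rank positions 0..n-1 in ascending order (assumes no ties).
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(x, y):
    # Spearman's correlation is Pearson's correlation computed on ranks.
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(5, 8))              # toy embeddings, sentence 1 of each pair
emb_b = rng.normal(size=(5, 8))              # toy embeddings, sentence 2 of each pair
gold = np.array([0.2, 1.0, 2.5, 3.8, 5.0])   # hypothetical human similarity ratings
rho = spearman(cosine_scores(emb_a, emb_b), gold)
```

Because only the ranking matters, the metric is insensitive to the scale of the cosine scores, which is why it suits comparing embedding models with differently distributed similarities.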

Supervised SimCSE

In contrast to the unsupervised model, the supervised variant of SimCSE draws on annotated pairs from NLI datasets. It keeps the same contrastive framework but uses entailment pairs as positives and contradiction pairs as hard negatives. This supervised signal improves the alignment between semantically related sentences.
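One way to picture the hard-negative objective is as a cross-entropy over twice as many candidates: for each premise, the candidates are every entailment hypothesis in the batch plus every contradiction hypothesis, and the target is the premise's own entailment hypothesis. The sketch below assumes this formulation and a temperature of 0.05; the function names and the random toy embeddings are illustrative, not taken from the paper's code.

```python
import numpy as np

def cosine_matrix(a, b):
    # All-pairs cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def sup_simcse_loss(anchor, positive, hard_negative, temperature=0.05):
    pos = cosine_matrix(anchor, positive) / temperature       # (N, N)
    neg = cosine_matrix(anchor, hard_negative) / temperature  # (N, N)
    logits = np.concatenate([pos, neg], axis=1)               # (N, 2N)
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = anchor.shape[0]
    # The target for row i is column i: the anchor's own entailment hypothesis.
    return float(-np.mean(log_prob[np.arange(n), np.arange(n)]))

rng = np.random.default_rng(0)
premise = rng.normal(size=(6, 16))
entailment = premise + 0.1 * rng.normal(size=(6, 16))  # toy positives: near the premise
contradiction = rng.normal(size=(6, 16))               # toy hard negatives: unrelated
good = sup_simcse_loss(premise, entailment, contradiction)
bad = sup_simcse_loss(premise, contradiction, entailment)  # roles swapped
```

Swapping the roles of the entailment and contradiction embeddings makes the loss much larger, which is the pressure that pulls entailment pairs together and pushes contradiction pairs apart during training.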

Quantitatively, the supervised SimCSE model achieves an average Spearman's correlation of 81.6% on STS tasks with BERT base, a 2.2-point gain over the best previous supervised results. This shows the value of combining a contrastive learning objective with annotated sentence pairs from NLI datasets.

Theoretical and Empirical Analysis

The paper also analyzes, both theoretically and empirically, how the SimCSE objectives regularize the embedding space. The authors use two metrics: alignment, which measures how close the embeddings of positive pairs are, and uniformity, which measures how evenly the embeddings are spread over the representation space. Empirically, unsupervised SimCSE substantially improves uniformity without sacrificing alignment, while the supervised model further improves alignment thanks to the guidance of labeled data.
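The two metrics can be computed directly from embeddings. The sketch below assumes the commonly used definitions from Wang and Isola (2020), which are not spelled out in the text above: alignment is the mean squared distance between normalized positive pairs, and uniformity is the log of the mean Gaussian potential over all pairs, with the exponent factor t = 2 as an assumed default. Lower is better for both.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment(x, x_pos):
    # Mean squared Euclidean distance between each embedding and its positive.
    x, x_pos = normalize(x), normalize(x_pos)
    return float(np.mean(np.sum((x - x_pos) ** 2, axis=1)))

def uniformity(x, t=2.0):
    # log E[exp(-t * ||f(x) - f(y)||^2)] over all distinct pairs (x, y).
    x = normalize(x)
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dist[upper]))))

rng = np.random.default_rng(0)
spread = rng.normal(size=(64, 16))   # directions spread over the sphere
collapsed = 0.05 * spread + 1.0      # nearly identical directions
u_spread = uniformity(spread)
u_collapsed = uniformity(collapsed)
```

In this toy comparison the spread-out embeddings get a lower (better) uniformity value than the nearly collapsed ones, which mirrors the kind of difference the metric is meant to expose.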

The authors also connect this analysis to earlier findings that pre-trained embeddings are anisotropic. The contrastive objective in SimCSE addresses this by "flattening" the singular value distribution of the sentence embedding space, which makes the representations more isotropic and hence more expressive.
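A toy illustration of what a "flat" versus a "peaked" singular value distribution looks like is given below. It is not the paper's analysis: the synthetic embeddings, the shared-offset construction of anisotropy, and the `top_singular_share` summary statistic are all assumptions made for illustration.

```python
import numpy as np

def top_singular_share(embeddings):
    # Fraction of the total singular-value mass carried by the largest value.
    # A flat spectrum gives a small share; a dominant direction gives a large one.
    s = np.linalg.svd(embeddings, compute_uv=False)
    return float(s[0] / s.sum())

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(200, 16))          # no preferred direction
anisotropic = isotropic + 5.0 * np.ones(16)     # shared offset dominates every vector
share_iso = top_singular_share(isotropic)
share_aniso = top_singular_share(anisotropic)
```

When every embedding shares a large common component, one singular value dwarfs the rest and most pairs of embeddings have high cosine similarity regardless of meaning; flattening the spectrum removes that dominant direction.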

Implications and Future Directions

Practically, SimCSE's simplicity and effectiveness suggest broad applicability in NLP systems that need robust sentence embeddings. Because dropout is the only augmentation, the unsupervised variant needs no separate augmentation pipeline and avoids the pitfalls of more complex data augmentation techniques. The supervised variant shows one way of using NLI datasets that may be adapted to other supervised tasks in NLP.

Theoretically, the introduction of SimCSE provides a strong foundation for future research in unsupervised and semi-supervised learning paradigms. The blend of contrastive learning with minimal augmentation opens avenues for exploring other forms of lightweight data augmentation suited to different types of language tasks.

Conclusion

SimCSE stands as a significant contribution to the field of sentence embeddings, presenting simple yet powerful mechanisms to achieve high-performance semantic representations. Both its unsupervised and supervised models push the boundaries of current methodologies, offering practical and theoretical advancements. Future research may build on this foundation to explore novel improvements in sentence embedding techniques, further enhancing their utility across various NLP applications.
