Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning

Published 3 Nov 2020 in cs.CL and cs.LG | (2011.01403v3)

Abstract: State-of-the-art natural language understanding classification models follow two stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. However, the cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propose a supervised contrastive learning (SCL) objective for the fine-tuning stage. Combined with cross-entropy, our proposed SCL loss obtains significant improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in few-shot learning settings, without requiring specialized architecture, data augmentations, memory banks, or additional unsupervised data. Our proposed fine-tuning objective leads to models that are more robust to different levels of noise in the fine-tuning training data, and can generalize better to related tasks with limited labeled data.

Authors (4)

Citations (455)

Summary

The paper demonstrates that integrating supervised contrastive loss with cross-entropy improves model stability and boosts few-shot performance.
The method employs a weighted SCL term with hyperparameters like temperature and lambda to refine representation clustering.
Empirical results on GLUE benchmarks show significant accuracy gains in noisy settings and improved generalization in transfer tasks.

Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning

The authors propose fine-tuning pre-trained language models with a supervised contrastive learning (SCL) objective added to the conventional cross-entropy (CE) loss. The combined objective targets two known shortcomings of CE, sub-optimal generalization and instability across runs, both of which are most pronounced in few-shot learning settings.

Approach and Methodology

The proposed objective adds an SCL term that shapes the learned representations: examples of the same class are pulled closer together, while examples of different classes are pushed apart. The idea draws on contrastive objectives that have worked well in self-supervised learning in other domains, but here the class labels define which examples count as similar, and the setting is NLP classification. The SCL term is combined with the CE loss as a weighted sum. Two hyperparameters govern its behavior and need tuning: the temperature τ, which scales the similarities inside the contrastive term, and the weighting coefficient λ, which sets the balance between the SCL and CE terms.
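The weighted objective described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the two terms are mixed as (1 - λ)·CE + λ·SCL, that the contrastive term compares L2-normalized sentence embeddings within a batch, and that the SCL term is averaged over anchors that have at least one same-class partner. The function names and the default values of `tau` and `lam` are hypothetical choices made for the example.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean cross-entropy; logits has shape (N, C), labels holds N class ids."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def supervised_contrastive(embeddings, labels, tau=0.3):
    """Supervised contrastive term over one batch; embeddings has shape (N, D).

    Assumes no embedding is the zero vector (it is L2-normalized below).
    """
    labels = np.asarray(labels)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Positives for anchor i: other examples in the batch with the same label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no same-class pairs in the batch, so nothing to contrast

    z = np.asarray(embeddings, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-length rows
    sim = z @ z.T / tau  # temperature-scaled cosine similarities

    # Log of the denominator: sum over all k != i of exp(sim[i, k]).
    sim_others = np.where(not_self, sim, -np.inf)
    row_max = sim_others.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(
        np.exp(sim_others - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denom

    # Average log-probability of the positives, per anchor (assumed reduction).
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / n_pos[has_pos]
    return float(per_anchor.mean())


def combined_loss(logits, embeddings, labels, lam=0.9, tau=0.3):
    """Weighted sum of the two terms; lam=0 recovers plain cross-entropy."""
    ce = cross_entropy(logits, labels)
    scl = supervised_contrastive(embeddings, labels, tau)
    return (1.0 - lam) * ce + lam * scl
```

In a fine-tuning loop, `logits` would come from the classification head and `embeddings` from the encoder's sentence representation for the same batch. Because positives are found only within the batch, a batch with no repeated labels contributes nothing to the contrastive term, which is one reason batch composition matters for this kind of objective.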

Experimental Evaluation and Results

The experiments primarily use the GLUE benchmark, with tasks ranging from sentiment analysis (SST-2) to natural language inference (QNLI, MNLI). The main findings are:

Few-shot Learning Improvements:
- For 20 training examples, the SCL-enhanced model improved QNLI results by 10.7 points compared to baseline CE, indicating robust performance with minimal data.
- As the training size increases (N=100, 1000), improvements persist but with diminishing returns, highlighting the approach's strength in data-scarce scenarios.
Robustness to Noise:
- The model's robustness was evaluated using augmented noisy datasets, constructed via back-translation with varying noise levels. SCL significantly enhanced model performance under noisy conditions, especially for inference tasks like MNLI, achieving up to a 7-point gain at higher noise levels.
Full Dataset Performance:
- Although less pronounced than in few-shot settings, notable gains were observed in fully supervised environments, including a 3.5-point increase in QNLI accuracy, suggesting that SCL can benefit conventional data-rich scenarios as well.
Generalization to Related Tasks:
- Transfer learning experiments demonstrated improvements when applying models fine-tuned with the SCL objective to related datasets such as Amazon-2 and Yelp-2, attesting to enhanced generalizability.

Implications and Future Prospects

Adding SCL to the fine-tuning pipeline of pre-trained language models improves convergence stability and consistency across runs, and it yields more robust and generalizable models, particularly in few-shot learning applications. It also opens avenues for further work on contrastive learning mechanisms, which could benefit semi-supervised and unsupervised NLP applications.

Future research might focus on scaling this approach, for example by incorporating automated data augmentation techniques or by tuning batch sizes for contrastive learning, to extend its benefits to a broader range of natural language understanding tasks and datasets.