Calibration of Pre-trained Transformers

Published 17 Mar 2020 in cs.CL and cs.LG | (2003.07892v3)

Abstract: Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models' posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example? We focus on BERT and RoBERTa in this work, and analyze their calibration across three tasks: natural language inference, paraphrase detection, and commonsense reasoning. For each task, we consider in-domain as well as challenging out-of-domain settings, where models face more examples they should be uncertain about. We show that: (1) when used out-of-the-box, pre-trained models are calibrated in-domain, and compared to baselines, their calibration error out-of-domain can be as much as 3.5x lower; (2) temperature scaling is effective at further reducing calibration error in-domain, and using label smoothing to deliberately increase empirical uncertainty helps calibrate posteriors out-of-domain.

Summary

The paper shows that pre-trained Transformers are relatively well calibrated in-domain without any post-processing, with RoBERTa achieving lower calibration error than BERT.
The paper demonstrates that these models are substantially better calibrated than non-pre-trained baselines in out-of-domain settings, notably on datasets like HellaSWAG.
The paper finds that temperature scaling further reduces calibration error in-domain, while label smoothing helps calibrate confidence estimates out-of-domain.

Overview of "Calibration of Pre-trained Transformers"

The paper "Calibration of Pre-trained Transformers" by Shrey Desai and Greg Durrett examines the calibration of pre-trained Transformer models, focusing specifically on BERT and RoBERTa. Calibration, in this context, refers to the alignment of a model's predicted confidence with its empirical accuracy: among all predictions to which a model assigns 70% confidence, about 70% should turn out to be correct. The paper studies both in-domain and out-of-domain calibration across three tasks: natural language inference, paraphrase detection, and commonsense reasoning.
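One common way to quantify this alignment is the expected calibration error (ECE): predictions are grouped into confidence bins, and the gap between average confidence and accuracy in each bin is averaged, weighted by bin size. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function name, the choice of 10 equal-width bins, and the example inputs are all illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Estimate ECE with equal-width confidence bins (an assumed binning scheme).

    confidences: predicted probability of the chosen label, one per example.
    correct:     booleans, True where the prediction matched the gold label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of examples.
        ece += (len(members) / n) * abs(accuracy - avg_confidence)
    return ece


# Hypothetical data: ten predictions at 70% confidence, seven of them correct.
calibrated = expected_calibration_error([0.7] * 10, [True] * 7 + [False] * 3)

# Hypothetical data: ten predictions at 90% confidence, only five correct.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

In this toy setup the first model's confidence matches its accuracy, so its ECE is close to zero, while the second model's confidence exceeds its accuracy by about 0.4.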

Key Findings

The paper reports several findings on the accuracy and calibration of these models:

In-domain Calibration: Used without any post-processing, BERT and RoBERTa are relatively well calibrated in-domain. Their expected calibration error (ECE) is lower than that of non-pre-trained models, and RoBERTa is consistently better calibrated than BERT.
Out-of-domain Performance: Pre-trained models are substantially better calibrated than non-pre-trained baselines in out-of-domain settings. The gap is most pronounced on challenging datasets such as HellaSWAG, where the calibration error of pre-trained models can be as much as 3.5x lower than that of the baselines.
Temperature Scaling: Temperature scaling improves in-domain calibration with little computational overhead; after scaling, BERT and RoBERTa reach ECE values between 0.7 and 0.8 in these settings. Its effectiveness suggests that the output distributions of pre-trained models need only a simple rescaling to yield calibrated probability estimates.
Label Smoothing: While traditional maximum likelihood estimation provides the best in-domain calibration, models trained with label smoothing are better calibrated out-of-domain. Label smoothing counteracts overconfidence, which is particularly beneficial on adversarial or shifted data distributions.
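The two techniques above can be sketched in a few lines. Temperature scaling divides the logits by a single scalar T, chosen to minimize negative log-likelihood on held-out data, before the softmax. Label smoothing replaces the one-hot training target with a softened distribution. This is an illustrative sketch rather than the paper's code: the grid search over T, the smoothing variant (mass 1 - alpha on the gold label, the rest spread evenly), the value alpha = 0.1, and all function names and data are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logits_batch, labels, temperature):
    """Average negative log-likelihood of the gold labels at a given temperature."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        scaled = [z / temperature for z in logits]
        peak = max(scaled)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)


def fit_temperature(logits_batch, labels, grid=None):
    """Pick the temperature with the lowest held-out NLL from a candidate grid.

    A plain grid search is used here for simplicity; gradient-based
    optimization of T is a common alternative.
    """
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(grid, key=lambda t: mean_nll(logits_batch, labels, t))


def smooth_labels(label, num_classes, alpha=0.1):
    """Soft target: 1 - alpha on the gold label, alpha spread over the others."""
    off_value = alpha / (num_classes - 1)
    return [1.0 - alpha if k == label else off_value for k in range(num_classes)]


# Hypothetical held-out set: the model always outputs logits [5, 0]
# (about 99% confidence in class 0) but is right only 7 times out of 10.
dev_logits = [[5.0, 0.0]] * 10
dev_labels = [0] * 7 + [1] * 3

temperature = fit_temperature(dev_logits, dev_labels)
scaled_confidence = softmax([5.0, 0.0], temperature)[0]
soft_target = smooth_labels(label=1, num_classes=3, alpha=0.1)
```

On this toy data the fitted temperature is greater than 1, which pulls the model's confidence down toward its 70% accuracy. Temperature scaling does not change which label is predicted, since dividing all logits by the same positive constant preserves their order.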

Practical and Theoretical Implications

The implications of this study are two-fold:

Deployment Confidence: With improved calibration, these models can provide more reliable confidence estimates, supporting safer deployment in applications where understanding model uncertainty is critical, such as automated decision-making systems.
Model Diagnostics and Trust: Calibration offers an avenue towards demystifying the "black-box" nature of deep learning systems, providing a quantitative measure by which the uncertainty of models can be assessed. This could catalyze advancements in designing more transparent and interpretable AI systems.

Future Directions

Future work could examine calibration across a wider range of pre-trained architectures and study the effects of domain shift in more applied settings. Additionally, given the scale and complexity of modern Transformer models, further research could investigate the relationship between model size and calibration, potentially leading to new architectures that maintain high performance while remaining well calibrated across domains.

In conclusion, this paper contributes valuable insights into the calibration characteristics of Transformer-based models, offering actionable methodologies like temperature scaling and label smoothing for improving calibration in both in-domain and out-of-domain scenarios. This positions the work as a foundational step towards enhancing the reliability of probabilistic predictions in natural language processing pipelines.
