- The paper demonstrates that modern architectures like Vision Transformers and MLP-Mixer maintain strong calibration despite increases in model size.
- It establishes that architectural design is more pivotal for calibration than model size or the extent of pretraining.
- The study validates temperature scaling as an effective post-hoc method to enhance model reliability across distribution shifts.
Revisiting the Calibration of Modern Neural Networks: A Summary
The paper "Revisiting the Calibration of Modern Neural Networks" presents a comprehensive empirical examination of the calibration properties of recent state-of-the-art image classification models. The authors address a critical aspect of neural network deployment, particularly in safety-critical domains such as autonomous driving and medical diagnostics: calibration, that is, how accurately a model's predictive probabilities reflect the true likelihood of outcomes. A model is well calibrated if, among all predictions made with confidence p, a fraction p turn out to be correct.
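To make this concrete, calibration is commonly quantified with the Expected Calibration Error (ECE): predictions are binned by confidence, and the gap between average confidence and average accuracy is averaged across bins, weighted by bin size. Below is a minimal NumPy sketch of an equal-width-bin ECE estimator; the binning scheme, variable names, and synthetic data are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence, then average the
    |accuracy - confidence| gap per bin, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Usage: confidences = max softmax probability per prediction,
# correct = 1.0 if the argmax class matched the label, else 0.0.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
corr = (rng.uniform(size=1000) < conf).astype(float)  # roughly calibrated
print(f"ECE: {expected_calibration_error(conf, corr):.4f}")
```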
Key Findings
- Relief from a Negative Trend: Earlier analyses indicated that calibration deteriorated as model size and accuracy grew, a trend apparent in convolutional networks. This work reveals that recent non-convolutional architectures, specifically MLP-Mixer and Vision Transformers (ViT), are markedly better calibrated than their predecessors. These models do not suffer the same degradation in calibration seen in earlier generations when subjected to distribution shift or increases in model size.
- Architectural Contributions: While model size and the extent of pretraining were traditionally thought to drive calibration, the authors demonstrate that architectural design is a more significant determinant. Notably, neither model size nor the amount of pretraining alone can fully account for the observed differences in calibration across architectures. This underscores an evolving landscape in which novel designs such as Vision Transformers contribute meaningfully to both calibration and accuracy.
- Temperature Scaling as a Calibration Tool: Consistent with earlier studies, the paper validates the efficacy of post-hoc calibration methods such as temperature scaling, which rescales the logits by a single scalar fitted on held-out data, reducing calibration error without altering classification accuracy (see the sketch after this list). The technique also serves as an analytical tool: after removing simple systematic over- or under-confidence, intrinsic calibration differences between model families stand out more clearly.
- Consistency Across Dataset Shifts: The paper assesses calibration not only in-distribution but also under assorted types of distribution shift, using benchmarks such as ImageNet-C and ImageNet-R. Calibration that persists under shift is critical for realistic deployments, so the finding that MLP-Mixer and Vision Transformers remain comparatively well calibrated under such conditions is particularly significant.
- Correlation with Accuracy: Intriguingly, the paper documents a consistent correlation between accuracy and calibration, particularly under distribution shift. Improvements in accuracy often go hand in hand with improvements in calibration, although trade-offs can still appear within a given model family, dictated by its particular architectural properties.
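As referenced above, temperature scaling divides the logits by a single scalar T > 0 before the softmax, so the predicted class (and therefore accuracy) is unchanged, while confidences are softened (T > 1) or sharpened (T < 1). Here is a minimal sketch that fits T by minimizing validation negative log-likelihood with a simple grid search; the grid search, synthetic data, and names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nll(logits, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.25, 5.0, 96)):
    """Pick the T that minimizes validation NLL. Dividing logits by T
    rescales confidence but never changes the argmax class."""
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Tiny synthetic demo (illustrative only): multiplying well-separated
# logits by 3 sharpens confidences; the fitted T compensates.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=2000)
logits = 3.0 * (rng.normal(size=(2000, 10)) + 2.0 * np.eye(10)[labels])
T = fit_temperature(logits, labels)
print(f"fitted temperature: {T:.2f}")
# Calibrated probabilities for new data: softmax(test_logits / T).
```

Because T is fitted on held-out data and applied uniformly, the method is cheap to use after deployment and requires no retraining of the underlying network.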
Theoretical and Practical Implications
The paper reinforces the importance of evaluating neural networks on both accuracy and calibration, particularly when they are deployed in settings where decisions based on model outputs carry real-world consequences. From a theoretical standpoint, the results suggest pathways for architectural innovation, driving progress toward models that are simultaneously more accurate and better calibrated. Practically, the insights on temperature scaling emphasize its utility in correcting systematic biases in model confidence, thereby improving reliability after deployment without retraining.
Future Directions
The identification of architecture as a key factor in neural network calibration suggests multiple avenues for future research. Innovations in neural architectures could focus on optimizing calibration directly, rather than relying solely on post-hoc adjustments. Further, extending the results to other modalities and domains, beyond image classification, would provide a more comprehensive understanding of model calibration in varied AI applications.
In conclusion, this paper provides a clear-eyed reassessment of model calibration in modern neural networks, advocating for a nuanced understanding that balances architectural choices, pretraining strategies, and post-hoc calibration methods. The work presents significant implications for both the development and deployment of neural networks in critical applications and establishes a foundation for further exploration into the architectural basis of model calibration.