Analysis and Comparison of Classification Metrics (2209.05355v4)
Abstract: A variety of different performance metrics are commonly used in the machine learning literature for the evaluation of classification systems. Some of the most common ones for measuring quality of hard decisions are standard and balanced accuracy, standard and balanced error rate, F-beta score, and Matthews correlation coefficient (MCC). In this document, we review the definition of these and other metrics and compare them with the expected cost (EC), a metric introduced in every statistical learning course but rarely used in the machine learning literature. We show that both the standard and balanced error rates are special cases of the EC. Further, we show its relation with F-beta score and MCC and argue that EC is superior to these traditional metrics for being based on first principles from statistics, and for being more general, interpretable, and adaptable to any application scenario. The metrics mentioned above measure the quality of hard decisions. Yet, most modern classification systems output continuous scores for the classes which we may want to evaluate directly. Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk, among others. The last three metrics are special cases of a family of metrics given by the expected value of proper scoring rules (PSRs). We review the theory behind these metrics, showing that they are a principled way to measure the quality of the posterior probabilities produced by a system. Finally, we show how to use these metrics to compute a system's calibration loss and compare this metric with the widely-used expected calibration error (ECE), arguing that calibration loss based on PSRs is superior to the ECE for being more interpretable, more general, and directly applicable to the multi-class case, among other reasons.
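To illustrate the claim that the standard and balanced error rates are special cases of the EC, here is a minimal sketch. It assumes the usual empirical form EC = sum_i sum_j c_ij P_i R_ij, where P_i is the prior of class i and R_ij is the fraction of class-i samples that receive decision j. The function name `expected_cost`, the toy labels and decisions, and the two cost matrices are illustrative choices, not taken from the paper.

```python
from collections import Counter

def expected_cost(labels, decisions, costs, priors=None):
    """Empirical expected cost: sum_i sum_j costs[i][j] * P_i * R_ij.

    R_ij is the fraction of class-i samples given decision j.
    P_i defaults to the class frequency observed in `labels`.
    """
    n_classes = len(costs)
    class_counts = Counter(labels)
    pair_counts = Counter(zip(labels, decisions))
    if priors is None:
        priors = [class_counts[i] / len(labels) for i in range(n_classes)]
    ec = 0.0
    for i in range(n_classes):
        if class_counts[i] == 0:
            continue  # class absent from the data: R_ij is undefined, skip it
        for j in range(len(costs[i])):
            r_ij = pair_counts[(i, j)] / class_counts[i]
            ec += costs[i][j] * priors[i] * r_ij
    return ec

# Hypothetical toy data: 6 samples of class 0 (1 error), 4 of class 1 (2 errors).
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
decisions = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
K = 2
priors = [labels.count(i) / len(labels) for i in range(K)]

# 0-1 costs: every error costs 1, so the EC reduces to the standard error rate.
costs_01 = [[0.0, 1.0], [1.0, 0.0]]
ec_01 = expected_cost(labels, decisions, costs_01)

# Costs 1 / (K * P_i) for errors on class i: the priors cancel, leaving the
# average of the per-class error rates, i.e. the balanced error rate.
costs_bal = [[0.0 if i == j else 1.0 / (K * priors[i]) for j in range(K)]
             for i in range(K)]
ec_bal = expected_cost(labels, decisions, costs_bal)

error_rate = sum(l != d for l, d in zip(labels, decisions)) / len(labels)
per_class_err = [
    sum(1 for l, d in zip(labels, decisions) if l == i and d != i) / labels.count(i)
    for i in range(K)
]
balanced_error_rate = sum(per_class_err) / K

print(ec_01, error_rate)
print(ec_bal, balanced_error_rate)
```

Any other cost matrix or set of priors can be passed to the same function, which is the sense in which the EC adapts to an application: the two error rates are just two particular choices of `costs`.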