A Closer Look at AUROC and AUPRC under Class Imbalance

(arXiv:2401.06091)
Published Jan 11, 2024 in cs.LG and stat.ME

Abstract

In ML, a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing LLMs to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.

Figure: Correlation between AUROC gap, AUPRC in validation, and AUROC with prevalence ratios and confidence intervals.

Overview

  • The paper questions the preferential use of AUPRC over AUROC for binary classification tasks in imbalanced datasets.

  • It reveals a probabilistic connection between AUROC and AUPRC, challenging the assumption that AUPRC is always superior.

  • AUPRC is shown to introduce potential bias by favoring model improvements on better-represented subpopulations.

  • An extensive review of scholarly articles found little empirical evidence to support the common preference for AUPRC.

  • The research advocates for evidence-based selection of evaluation metrics, considering fairness and ethical implications.

Overview of AUROC and AUPRC Evaluation Metrics

The debate over the most appropriate evaluation metrics for binary classification, especially under class imbalance, has been ongoing within the machine learning community. Two core metrics are the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). This analysis dissects the widely held belief that AUPRC is superior to AUROC in scenarios where positive instances are much rarer than negative ones; a brief illustration of how the two metrics behave as imbalance grows follows below.
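
As a minimal sketch (synthetic labels and an uninformative random scorer, computed with scikit-learn; not drawn from the paper's code), the Python snippet below shows the familiar baseline behavior of the two metrics: AUROC stays near 0.5 for a no-signal model regardless of prevalence, while AUPRC for the same model roughly tracks the prevalence itself.

    # Baseline behavior of AUROC and AUPRC under class imbalance (assumed synthetic data).
    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(1)
    for prevalence in (0.5, 0.1, 0.01):
        y = (rng.random(50_000) < prevalence).astype(int)
        random_scores = rng.random(y.size)  # an uninformative scorer: pure noise
        print(f"prevalence={prevalence:>4}: "
              f"AUROC={roc_auc_score(y, random_scores):.3f}  "
              f"AUPRC={average_precision_score(y, random_scores):.3f}")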

The Interrelationship Between AUROC and AUPRC

The paper shows that AUROC and AUPRC are related in concise probabilistic terms, a relationship that undercuts the assumption of AUPRC's superiority. Examining the two metrics reveals that while AUROC treats all false positives uniformly, AUPRC weights false positives according to the model's overall likelihood of producing a score that high, a quantity the paper terms the "firing rate", so that mistakes among the model's highest-scoring predictions carry more weight. The sketch below checks these probabilistic readings numerically.
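
As an illustration rather than the paper's derivation (synthetic Gaussian scores under heavy imbalance are assumed), the snippet below recovers AUROC as the probability that a randomly drawn positive outscores a randomly drawn negative, and AUPRC (computed as average precision) as the expectation, over positive samples, of the precision at that sample's score threshold; the denominator of each precision term, P(score >= threshold), is the model's firing rate at that threshold.

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(0)
    n, prevalence = 20_000, 0.02                   # heavy class imbalance (assumed)
    y = (rng.random(n) < prevalence).astype(int)
    scores = rng.normal(loc=1.5 * y, scale=1.0)    # positives score higher on average

    pos, neg = scores[y == 1], scores[y == 0]

    # AUROC as a probability: P(score of a random positive > score of a random negative),
    # computed here over every positive-negative pair.
    auroc_pairwise = (pos[:, None] > neg[None, :]).mean()

    # AUPRC (average precision) as an expectation over positives of the precision at
    # that positive's score threshold; each precision divides by P(score >= threshold),
    # i.e., the "firing rate" at that threshold.
    auprc_expectation = np.mean([y[scores >= s].mean() for s in pos])

    print(f"AUROC: {roc_auc_score(y, scores):.4f}  pairwise estimate: {auroc_pairwise:.4f}")
    print(f"AUPRC: {average_precision_score(y, scores):.4f}  precision expectation: {auprc_expectation:.4f}")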

Implications for Metric Choice

The study further examines the consequences of selecting models by AUROC versus AUPRC. The results are telling: optimizing for AUROC weights improvements equally across the model's output distribution and therefore does not favor any particular subset of samples. By contrast, AUPRC inherently prioritizes reducing mistakes among high-scoring samples, which can skew model improvements toward subpopulations with more frequent positive labels. Such bias can misguide model selection and may also introduce fairness issues in critical domains like healthcare, where equitable treatment of diverse patient populations is crucial; the simulation sketch below probes this effect.
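
One way to probe this effect concretely is a small simulation: pool two subgroups with very different positive-label prevalences, apply the same-sized score improvement to one subgroup at a time, and compare the resulting gains in pooled AUROC and AUPRC. The sketch below is an assumption-laden illustration (Gaussian scores, equal subgroup sizes, arbitrary effect sizes), not the paper's experimental setup, and the exact numbers depend on the assumed distributions.

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    def pooled_metrics(boost_a, boost_b, n=50_000, seed=2):
        """Pool two equally sized subgroups; subgroup A has far more frequent positives."""
        rng = np.random.default_rng(seed)              # fixed seed -> paired comparison
        y_a = (rng.random(n) < 0.20).astype(int)       # subgroup A: 20% positives (assumed)
        y_b = (rng.random(n) < 0.02).astype(int)       # subgroup B:  2% positives (assumed)
        s_a = rng.normal((1.0 + boost_a) * y_a, 1.0)   # 'boost' widens the score separation
        s_b = rng.normal((1.0 + boost_b) * y_b, 1.0)
        y, s = np.concatenate([y_a, y_b]), np.concatenate([s_a, s_b])
        return roc_auc_score(y, s), average_precision_score(y, s)

    for name, metrics in [("baseline", pooled_metrics(0.0, 0.0)),
                          ("improve A (high prevalence)", pooled_metrics(0.5, 0.0)),
                          ("improve B (low prevalence)", pooled_metrics(0.0, 0.5))]:
        auroc, auprc = metrics
        print(f"{name:<28} AUROC={auroc:.3f}  AUPRC={auprc:.3f}")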

Literature Analysis and Community Reflection

Strikingly, an extensive literature review covering over 1.5 million papers reveals little empirical support for favoring AUPRC. The review uncovers a pattern of misattributed citations and unchallenged assertions, raising concerns about the rigor of such claims in the field's scientific discourse. The survey underscores the need for the machine learning community to critically reassess its metric preferences and ground them in empirical evidence rather than perpetuate unfounded assertions.

Moving Forward

Ultimately, this research serves as a bridge toward more evidence-based practice within machine learning. It prompts the scientific community to reconsider ingrained beliefs about evaluation metric preferences, taking into account not only their mathematical properties but also their implications for fairness and equity. The findings underscore the importance of selecting evaluation metrics that align with deployment goals and with the ethical considerations of diverse, dynamic real-world settings.
