A Closer Look at AUROC and AUPRC under Class Imbalance

(arXiv:2401.06091)
Published Jan 11, 2024 in cs.LG and stat.ME

Abstract

In ML, a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing LLMs to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.

Figure: Correlation between AUROC gap, AUPRC in validation, and AUROC with prevalence ratios and confidence intervals.

Overview

  • The paper questions the preferential use of AUPRC over AUROC for binary classification tasks in imbalanced datasets.

  • It reveals a probabilistic connection between AUROC and AUPRC, challenging the assumption that AUPRC is always superior.

  • AUPRC is shown to introduce potential bias by favoring model improvements on better-represented subpopulations.

  • An extensive review of scholarly articles found little empirical evidence to support the common preference for AUPRC.

  • The research advocates for evidence-based selection of evaluation metrics, considering fairness and ethical implications.

Overview of AUROC and AUPRC Evaluation Metrics

The debate over the most appropriate evaluation metrics for binary classification, especially under class imbalance, has been ongoing within the machine learning community. Two core metrics are the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). This analysis dissects the widely held belief that AUPRC is superior to AUROC in scenarios where positive instances are much rarer than negative ones; a brief illustration of how the two metrics behave as imbalance grows follows below.
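
As a minimal sketch (synthetic labels and an uninformative random scorer, computed with scikit-learn; not drawn from the paper's code), the Python snippet below shows the familiar baseline behavior of the two metrics: AUROC stays near 0.5 for a no-signal model regardless of prevalence, while AUPRC for the same model roughly tracks the prevalence itself.

    # Baseline behavior of AUROC and AUPRC under class imbalance (assumed synthetic data).
    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(1)
    for prevalence in (0.5, 0.1, 0.01):
        y = (rng.random(50_000) < prevalence).astype(int)
        random_scores = rng.random(y.size)  # an uninformative scorer: pure noise
        print(f"prevalence={prevalence:>4}: "
              f"AUROC={roc_auc_score(y, random_scores):.3f}  "
              f"AUPRC={average_precision_score(y, random_scores):.3f}")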

The Interrelationship Between AUROC and AUPRC

The paper shows that AUROC and AUPRC are related in concise probabilistic terms, a relationship that undercuts the assumption of AUPRC's superiority. Examining the two metrics reveals that while AUROC treats all false positives uniformly, AUPRC weights false positives according to the model's overall likelihood of producing a score that high, a quantity the paper terms the "firing rate", so that mistakes among the model's highest-scoring predictions carry more weight. The sketch below checks these probabilistic readings numerically.
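
As an illustration rather than the paper's derivation (synthetic Gaussian scores under heavy imbalance are assumed), the snippet below recovers AUROC as the probability that a randomly drawn positive outscores a randomly drawn negative, and AUPRC (computed as average precision) as the expectation, over positive samples, of the precision at that sample's score threshold; the denominator of each precision term, P(score >= threshold), is the model's firing rate at that threshold.

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(0)
    n, prevalence = 20_000, 0.02                   # heavy class imbalance (assumed)
    y = (rng.random(n) < prevalence).astype(int)
    scores = rng.normal(loc=1.5 * y, scale=1.0)    # positives score higher on average

    pos, neg = scores[y == 1], scores[y == 0]

    # AUROC as a probability: P(score of a random positive > score of a random negative),
    # computed here over every positive-negative pair.
    auroc_pairwise = (pos[:, None] > neg[None, :]).mean()

    # AUPRC (average precision) as an expectation over positives of the precision at
    # that positive's score threshold; each precision divides by P(score >= threshold),
    # i.e., the "firing rate" at that threshold.
    auprc_expectation = np.mean([y[scores >= s].mean() for s in pos])

    print(f"AUROC: {roc_auc_score(y, scores):.4f}  pairwise estimate: {auroc_pairwise:.4f}")
    print(f"AUPRC: {average_precision_score(y, scores):.4f}  precision expectation: {auprc_expectation:.4f}")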

Implications for Metric Choice

The study further examines the consequences of selecting models by AUROC versus AUPRC. The results are telling: optimizing for AUROC weights improvements equally across the model's output distribution and therefore does not favor any particular subset of samples. By contrast, AUPRC inherently prioritizes reducing mistakes among high-scoring samples, which can skew model improvements toward subpopulations with more frequent positive labels. Such bias can misguide model selection and may also introduce fairness issues in critical domains like healthcare, where equitable treatment of diverse patient populations is crucial; the simulation sketch below probes this effect.
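
One way to probe this effect concretely is a small simulation: pool two subgroups with very different positive-label prevalences, apply the same-sized score improvement to one subgroup at a time, and compare the resulting gains in pooled AUROC and AUPRC. The sketch below is an assumption-laden illustration (Gaussian scores, equal subgroup sizes, arbitrary effect sizes), not the paper's experimental setup, and the exact numbers depend on the assumed distributions.

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    def pooled_metrics(boost_a, boost_b, n=50_000, seed=2):
        """Pool two equally sized subgroups; subgroup A has far more frequent positives."""
        rng = np.random.default_rng(seed)              # fixed seed -> paired comparison
        y_a = (rng.random(n) < 0.20).astype(int)       # subgroup A: 20% positives (assumed)
        y_b = (rng.random(n) < 0.02).astype(int)       # subgroup B:  2% positives (assumed)
        s_a = rng.normal((1.0 + boost_a) * y_a, 1.0)   # 'boost' widens the score separation
        s_b = rng.normal((1.0 + boost_b) * y_b, 1.0)
        y, s = np.concatenate([y_a, y_b]), np.concatenate([s_a, s_b])
        return roc_auc_score(y, s), average_precision_score(y, s)

    for name, metrics in [("baseline", pooled_metrics(0.0, 0.0)),
                          ("improve A (high prevalence)", pooled_metrics(0.5, 0.0)),
                          ("improve B (low prevalence)", pooled_metrics(0.0, 0.5))]:
        auroc, auprc = metrics
        print(f"{name:<28} AUROC={auroc:.3f}  AUPRC={auprc:.3f}")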

Literature Analysis and Community Reflection

Strikingly, an extensive literature review covering over 1.5 million papers reveals little empirical support for favoring AUPRC. The review uncovers a pattern of misattributed citations and unchallenged assertions, raising concerns about the rigor of such claims in the field's scientific discourse. The survey underscores the need for the machine learning community to critically reassess its metric preferences and ground them in empirical evidence rather than perpetuate unfounded assertions.

Moving Forward

Ultimately, this research serves as a bridge toward more evidence-based practice within machine learning. It prompts the scientific community to reconsider ingrained beliefs about evaluation metric preferences, taking into account not only their mathematical properties but also their implications for fairness and equity. The findings underscore the importance of selecting evaluation metrics that align with deployment goals and with the ethical considerations of diverse, dynamic real-world settings.
