- The paper introduces a new scalable benchmark using retinal images to evaluate Bayesian deep learning in diabetic retinopathy tasks.
- It systematically compares methods like MC Dropout, Variational Inference, and Deep Ensembles, showing ensemble MC Dropout provides superior uncertainty estimates.
- The study emphasizes real-world applicability by testing robustness under distribution shift, using metrics such as diagnostic accuracy and area under the ROC curve evaluated at varying referral rates.
Overview of Bayesian Deep Learning Robustness in Diabetic Retinopathy Tasks
The paper "A Systematic Comparison of Bayesian Deep Learning Robustness in Diabetic Retinopathy Tasks" addresses the evaluation of Bayesian Deep Learning (BDL) methods, particularly their robustness and scalability in medical imaging tasks. It critiques existing evaluation benchmarks, such as the UCI datasets, highlighting their lack of scalability and applicability to real-world tasks, specifically in high-stakes fields like medical diagnostics.
Key Contributions
The authors introduce a new BDL benchmark inspired by a real-world application—diagnosing diabetic retinopathy from retinal images. This benchmark includes:
- Visual inputs of 512×512 RGB retina images, emphasizing uncertainty estimation for medical pre-screening, i.e., referring patients to an expert if the model's diagnosis is uncertain.
- Diverse tasks that assess out-of-distribution detection and robustness to distribution shifts, offering a realistic challenge to current BDL methods.
- A systematic comparison of well-tuned BDL techniques on these tasks, revealing that methods excelling on simpler datasets like UCI often underperform here.
Metrics grounded in clinical practice are used to rank the methods, and the implementation, complete with a simple API, has been made available to facilitate the evaluation of new BDL tools.
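The pre-screening idea above can be sketched in a few lines: score each image by its predictive uncertainty, refer the most uncertain fraction to an expert, and evaluate the model only on the cases it keeps. This is a minimal numpy illustration of that protocol, not the paper's implementation; the binary-entropy score and the toy data are assumptions for demonstration.

```python
import numpy as np

def accuracy_at_referral(probs, labels, referral_rate):
    """Accuracy on the cases retained after referring the most
    uncertain fraction of patients to an expert.

    probs: predicted probability of sight-threatening DR, shape (N,)
    labels: binary ground truth, shape (N,)
    referral_rate: fraction of cases sent to the expert (0..1)
    """
    eps = 1e-12
    # Binary predictive entropy as the uncertainty score.
    entropy = -(probs * np.log(probs + eps)
                + (1 - probs) * np.log(1 - probs + eps))
    # Refer the most uncertain cases; the model handles the rest.
    n_refer = int(round(referral_rate * len(probs)))
    keep = np.argsort(entropy)[: len(probs) - n_refer]
    preds = (probs[keep] >= 0.5).astype(int)
    return (preds == labels[keep]).mean()

# Toy check: confident predictions are correct, uncertain ones wrong,
# so accuracy should rise as more uncertain cases are referred.
probs = np.array([0.95, 0.05, 0.9, 0.55, 0.45])
labels = np.array([1, 0, 1, 0, 1])
print(accuracy_at_referral(probs, labels, 0.0))  # → 0.6 (all cases kept)
print(accuracy_at_referral(probs, labels, 0.4))  # → 1.0 (two most uncertain referred)
```

A model with well-calibrated uncertainty should show accuracy increasing monotonically with the referral rate, which is exactly the behavior the benchmark's referral curves measure.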
Methodological Insights
Several BDL techniques are systematically compared, including MC Dropout, Mean-field Variational Inference, and Deep Ensembles. The paper employs:
- A large dataset from the Kaggle Diabetic Retinopathy Challenge, supplemented by test data from the APTOS 2019 Blindness Detection dataset to assess robustness to distribution shifts.
- A binary classification task, identifying sight-threatening diabetic retinopathy, framed so that predictive uncertainty directly drives referral decisions.
- Metrics such as diagnostic accuracy and area under the ROC curve, evaluated as functions of the referral rate, to simulate real-world expert referrals.
The paper's empirical results show that ensemble methods and MC Dropout provide more reliable uncertainty estimates than Mean-field Variational Inference, with ensemble MC Dropout consistently performing best.
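To make the comparison concrete, the two best-performing ideas can be sketched together: MC Dropout keeps dropout active at test time and averages several stochastic forward passes, while ensemble MC Dropout additionally averages over independently trained models. The sketch below uses a toy one-layer logistic model with random weights purely for illustration; the paper's actual networks, dropout rates, and pass counts are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, weights, n_passes=20, p_drop=0.5):
    """MC Dropout for a toy one-layer logistic model: dropout stays
    active at test time, and the sigmoid outputs of several stochastic
    forward passes are averaged; their spread signals uncertainty."""
    outs = []
    for _ in range(n_passes):
        mask = rng.random(weights.shape) >= p_drop   # fresh dropout mask per pass
        w = weights * mask / (1.0 - p_drop)          # inverted-dropout scaling
        outs.append(1.0 / (1.0 + np.exp(-(x @ w))))
    mean = np.array(outs).mean(axis=0)               # predictive probability
    eps = 1e-12
    entropy = -(mean * np.log(mean + eps)
                + (1 - mean) * np.log(1 - mean + eps))
    return mean, entropy

# Ensemble MC Dropout: average the MC estimates of several
# independently initialised models (random weight draws here stand
# in for independently trained networks).
x = rng.normal(size=(4, 8))                      # 4 inputs, 8 features
models = [rng.normal(size=8) for _ in range(3)]  # 3 toy "trained" models
means = np.array([mc_dropout_predict(x, w)[0] for w in models])
ensemble_prob = means.mean(axis=0)
print(ensemble_prob.shape)  # (4,)
```

In the benchmark, the entropy of the (ensemble-averaged) predictive probability is the score used to rank patients for referral, tying this estimator back to the referral-rate metrics above.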
Implications and Future Directions
This research highlights the inadequacies of prevalent benchmarks like UCI for evaluating BDL methodologies on large-scale, high-dimensional tasks. The new benchmark offers a comprehensive evaluation framework that is both scalable and indicative of real-world applicability. The findings suggest that reliance on UCI datasets could skew research priorities towards less scalable methods.
By introducing a benchmark aligned with real-world applications, the paper sets a precedent for evaluating BDL methods on metrics that matter in practical deployment, such as uncertainty estimation and robustness under distribution shifts. As a result, it could redirect focus towards developing computationally efficient and scalable BDL methods that align with the challenges posed by massive data and complex real-world scenarios.
Future research in AI and BDL can leverage this benchmark to refine methods that better handle high-dimensional datasets, improve uncertainty quantification, and ensure that deep learning models operate robustly in critical areas like healthcare. This could accelerate advancements in automated diagnostics and decision-making systems, fostering effective integration of BDL into these domains.