Who Said What: Modeling Individual Labelers Improves Classification (1703.08774v2)

Published 26 Mar 2017 in cs.LG and cs.CV

Abstract: Data are often labeled by many different experts with each expert only labeling a small fraction of the data and each data point being labeled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make any use of potentially valuable information about which expert produced which label. To make use of this extra information, we propose modeling the experts individually and then learning averaging weights for combining them, possibly in sample-specific ways. This allows us to give more weight to more reliable experts and take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improvements in computer-aided diagnosis of diabetic retinopathy. We also show that our method performs better than competing algorithms by Welinder and Perona (2010), and by Mnih and Hinton (2012). Our work offers an innovative approach for dealing with the myriad real-world settings that use expert opinions to define labels for training.

Citations (218)

Summary

  • The paper derives mutual information metrics that quantify noise, showing 60,000 noisy labels equate to about 1,148 clean labels.
  • It demonstrates that adjusting for individual annotator reliability can enhance model performance in noisy labeling environments.
  • Extensive hyperparameter tuning and dataset analysis on MNIST and retinal images validate the proposed noise mitigation strategies.

An Analysis of Noise Mitigation in Machine Learning Models for MNIST and Medical Imaging

The paper addresses the problem of noise in labeled datasets, a significant issue for large training sets such as MNIST and for medical-imaging scenarios that rely on doctor annotations. The authors derive a quantitative measure of the mutual information (MI) between noisy labels and the underlying true labels, providing a theoretical framework for understanding how labeling noise affects model performance.

Mutual Information and Labeling Noise

The paper begins by establishing mutual information quantities for perfectly and noisily labeled data in an MNIST context. For a ten-class problem like MNIST, a perfect label carries approximately 2.3 nats of mutual information (ln 10 ≈ 2.303), while a label that is only 20% likely to be correct carries just 0.044 nats. This yields a direct exchange rate between noisy and clean labels: scaling by the MI ratio, 60,000 noisy labels are worth roughly 60,000 × 0.044 / 2.3 ≈ 1,148 clean labels, an estimate the authors corroborate empirically.
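The arithmetic behind this equivalence is simple enough to reproduce. Below is a minimal Python sketch, assuming (as the text implies) uniform class priors and symmetric noise; `noisy_label_mi` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def noisy_label_mi(num_classes: int, p_correct: float) -> float:
    """MI (in nats) between a noisy label and the true class under symmetric
    noise: the label is correct with probability p_correct and uniform over
    the remaining classes otherwise. Uniform class priors are assumed."""
    if p_correct >= 1.0:
        return np.log(num_classes)  # a clean label carries the full ln K nats
    h_cond = -(p_correct * np.log(p_correct)
               + (1 - p_correct) * np.log((1 - p_correct) / (num_classes - 1)))
    return np.log(num_classes) - h_cond

clean = noisy_label_mi(10, 1.0)   # ~2.303 nats (ln 10)
noisy = noisy_label_mi(10, 0.2)   # ~0.044 nats
print(60_000 * noisy / clean)     # ~1,156 at full precision; ~1,148 with the
                                  # rounded MI values quoted in the text
```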

Experimental Validation and Model Adjustments

Several ideas were assessed for mitigating noise-related performance degradation in classification tasks:

  • Mean Class Balancing: Weighting classes inversely to their prevalence was trialed but found to degrade performance, likely because it imposed incorrect assumptions about the class distribution of the test data.
  • Alternative Target Distributions: The training process used a target distribution informed by doctor annotations; alternatives such as averaging the predictions of per-doctor models yielded inferior outcomes, suggesting that consensus-based labels were not being exploited effectively.
  • Symmetric Noise Modeling: A method predicated on a symmetric noise model was examined. Despite making fewer assumptions about class-distribution variance, it performed poorly compared to existing methods. However, tailoring the noise parameter to individual doctors' reliability opens a new avenue for personalized adjustments in multi-annotator environments (see the sketch after this list).
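To make the symmetric noise model with per-annotator reliability concrete, here is a minimal numpy sketch under stated assumptions: `eps_a` is a per-annotator error rate (the "tailored" noise parameter mentioned above), `clean_probs` is a model's predicted distribution over true classes, and none of these names come from the paper.

```python
import numpy as np

def noise_matrix(num_classes: int, eps: float) -> np.ndarray:
    """Symmetric noise channel: P(observed = j | true = i) is 1 - eps on the
    diagonal and eps / (K - 1) everywhere else."""
    m = np.full((num_classes, num_classes), eps / (num_classes - 1))
    np.fill_diagonal(m, 1.0 - eps)
    return m

def annotator_nll(clean_probs: np.ndarray, label: int, eps_a: float) -> float:
    """Score annotator a's observed label by pushing the model's clean-class
    distribution through that annotator's personal noise channel."""
    observed_probs = clean_probs @ noise_matrix(clean_probs.shape[-1], eps_a)
    return -np.log(observed_probs[label])

# A reliable annotator (eps = 0.05) penalizes disagreement with the model
# more than a noisy one (eps = 0.4), so their labels carry more weight.
probs = np.array([0.7, 0.2, 0.1])
print(annotator_nll(probs, label=1, eps_a=0.05))  # higher loss: trusted annotator disagrees
print(annotator_nll(probs, label=1, eps_a=0.40))  # lower loss: noisy annotator disagrees
```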

Hyperparameter Tuning and Dataset Analysis

A comprehensive hyperparameter search is detailed, covering learning rates, dropout levels, and weight decay across several model architectures (BN, DN, WDN, and BIWDN). These hyperparameters were tuned via grid search, targeting computer-aided diagnosis tasks; a sketch of the procedure follows below.
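As an illustration of such a search (not the paper's actual ranges or code; `train_and_eval` is a hypothetical training hook and the values below are placeholders), a grid search over these three hyperparameters could look like:

```python
from itertools import product

# Placeholder ranges for illustration; the paper's tuned values differ.
GRID = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "dropout":       [0.1, 0.2, 0.5],
    "weight_decay":  [1e-5, 4e-5, 1e-4],
}

def grid_search(train_and_eval, arch: str):
    """Evaluate every hyperparameter combination for one architecture
    (e.g. "BN", "DN", "WDN", "BIWDN") and keep the best validation score."""
    best_cfg, best_score = None, float("-inf")
    keys = list(GRID)
    for values in product(*(GRID[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_eval(arch, **cfg)  # hypothetical: returns a validation metric
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```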

The validation and test data consist predominantly of retinal images drawn from pre-existing sources such as EyePACS-1 and Messidor-2, supplemented with newly acquired images for a more robust assessment of model efficacy in health-diagnostic contexts.

Implications and Future Directions

The exploration of MI degradation due to noise elucidates a key challenge in employing machine learning for sensitive applications, such as healthcare diagnostics, where the cost of erroneous predictions can be high. By providing empirical measurements and solutions for correcting this degradation, the authors lay groundwork for further studies into noise-resistant models.

Future work could delve into more granular noise models customized to specific annotative behaviors or enhance MI calculations in multi-class settings. Additionally, adapting these principles to other datasets and domains will test the robustness and generality of the insights gained. These considerations may lead to both theoretical advancements and practical improvements in reliability across fields heavily reliant on annotated data.