ProMix: Combating Label Noise via Maximizing Clean Sample Utility (2207.10276v4)

Published 21 Jul 2022 in cs.LG

Abstract: Learning with Noisy Labels (LNL) has become an appealing topic, as imperfectly annotated data are relatively cheaper to obtain. Recent state-of-the-art approaches employ specific selection mechanisms to separate clean and noisy samples and then apply Semi-Supervised Learning (SSL) techniques for improved performance. However, the selection step mostly provides a medium-sized and decent-enough clean subset, which overlooks a rich set of clean samples. To fulfill this, we propose a novel LNL framework ProMix that attempts to maximize the utility of clean samples for boosted performance. Key to our method, we propose a matched high confidence selection technique that selects those examples with high confidence scores and matched predictions with given labels to dynamically expand a base clean sample set. To overcome the potential side effect of excessive clean set selection procedure, we further devise a novel SSL framework that is able to train balanced and unbiased classifiers on the separated clean and noisy samples. Extensive experiments demonstrate that ProMix significantly advances the current state-of-the-art results on multiple benchmarks with different types and levels of noise. It achieves an average improvement of 2.48\% on the CIFAR-N dataset. The code is available at https://github.com/Justherozen/ProMix

Citations (42)

View on Semantic Scholar

Summary

The paper introduces ProMix, a novel framework that combats label noise by maximizing the utility of clean samples, shifting focus from merely filtering noisy data.
ProMix employs a Matched High Confidence Selection (MHCS) strategy to dynamically identify and utilize clean samples while addressing biases during semi-supervised training.
Evaluations show ProMix surpasses state-of-the-art methods on various datasets, achieving significant performance gains, such as a 2.48% average improvement on CIFAR-N.

ProMix: Combating Label Noise via Maximizing Clean Sample Utility

The paper "ProMix: Combating Label Noise via Maximizing Clean Sample Utility" presents a sophisticated framework for addressing the challenges posed by learning from datasets that contain noisy labels. This is a significant issue in machine learning, as acquiring massive amounts of accurately labeled data is often prohibitively expensive and time-consuming. Leveraging imperfectly annotated data emerges as a cost-effective alternative, but it brings the challenge of label noise, which can deteriorate model performance.

Key Contributions

ProMix introduces a novel Learning with Noisy Labels (LNL) framework that focuses on maximizing the utility of clean samples, which is an enhancement over previous methods that aim to filter out noisy samples and treat them as unlabeled data for semi-supervised learning (SSL). The main components of ProMix are:

Matched High Confidence Selection (MHCS): This selection strategy is pivotal, as it selects samples with high prediction confidence that match their given labels, thereby dynamically expanding the base clean sample set. This technique seeks a balance between quality and quantity, ensuring that more clean samples are utilized without compromising precision.
Debiased Semi-Supervised Training: To address biases inherent in the selection and pseudo-labeling processes, ProMix employs a debiased training strategy featuring an Auxiliary Pseudo Head (APH) and a Debiased Margin-based Loss (DML). These components act against confirmation bias and distribution bias, respectively, improving the resilience and robustness of learned models against noisy labels.
Label Guessing by Agreement (LGA): To further refine the selection process, LGA corrects mislabeled data by employing dual peer networks to agree on predicted labels with high confidence, thus progressively cleaning the dataset.

Experimental Results

The empirical evaluation of ProMix illustrates its ability to outperform current state-of-the-art methods across various benchmarks, including CIFAR-10/100, CIFAR-N, Clothing1M, and ANIMAL-10N. Specifically, ProMix achieves an average improvement of 2.48% on the CIFAR-N dataset. These results demonstrate the effectiveness of its components in leveraging clean samples and overcoming the pitfalls of label noise.

Implications

The work presented in this paper has significant implications for both theoretical understanding and practical applications of learning from noisy data:

Theoretical: ProMix demonstrates a novel approach toward exploiting clean samples amidst predominantly noisy data. Its strategy for handling biases during selection and training could be recontextualized and applied within broader machine learning frameworks that deal with imperfect data.
Practical: By improving upon existing methods, ProMix offers a more robust approach to training models in real-world scenarios where labels might be inherently noisy due to crowdsourcing or automated label generation processes.

Speculations on Future Developments

The methodology proposed by ProMix opens avenues for further research into enhancing SSL methods by better utilization of clean samples. Future developments may include:

Integrating advanced confidence-measuring techniques to refine MHCS further.
Exploring different architectures and strategies for APH to reduce confirmation bias more effectively.
Investigating the application of ProMix-like frameworks to other types of data imperfections, such as biased or incomplete data.

ProMix marks a substantial advancement in the domain of learning with noisy labels by emphasizing the utility of clean samples, laying the groundwork for future innovations in managing noisy datasets.