
Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise (1802.05300v4)

Published 14 Feb 2018 in cs.LG, cs.CL, cs.CV, and cs.NE

Abstract: The growing importance of massive datasets used for deep learning makes robustness to label noise a critical property for classifiers to have. Sources of label noise include automatic labeling, non-expert labeling, and label corruption by data poisoning adversaries. Numerous previous works assume that no source of labels can be trusted. We relax this assumption and assume that a small subset of the training data is trusted. This enables substantial gains in robustness to label corruption; in particular, even severe label noise can be combated given a small set of trusted data with clean labels. We utilize trusted data by proposing a loss correction technique that uses trusted examples in a data-efficient manner to mitigate the effects of label noise on deep neural network classifiers. Across vision and natural language processing tasks, we experiment with various label noise types at several strengths, and show that our method significantly outperforms existing methods.

Authors (4)
  1. Dan Hendrycks (64 papers)
  2. Mantas Mazeika (27 papers)
  3. Duncan Wilson (3 papers)
  4. Kevin Gimpel (72 papers)
Citations (524)

Summary

  • The paper introduces the Gold Loss Correction (GLC) technique that leverages a small trusted dataset to mitigate severe label noise in deep networks.
  • Experiments on benchmarks like CIFAR-10 and MNIST show that GLC significantly reduces error rates compared to existing correction methods.
  • The study demonstrates that integrating trusted data enhances data efficiency and model robustness, making it highly practical for noisy real-world applications.

Essay on "Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise"

The paper "Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise," authored by Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel, addresses the critical issue of label noise in supervised learning, particularly in the context of deep neural networks. This exploration is grounded on the assumption that real-world datasets are often susceptible to label noise due to sources like automatic labeling, non-expert labeling, and adversarial data poisoning.

Core Contributions

The authors diverge from conventional approaches by incorporating the notion of a trusted subset of the training data, which comprises examples with reliable labels. This paper proposes the Gold Loss Correction (GLC) technique, leveraging these trusted data points to enhance classifier robustness against label noise.

  1. Loss Correction Technique: The paper introduces a loss correction mechanism that uses trusted samples to model and then counteract the label noise process; a minimal sketch of this procedure appears after this list. The authors demonstrate that the GLC significantly outperforms existing methods, especially in scenarios marked by severe label corruption.
  2. Empirical Validation: Extensive experiments on vision and natural language processing datasets, including MNIST, CIFAR, and IMDB, showcase the efficacy of the proposed method. The results highlight GLC's superiority in reducing error rates across a spectrum of noise levels and types.
  3. Data Efficiency: A salient feature of the GLC is its data efficiency, requiring only a minimal fraction of trusted data to achieve notable improvements in robustness.
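
To make the procedure concrete, here is a minimal sketch of the GLC in NumPy/PyTorch. This is an illustration written for this summary, not the authors' reference code: the helper names (estimate_corruption_matrix, glc_loss) are hypothetical, and the paper's exact normalization and training schedule may differ. The idea is to (a) train a model on the noisy labels, (b) estimate a corruption matrix C_hat, with C_hat[i, j] ≈ p(noisy label = j | true label = i), by averaging that model's predictions over trusted examples of each true class, and (c) train a fresh classifier whose predictions are passed through C_hat on untrusted examples.

```python
import numpy as np
import torch
import torch.nn.functional as F


def estimate_corruption_matrix(noisy_model_probs, trusted_labels, num_classes):
    """Estimate C_hat[i, j] ~= p(noisy label = j | true label = i).

    noisy_model_probs: (n_trusted, K) softmax outputs, on the trusted
        examples, of a model that was first trained on the noisy labels.
    trusted_labels: (n_trusted,) clean labels of those trusted examples.
    """
    C_hat = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        mask = trusted_labels == i
        if mask.any():
            # Average the noisy-label predictive distribution over the
            # trusted examples whose true label is i.
            C_hat[i] = noisy_model_probs[mask].mean(axis=0)
        else:
            # No trusted example of class i: fall back to an identity row.
            C_hat[i, i] = 1.0
    return C_hat


def glc_loss(logits, labels, is_trusted, C_hat):
    """Corrected objective: on untrusted data, fit C_hat^T p(y|x) to the
    possibly-corrupted label; on trusted data, use plain cross-entropy."""
    probs = F.softmax(logits, dim=1)
    # Entry [b, j] of (probs @ C_hat) approximates p(noisy label = j | x_b).
    corrected = probs @ torch.as_tensor(C_hat, dtype=probs.dtype)
    loss_untrusted = F.nll_loss(
        torch.log(corrected.clamp_min(1e-12)), labels, reduction="none"
    )
    loss_trusted = F.cross_entropy(logits, labels, reduction="none")
    return torch.where(is_trusted, loss_trusted, loss_untrusted).mean()
```

In the paper's setup, the matrix-estimation model is first trained on all of the noisy data; the final classifier is then trained on trusted and untrusted data jointly with a corrected loss of this form.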

Numerical Results & Discussion

The GLC demonstrated lower error rates than existing methods such as Forward Loss Correction and Distillation, especially on CIFAR-10 and CIFAR-100. For instance, on CIFAR-10 with 10% trusted data, the GLC achieved an error rate of 6.9%, compared to 22.7% for the Forward correction and 18.3% for Distillation.

The paper further investigates GLC's performance under various label corruption scenarios, such as uniform, flip, and hierarchical noise, illustrating the robustness of the approach. Notably, the GLC outperformed rival methods even with access to only a small number of gold-standard labels, reinforcing its practical utility in real-world applications.
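
For concreteness, these corruption regimes can be described by row-stochastic matrices C, where C[i, j] is the probability that a true label i is recorded as j. The sketch below shows one plausible parameterization of the uniform and flip settings; the paper's exact strengths and flip targets may differ, and hierarchical noise additionally restricts flips to semantically related classes (e.g., within CIFAR-100 superclasses).

```python
import numpy as np


def uniform_corruption_matrix(num_classes, strength):
    """With probability `strength`, resample the label uniformly over all
    classes; otherwise keep it. Rows sum to 1 by construction."""
    C = np.full((num_classes, num_classes), strength / num_classes)
    C += (1.0 - strength) * np.eye(num_classes)
    return C


def flip_corruption_matrix(num_classes, strength):
    """With probability `strength`, flip each class to one fixed other
    class (here the next class index, chosen purely for illustration)."""
    C = (1.0 - strength) * np.eye(num_classes)
    for i in range(num_classes):
        C[i, (i + 1) % num_classes] += strength
    return C


def corrupt_labels(labels, C, seed=0):
    """Sample each noisy label from the matrix row of its true label."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(C), p=C[y]) for y in labels])
```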

Implications and Future Directions

The reported results underscore the effectiveness of integrating a small set of trusted data to reinforce learning under noisy conditions. The findings advocate adopting this paradigm in settings where label noise is prevalent but procuring extensive clean labels is infeasible.

In theoretical terms, the work prompts further inquiry into the robustness of neural networks under adversarial label noise and the optimization landscapes shaped by semi-verified data.

Conclusion

"Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise" presents a significant advancement in enhancing the robustness of deep learning models against label noise. By artfully blending trusted and potentially corrupted data, the authors have provided a viable, data-efficient solution with broad applicability in machine learning systems relying on large, real-world datasets. The research invites further exploration of loss correction techniques and their integration with various model architectures and training paradigms.
