
Iterative Learning with Open-set Noisy Labels (1804.00092v1)

Published 31 Mar 2018 in cs.CV

Abstract: Large-scale datasets possessing clean label annotations are crucial for training Convolutional Neural Networks (CNNs). However, labeling large-scale data can be very costly and error-prone, and even high-quality datasets are likely to contain noisy (incorrect) labels. Existing works usually employ a closed-set assumption, whereby the samples associated with noisy labels possess a true class contained within the set of known classes in the training data. However, such an assumption is too restrictive for many applications, since samples associated with noisy labels might in fact possess a true class that is not present in the training data. We refer to this more complex scenario as the **open-set noisy label** problem and show that making accurate predictions in this setting is nontrivial. To address this problem, we propose a novel iterative learning framework for training CNNs on datasets with open-set noisy labels. Our approach detects noisy labels and learns deep discriminative features in an iterative fashion. To benefit from the noisy label detection, we design a Siamese network to encourage clean labels and noisy labels to be dissimilar. A reweighting module is also applied to simultaneously emphasize the learning from clean labels and reduce the effect caused by noisy labels. Experiments on CIFAR-10, ImageNet and real-world noisy (web-search) datasets demonstrate that our proposed model can robustly train CNNs in the presence of a high proportion of open-set as well as closed-set noisy labels.

Citations (317)

Summary

  • The paper proposes an iterative framework that refines noisy label detection and improves CNN training in the presence of open-set and closed-set noise.
  • It employs a probabilistically extended Local Outlier Factor with a Siamese network to distinguish clean samples from noisy ones effectively.
  • Experiments on datasets like CIFAR-10 and ImageNet show superior robustness and accuracy, validating the approach on both controlled and real-world noisy data.

Iterative Learning with Open-set Noisy Labels

The paper "Iterative Learning with Open-set Noisy Labels" addresses a critical challenge in training Convolutional Neural Networks (CNNs): the pervasive issue of noisy labels, specifically in the open-set setting. In conventional closed-set scenarios, label noise is assumed to be confined to the known set of classes. The open-set situation, however, adds a layer of complexity: the true class of a mislabeled sample is absent from the set of known classes. This setting is highly relevant in real-world applications, such as those leveraging web-sourced datasets, which inherently contain a mix of in-distribution (closed-set) and out-of-distribution (open-set) noise.

To tackle this, the authors propose a novel iterative framework designed to train CNNs effectively even in the presence of a significant proportion of noisy labels, both open- and closed-set. The framework integrates three core components: iterative noisy label detection using a probabilistic extension of the Local Outlier Factor (LOF), discriminative feature learning via a Siamese network, and a reweighting mechanism that adapts the softmax loss to limit the influence of likely-noisy samples.
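As a rough illustration of the detection component, a per-class outlier score can be computed with a standard LOF implementation. This is a simplified sketch, not the paper's exact pcLOF (which extends LOF probabilistically and accumulates scores across training iterations); the function name and scoring scheme below are illustrative assumptions:

```python
# Sketch only: standard per-class LOF scoring as a stand-in for the
# paper's probabilistic, iteratively accumulated pcLOF.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def noisy_label_scores(features, labels, n_neighbors=5):
    """Score each sample's outlierness within its assigned class.

    Higher scores suggest the sample's features are inconsistent with
    its labeled class, i.e. a likely (open- or closed-set) noisy label.
    """
    scores = np.zeros(len(features))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # Clamp n_neighbors for small classes (must be < class size).
        lof = LocalOutlierFactor(n_neighbors=min(n_neighbors, len(idx) - 1))
        lof.fit(features[idx])
        # negative_outlier_factor_ is ~ -1 for inliers and more negative
        # for outliers, so its negation is a positive outlierness score.
        scores[idx] = -lof.negative_outlier_factor_
    return scores
```

In the paper's framework these scores would be recomputed each iteration on progressively more discriminative features, so early detection mistakes can be corrected later.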

The iterative detection of noisy labels relies on the probabilistically cumulative Local Outlier Factor (pcLOF). This approach detects samples based on their representational inconsistency within their assigned class and iteratively refines this detection, leveraging more discriminative features produced in each learning iteration. Utilizing a Siamese network with contrastive loss, the framework optimizes the feature learning by ensuring that representations of clean samples remain distinct from those of noisy samples. The reweighting strategy further fine-tunes the model training by assigning appropriate weights to samples based on their noise likelihood, allowing the model to focus on clean data without entirely discarding potentially valuable information from noisy samples.
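The contrastive and reweighting ideas above can be sketched in a few lines of numpy. This is a minimal sketch under stated assumptions: the margin value and the mapping from noise score to sample weight are illustrative choices, not the paper's exact formulation:

```python
# Minimal numpy sketch of the two training objectives: a Siamese
# contrastive loss and a sample-reweighted softmax (cross-entropy) loss.
import numpy as np

def contrastive_loss(f1, f2, similar, margin=1.0):
    """Pull together pairs marked similar (e.g. two clean samples of the
    same class); push clean/noisy pairs at least `margin` apart."""
    d = np.linalg.norm(f1 - f2)
    if similar:
        return float(d ** 2)
    return float(max(0.0, margin - d) ** 2)

def reweighted_softmax_loss(logits, labels, noise_scores):
    """Cross-entropy where likely-noisy samples (high score) get low
    weight; the 1/(1+score) mapping is an illustrative assumption."""
    weights = 1.0 / (1.0 + noise_scores)
    z = logits - logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels]
    return float((weights * ce).sum() / weights.sum())
```

Down-weighting rather than discarding suspected-noisy samples matches the paper's motivation: high-scoring samples still contribute a little signal, which guards against irrecoverably dropping mislabeled-but-informative data early on.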

The evaluations on CIFAR-10, ImageNet, and a web-search dataset demonstrate the robustness of the proposed framework. On CIFAR-10 with 40% open-set noise, the model outperforms state-of-the-art methods, achieving superior classification accuracy through its strategic handling of noise. Similarly, on ImageNet, the framework maintains competitive results across network architectures such as ResNet-50 and Inception-v3, indicating its applicability to large-scale datasets. Real-world web data evaluation further underscores its practical value, highlighting its capability to leverage webly supervised data for enhanced CNN training.

This research makes significant contributions to noisy label learning by expanding the paradigm to encompass open-set conditions and by underscoring the value of iterative detection, discriminative feature learning, and reweighting in handling heterogeneous noise. The results not only demonstrate the efficacy of the proposed approach but also lay a foundation for future explorations in robust representation learning from ubiquitously noisy datasets. Future work could extend this methodology by incorporating advanced feature extraction techniques and exploring additional adaptive noise-handling mechanisms, further pushing the boundaries of learning from imperfect data.