DivideMix: Learning with Noisy Labels as Semi-supervised Learning (2002.07394v1)

Published 18 Feb 2020 in cs.CV

Abstract: Deep neural networks are known to be annotation-hungry. Numerous efforts have been devoted to reducing the annotation cost when learning with deep networks. Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data. In this work, we propose DivideMix, a novel framework for learning with noisy labels by leveraging semi-supervised learning techniques. In particular, DivideMix models the per-sample loss distribution with a mixture model to dynamically divide the training data into a labeled set with clean samples and an unlabeled set with noisy samples, and trains the model on both the labeled and unlabeled data in a semi-supervised manner. To avoid confirmation bias, we simultaneously train two diverged networks where each network uses the dataset division from the other network. During the semi-supervised training phase, we improve the MixMatch strategy by performing label co-refinement and label co-guessing on labeled and unlabeled samples, respectively. Experiments on multiple benchmark datasets demonstrate substantial improvements over state-of-the-art methods. Code is available at https://github.com/LiJunnan1992/DivideMix .

Authors (3)
  1. Junnan Li (56 papers)
  2. Richard Socher (115 papers)
  3. Steven C. H. Hoi (94 papers)
Citations (927)

Summary

  • The paper introduces a co-divide mechanism using dual networks and Gaussian Mixture Models to separate clean from noisy samples.
  • It refines labels through an enhanced MixMatch strategy that incorporates both co-refinement and co-guessing to reduce overfitting.
  • Empirical tests on CIFAR-10, CIFAR-100, and other datasets demonstrate significant accuracy gains in extreme noise conditions.

Learning with Noisy Labels as Semi-supervised Learning: An Overview of DivideMix

The success of deep neural networks (DNNs) has rested largely on the availability of large datasets with accurate labels. However, such datasets are costly and time-consuming to create, so learning from datasets with noisy labels has become an increasingly relevant challenge. This paper proposes DivideMix, a framework that merges ideas from noisy-label learning and semi-supervised learning (SSL) to improve DNN training outcomes under label noise.

Key Contributions

  1. Co-divide Mechanism: DivideMix introduces a co-divide mechanism for separating clean from noisy samples. Two networks are trained simultaneously, each fitting a Gaussian Mixture Model (GMM) to its per-sample loss distribution so that low-loss samples form a labeled (clean) set and high-loss samples form an unlabeled (noisy) set. The division produced by one network is used to train the other, which mitigates progressive error accumulation, i.e., confirmation bias (see the sketch after this list).
  2. Enhanced MixMatch Strategy: Building upon MixMatch, DivideMix employs label co-refinement for labeled samples and label co-guessing for unlabeled samples. Given labels are refined using the model's own predictions, while the ensemble of both networks' predictions is used to guess labels for the likely-noisy samples. This dual label-refining mechanism provides greater resistance against overfitting to noisy labels.
  3. Strong Empirical Performance: Experimentation across multiple datasets, including CIFAR-10, CIFAR-100, Clothing1M, and WebVision, reveals that DivideMix consistently outperforms state-of-the-art methods. Detailed experiments demonstrate significant accuracy improvements, especially under high noise conditions. For example, on CIFAR-100 with 90% symmetric noise, DivideMix achieves an impressive test accuracy of 31.5%, compared to substantially lower accuracies from other methods.
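
To make the co-divide step concrete (referenced in the list above), the sketch below shows one plausible way to implement it; it is not the authors' released code. It assumes the per-sample training losses of both networks are already available as arrays, fits a two-component Gaussian Mixture Model with scikit-learn, takes the posterior of the lower-mean component as the clean probability, and hands each network the division produced by the other. The function names, the 0.5 threshold, and the synthetic losses in the usage example are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_probability(per_sample_loss):
    """Fit a two-component GMM to per-sample losses and return P(clean),
    i.e. the posterior of the lower-mean component (clean samples tend to
    have smaller loss)."""
    loss = np.asarray(per_sample_loss, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, reg_covar=5e-4, max_iter=100).fit(loss)
    clean_component = int(np.argmin(gmm.means_.ravel()))
    return gmm.predict_proba(loss)[:, clean_component]

def co_divide(losses_net_a, losses_net_b, threshold=0.5):
    """Each network's division is consumed by the *other* network: network A
    trains on the labeled/unlabeled split produced by network B, and vice
    versa, which limits self-reinforcing errors (confirmation bias)."""
    prob_a = clean_probability(losses_net_a)
    prob_b = clean_probability(losses_net_b)
    labeled_for_a = prob_b >= threshold   # split from B, used to train A
    labeled_for_b = prob_a >= threshold   # split from A, used to train B
    return (labeled_for_a, prob_b), (labeled_for_b, prob_a)

# Toy usage with synthetic losses: a low-loss (clean-looking) cluster and a
# high-loss (noisy-looking) cluster, slightly different for the two networks.
rng = np.random.default_rng(0)
base = np.concatenate([rng.normal(0.2, 0.05, 800), rng.normal(2.5, 0.5, 200)])
(mask_a, _), (mask_b, _) = co_divide(base, base + rng.normal(0, 0.05, base.size))
print("labeled-set size for network A:", int(mask_a.sum()))
```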

Methodology

Co-divide by Loss Modeling

Deep networks tend to learn clean samples faster, so clean samples accumulate lower loss values early in training. Leveraging this observation, DivideMix fits a two-component GMM to the per-sample loss distribution and treats the posterior probability of the lower-mean component as each sample's probability of being clean. Unlike prior methods that fit a Beta Mixture Model, which tends to produce undesirably flat distributions and degrades under asymmetric noise, the GMM separates the two loss modes more flexibly. A confidence penalty during the initial warm-up phase discourages over-confident predictions, keeping the loss distribution easier for the GMM to separate into clean and noisy samples.
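
The warm-up confidence penalty can be expressed as a negative-entropy term added to the cross-entropy loss. The snippet below is a minimal sketch of that reading, not the released implementation; the weighting factor beta and the function name warmup_loss are assumptions.

```python
import torch
import torch.nn.functional as F

def warmup_loss(logits, targets, beta=1.0):
    """Cross-entropy plus a confidence penalty: the negative entropy of the
    predicted distribution is added to the loss, discouraging over-confident
    predictions during warm-up. beta is an illustrative weighting assumption."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1).mean()
    return ce + beta * (-entropy)   # i.e. ce - beta * entropy

# Toy usage on random logits and labels for a 10-class problem.
logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
print(warmup_loss(logits, targets))
```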

MixMatch Integration with Label Refining

DivideMix improves upon the MixMatch SSL technique by incorporating co-network guidance:

  • Label Co-refinement: Blends each given (possibly noisy) label with the network's averaged prediction, weighted by the clean probability produced by the other network.
  • Label Co-guessing: Averages the predictions of both networks to produce reliable label guesses for unlabeled samples.

These modifications ensure that both networks mutually benefit from refined and guessed labels, facilitating more robust learning; a minimal sketch of both steps follows.
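
The sketch below illustrates the co-refinement and co-guessing computations referenced above; it is an illustrative reading, not the authors' code. It assumes the clean probabilities (from the other network's division) and the predictions averaged over augmentations are already computed; the temperature of 0.5 follows common MixMatch practice, and the function names are assumptions. In the full method, the resulting targets then feed into MixUp-based MixMatch training.

```python
import torch

def sharpen(p, temperature=0.5):
    """Temperature sharpening as in MixMatch: raise to 1/T and renormalize."""
    p = p.pow(1.0 / temperature)
    return p / p.sum(dim=1, keepdim=True)

def co_refine(one_hot_labels, clean_prob, avg_pred, temperature=0.5):
    """Label co-refinement for the labeled (clean) set: blend the given label
    with the network's averaged prediction, weighted by the clean probability
    produced by the other network, then sharpen."""
    w = clean_prob.unsqueeze(1)
    refined = w * one_hot_labels + (1.0 - w) * avg_pred
    return sharpen(refined, temperature)

def co_guess(avg_pred_net_a, avg_pred_net_b, temperature=0.5):
    """Label co-guessing for the unlabeled (noisy) set: average the
    predictions of both networks, then sharpen."""
    guessed = 0.5 * (avg_pred_net_a + avg_pred_net_b)
    return sharpen(guessed, temperature)

# Toy usage with random probabilities for a 10-class problem.
torch.manual_seed(0)
labels = torch.eye(10)[torch.randint(0, 10, (4,))]        # one-hot labels
w = torch.rand(4)                                          # clean probabilities
pred_a = torch.softmax(torch.randn(4, 10), dim=1)          # averaged predictions, net A
pred_b = torch.softmax(torch.randn(4, 10), dim=1)          # averaged predictions, net B
print(co_refine(labels, w, pred_a).sum(dim=1))             # each row sums to 1
print(co_guess(pred_a, pred_b).sum(dim=1))
```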

Implications for Future Research

The DivideMix framework's confluence of noisy label learning and SSL opens a new trajectory for reducing annotation costs while maintaining model integrity. Extensions of this work might explore further enhancements in SSL techniques or adaptation to domains such as NLP. Additionally, investigating more complex mixture model configurations could provide deeper insights into the dynamics of noisy label distributions.

Conclusion

DivideMix represents a significant advancement in the domain of learning with noisy labels by leveraging semi-supervised learning techniques. Through a meticulous combination of dataset co-division, label refinement, and ensemble augmentation, it sets a new benchmark for dealing with high levels of label noise. The demonstrated efficacy across diverse datasets suggests promising avenues for future development and application in varied machine learning tasks. As AI continues to evolve, methods like DivideMix will be crucial in making robust learning feasible even when high-quality datasets are scarce or expensive to obtain.