An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets

(arXiv:2312.02200)
Published Dec 2, 2023 in cs.CV, cs.AI, and stat.AP

Abstract

Major advancements in computer vision can primarily be attributed to the use of labeled datasets. However, acquiring labels for datasets often results in errors which can harm model performance. Recent works have proposed methods to automatically identify mislabeled images, but developing strategies to effectively implement them in real world datasets has been sparsely explored. Towards improved data-centric methods for cleaning real world vision datasets, we first conduct more than 200 experiments carefully benchmarking recently developed automated mislabel detection methods on multiple datasets under a variety of synthetic and real noise settings with varying noise levels. We compare these methods to a Simple and Efficient Mislabel Detector (SEMD) that we craft, and find that SEMD performs similarly to or outperforms prior mislabel detection approaches. We then apply SEMD to multiple real world computer vision datasets and test how dataset size, mislabel removal strategy, and mislabel removal amount further affect model performance after retraining on the cleaned data. With careful design of the approach, we find that mislabel removal leads to per-class performance improvements of up to 8% of a retrained classifier in smaller data regimes.

Overview

  • The study focuses on evaluating automated mislabel detection methods in real-world vision datasets, highlighting the problems caused by labeling errors.

  • A new mislabel detection method, SEMD, is introduced and shown to be fast and effective compared to existing techniques.

  • The effectiveness of SEMD is demonstrated on datasets such as CheXpert and METER-ML, with significant improvements in classification accuracy when mislabeled data is removed.

  • Different strategies for mislabel detection in multi-label tasks are explored, with combined approaches often yielding the best results.

  • The research provides a comprehensive analysis that will help improve data-cleaning methods, enhancing the reliability and performance of machine learning models.

In recent years, the field of computer vision has seen impressive advancements, largely attributed to the use of labeled datasets. However, these datasets often contain labeling errors that can impede the performance of machine learning models. These labeling errors are especially problematic in critical areas such as medical diagnosis, where precise and accurate data labeling is crucial.

Label errors in datasets occur for many reasons, such as human error during manual labeling or inaccuracies in auto-labeling algorithms. To counter this, various automated mislabel detection techniques have been developed. Despite their potential, these methods were predominantly validated on datasets containing synthetically introduced noise, and their effectiveness on real-world data has remained largely unexplored.
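The synthetic-noise benchmarks referenced here are typically built by randomly flipping a fraction of labels before training. A minimal sketch of symmetric (uniform) label-noise injection follows; the function name and interface are illustrative assumptions, not code from the paper:

```python
import random

def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label to a different, uniformly chosen class with
    probability `noise_rate` (symmetric/uniform synthetic noise)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # pick any class other than the currently assigned one
            y = rng.choice([c for c in range(num_classes) if c != y])
        noisy.append(y)
    return noisy

clean = [0, 1, 2, 0, 1, 2] * 100
noisy = inject_symmetric_noise(clean, num_classes=3, noise_rate=0.2)
flipped = sum(a != b for a, b in zip(clean, noisy))
print(f"{flipped} of {len(clean)} labels flipped")
```

Benchmarks of this kind then ask how well a detector recovers the flipped indices at different noise rates, which is exactly the varying-noise-level axis the study sweeps.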

To address this gap, the authors conducted more than 200 experiments benchmarking automated mislabel detection methods across multiple datasets, covering both synthetically introduced and real noise at varying noise levels. Among the approaches compared is a new method crafted for this study, the Simple and Efficient Mislabel Detector (SEMD), which performed comparably to or better than existing techniques while being substantially faster.
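The summary does not spell out SEMD's internals, but detectors in this family commonly score each example by its held-out (cross-validated) predicted probability for the assigned label and flag the lowest-confidence examples as likely mislabels. A hedged sketch of that generic recipe, not the paper's actual implementation (`flag_mislabels` and the `frac` threshold are illustrative assumptions):

```python
import numpy as np

def flag_mislabels(pred_probs, labels, frac=0.1):
    """Flag the examples whose held-out predicted probability for their
    assigned label is lowest. `pred_probs` is (n_samples, n_classes),
    produced by a model that never trained on the example it scores."""
    self_conf = pred_probs[np.arange(len(labels)), labels]
    n_flag = max(1, int(frac * len(labels)))
    # lowest self-confidence first: the most suspicious labels
    return np.argsort(self_conf)[:n_flag]

# toy example: example 2 carries a label its held-out predictions contradict
probs = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.95, 0.05],
                  [0.10, 0.90]])
labels = np.array([0, 1, 1, 1])  # example 2 labeled 1, but the model says 0
print(flag_mislabels(probs, labels, frac=0.25))  # → [2]
```

In practice the held-out probabilities would come from cross-validated training of the downstream classifier, and the flagged fraction becomes the "mislabel removal amount" knob the study sweeps.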

In an applied context, SEMD was tested on CheXpert, a dataset of chest X-rays, and METER-ML, a multi-sensor image dataset labeled for methane emissions, each of which poses its own labeling challenges. The findings indicate that removing mislabeled data using SEMD can lead to significant improvements in classification accuracy, particularly in smaller data regimes, where per-class performance improved by up to 8% after retraining.

For multi-label tasks, where a single example can be associated with multiple labels, the study explored several strategies for mislabel detection and removal, ranging from per-image to per-label approaches; the best performance was often achieved by combining these strategies and tailoring them to the specific task at hand.
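The per-image versus per-label distinction can be illustrated on a toy multi-label dataset: per-label removal masks only the suspect (image, class) entries, while per-image removal drops any image containing a flagged label. The dict-based representation and helper names below are assumptions for illustration, not the paper's data format:

```python
def per_label_removal(labels, flags):
    """Remove only flagged (image, class) entries; keep the image.
    `labels` and `flags` map image_id -> set of class names."""
    return {img: lbls - flags.get(img, set()) for img, lbls in labels.items()}

def per_image_removal(labels, flags):
    """Drop any image that has at least one flagged label."""
    return {img: lbls for img, lbls in labels.items() if not flags.get(img)}

labels = {"img1": {"cat", "dog"}, "img2": {"dog"}, "img3": {"car"}}
flags = {"img1": {"dog"}}  # "dog" on img1 suspected to be mislabeled

print(per_label_removal(labels, flags))   # img1 keeps "cat"; others unchanged
print(per_image_removal(labels, flags))   # img1 dropped entirely
```

Per-label removal preserves more training signal but leaves partially labeled images; per-image removal is stricter and shrinks the dataset faster, which is why the study finds a combination often works best.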

The study is both extensive and detailed, offering a comprehensive analysis across a variety of settings: synthetic and real noise levels, dataset sizes, and differing removal strategies. Furthermore, it contributes to the field by proposing an effective and efficient approach for mislabel detection that is well suited to the complexities of real-world datasets. The insights it provides help practitioners design better data-cleaning methods, improving the robustness and accuracy of the resulting machine learning models.
