An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets

(arXiv:2312.02200)
Published Dec 2, 2023 in cs.CV, cs.AI, and stat.AP

Abstract

Major advancements in computer vision can primarily be attributed to the use of labeled datasets. However, acquiring labels for datasets often results in errors which can harm model performance. Recent works have proposed methods to automatically identify mislabeled images, but developing strategies to effectively implement them in real world datasets has been sparsely explored. Towards improved data-centric methods for cleaning real world vision datasets, we first conduct more than 200 experiments carefully benchmarking recently developed automated mislabel detection methods on multiple datasets under a variety of synthetic and real noise settings with varying noise levels. We compare these methods to a Simple and Efficient Mislabel Detector (SEMD) that we craft, and find that SEMD performs similarly to or outperforms prior mislabel detection approaches. We then apply SEMD to multiple real world computer vision datasets and test how dataset size, mislabel removal strategy, and mislabel removal amount further affect model performance after retraining on the cleaned data. With careful design of the approach, we find that mislabel removal leads to per-class performance improvements of up to 8% of a retrained classifier in smaller data regimes.

Overview

  • The study focuses on evaluating automated mislabel detection methods in real-world vision datasets, highlighting the problems caused by labeling errors.

  • A new mislabel detection method, SEMD, is introduced and shown to be fast and effective compared to existing techniques.

  • The effectiveness of SEMD is demonstrated on datasets such as CheXpert and METER-ML, with significant improvements in classification accuracy when mislabeled data is removed.

  • Different strategies for mislabel detection in multi-label tasks are explored, with combined approaches often yielding the best results.

  • The research provides a comprehensive analysis that will help improve data-cleaning methods, enhancing the reliability and performance of machine learning models.

In recent years, the field of computer vision has seen impressive advancements, largely attributed to the use of labeled datasets. However, these datasets often contain labeling errors that can impede the performance of machine learning models. These labeling errors are especially problematic in critical areas such as medical diagnosis, where precise and accurate data labeling is crucial.

Label errors in datasets occur for many reasons, such as human error during manual labeling or inaccuracies in auto-labeling algorithms. To counter this, various automated mislabel detection techniques have been developed. Despite their potential, these methods were predominantly validated on datasets containing synthetically introduced noise, and their effectiveness on real-world data has remained largely unexplored.
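The synthetic-noise benchmarks referenced here are typically built by randomly flipping a fraction of labels before training. A minimal sketch of symmetric (uniform) label-noise injection follows; the function name and interface are illustrative assumptions, not code from the paper:

```python
import random

def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label to a different, uniformly chosen class with
    probability `noise_rate` (symmetric/uniform synthetic noise)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # pick any class other than the currently assigned one
            y = rng.choice([c for c in range(num_classes) if c != y])
        noisy.append(y)
    return noisy

clean = [0, 1, 2, 0, 1, 2] * 100
noisy = inject_symmetric_noise(clean, num_classes=3, noise_rate=0.2)
flipped = sum(a != b for a, b in zip(clean, noisy))
print(f"{flipped} of {len(clean)} labels flipped")
```

Benchmarks of this kind then ask how well a detector recovers the flipped indices at different noise rates, which is exactly the varying-noise-level axis the study sweeps.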

To address this gap, the authors conducted more than 200 experiments benchmarking automated mislabel detection methods across multiple datasets, covering both synthetically introduced and real noise at varying noise levels. Among the approaches compared is a new method crafted for this study, the Simple and Efficient Mislabel Detector (SEMD), which performed comparably to or better than existing techniques while being substantially faster.
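The summary does not spell out SEMD's internals, but detectors in this family commonly score each example by its held-out (cross-validated) predicted probability for the assigned label and flag the lowest-confidence examples as likely mislabels. A hedged sketch of that generic recipe, not the paper's actual implementation (`flag_mislabels` and the `frac` threshold are illustrative assumptions):

```python
import numpy as np

def flag_mislabels(pred_probs, labels, frac=0.1):
    """Flag the examples whose held-out predicted probability for their
    assigned label is lowest. `pred_probs` is (n_samples, n_classes),
    produced by a model that never trained on the example it scores."""
    self_conf = pred_probs[np.arange(len(labels)), labels]
    n_flag = max(1, int(frac * len(labels)))
    # lowest self-confidence first: the most suspicious labels
    return np.argsort(self_conf)[:n_flag]

# toy example: example 2 carries a label its held-out predictions contradict
probs = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.95, 0.05],
                  [0.10, 0.90]])
labels = np.array([0, 1, 1, 1])  # example 2 labeled 1, but the model says 0
print(flag_mislabels(probs, labels, frac=0.25))  # → [2]
```

In practice the held-out probabilities would come from cross-validated training of the downstream classifier, and the flagged fraction becomes the "mislabel removal amount" knob the study sweeps.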

In an applied context, SEMD was tested on CheXpert, a dataset of chest X-rays, and METER-ML, a multi-sensor image dataset labeled for methane emissions, each of which poses its own labeling challenges. The findings indicate that removing mislabeled data using SEMD can lead to significant improvements in classification accuracy, particularly in smaller data regimes, where per-class performance improved by up to 8% after retraining.

For multi-label tasks, where a single example can be associated with multiple labels, the study explored several strategies for mislabel detection and removal, ranging from per-image to per-label approaches; the best performance was often achieved by combining these strategies and tailoring them to the specific task at hand.
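The per-image versus per-label distinction can be illustrated on a toy multi-label dataset: per-label removal masks only the suspect (image, class) entries, while per-image removal drops any image containing a flagged label. The dict-based representation and helper names below are assumptions for illustration, not the paper's data format:

```python
def per_label_removal(labels, flags):
    """Remove only flagged (image, class) entries; keep the image.
    `labels` and `flags` map image_id -> set of class names."""
    return {img: lbls - flags.get(img, set()) for img, lbls in labels.items()}

def per_image_removal(labels, flags):
    """Drop any image that has at least one flagged label."""
    return {img: lbls for img, lbls in labels.items() if not flags.get(img)}

labels = {"img1": {"cat", "dog"}, "img2": {"dog"}, "img3": {"car"}}
flags = {"img1": {"dog"}}  # "dog" on img1 suspected to be mislabeled

print(per_label_removal(labels, flags))   # img1 keeps "cat"; others unchanged
print(per_image_removal(labels, flags))   # img1 dropped entirely
```

Per-label removal preserves more training signal but leaves partially labeled images; per-image removal is stricter and shrinks the dataset faster, which is why the study finds a combination often works best.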

The study is both extensive and detailed, offering a comprehensive analysis across a variety of settings: synthetic and real noise levels, dataset sizes, and differing removal strategies. Furthermore, it contributes to the field by proposing an effective and efficient approach for mislabel detection that is well suited to the complexities of real-world datasets. The insights it provides help practitioners design better data-cleaning methods, improving the robustness and accuracy of the resulting machine learning models.
