Deep Over-sampling Framework for Classifying Imbalanced Data (1704.07515v3)

Published 25 Apr 2017 in cs.LG and stat.ML

Abstract: Class imbalance is a challenging issue in practical classification problems for deep learning models as well as traditional models. Traditionally successful countermeasures such as synthetic over-sampling have had limited success with complex, structured data handled by deep learning models. In this paper, we propose Deep Over-sampling (DOS), a framework for extending the synthetic over-sampling method to exploit the deep feature space acquired by a convolutional neural network (CNN). Its key feature is an explicit, supervised representation learning, for which the training data presents each raw input sample with a synthetic embedding target in the deep feature space, which is sampled from the linear subspace of in-class neighbors. We implement an iterative process of training the CNN and updating the targets, which induces smaller in-class variance among the embeddings, to increase the discriminative power of the deep representation. We present an empirical study using public benchmarks, which shows that the DOS framework not only counteracts class imbalance better than the existing method, but also improves the performance of the CNN in the standard, balanced settings.

Summary

The paper introduces the Deep Over-sampling (DOS) framework, which extends synthetic over-sampling into the deep feature space of CNNs using supervised representation learning to improve feature discriminative power.
Empirical studies demonstrate DOS's superior performance on severely imbalanced datasets and, more surprisingly, improved results on balanced datasets, attributed to better overall learned representations.
The DOS framework effectively addresses the challenge of applying traditional over-sampling to deep learning architectures, showing promise for complex classification tasks and real-world applications where data imbalance is common.

Overview of the Deep Over-sampling Framework for Classifying Imbalanced Data

The paper presents a novel approach to dealing with class imbalance in data classification tasks, especially within the framework of deep learning. Traditional methods like synthetic over-sampling, while effective for simpler models, struggle to handle the complex structures that convolutional neural networks (CNNs) process. To address this, the authors introduce the Deep Over-sampling (DOS) framework, which extends synthetic over-sampling into the deep feature space of CNNs.

The core innovation of DOS is its explicit use of supervised representation learning. Each raw input is paired with a synthetic embedding target in the deep feature space, sampled from the linear subspace spanned by its in-class neighbors. Training the CNN toward these targets reduces in-class variance among the embeddings, which enhances the discriminative power of the features. The procedure alternates between CNN training and target updates, progressively refining the representations and improving classification performance.
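The target construction described above can be written compactly. The following is a sketch in assumed notation (the symbols f, v_j, w_{ij}, and k are chosen here for illustration and are not taken from the paper): v_j denotes the current embedding of the j-th in-class neighbor of sample x_i, and f is the embedding function of the CNN.

```latex
% Synthetic embedding target: a convex combination of k in-class neighbors
\tilde{v}_i = \sum_{j=1}^{k} w_{ij}\, v_j,
\qquad w_{ij} \ge 0,
\qquad \sum_{j=1}^{k} w_{ij} = 1

% Embedding loss: squared distance between the CNN embedding and its target
\ell_f(x_i) = \left\lVert f(x_i) - \tilde{v}_i \right\rVert_2^2
```

Under this reading, each target lies inside the region spanned by nearby same-class embeddings, so minimizing the loss pulls f(x_i) toward its local class neighborhood, which is consistent with the reduced in-class variance reported in the abstract.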

Key Contributions and Methodology

Synthetic Over-sampling in Deep Feature Space: The framework formulates over-sampling in the feature space rather than the input space, which allows it to maintain the integrity of the feature distribution while providing class augmentation.
Supervised Representation Learning: DOS uses synthetic instances as supervised learning targets, ensuring that the synthetic data enrich the learning process without deviating significantly from the natural class distributions.
Iterative Learning Process: The proposal involves iteratively updating the CNN and the synthetic targets to continuously improve class distinction.
Application of DOS in Imbalanced and Balanced Settings: The DOS framework was empirically validated on several public datasets, where it handled class imbalance better than existing methods and also improved CNN performance in standard, balanced settings.
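The feature-space over-sampling step listed above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `sample_synthetic_targets`, its parameters, and the choice of random normalized weights are assumptions made for the example. It takes precomputed embeddings and, for each class, draws the same number of (source sample, synthetic target) pairs, where each target is a random convex combination of the source sample's nearest in-class neighbors.

```python
import numpy as np

def sample_synthetic_targets(embeddings, labels, k=3, per_class=None, seed=0):
    """Hypothetical helper: draw class-balanced (sample index, target) pairs.

    Each target is a random convex combination of the k nearest in-class
    neighbors (the sample itself included) in the given embedding space.
    """
    rng = np.random.default_rng(seed)
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    if per_class is None:
        # Assumption: over-sample every class up to the majority class size.
        per_class = int(counts.max())

    src_idx, targets = [], []
    for c in classes:
        members = np.flatnonzero(labels == c)
        feats = embeddings[members]
        # Pairwise squared distances between embeddings of this class only.
        dist = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(axis=-1)
        kk = min(k, len(members))
        neighbors = np.argsort(dist, axis=1)[:, :kk]
        # Minority samples are drawn repeatedly, each time with new weights.
        picks = rng.integers(0, len(members), size=per_class)
        weights = rng.random((per_class, kk))
        weights /= weights.sum(axis=1, keepdims=True)
        neigh_feats = feats[neighbors[picks]]          # (per_class, kk, dim)
        targets.append((weights[:, :, None] * neigh_feats).sum(axis=1))
        src_idx.append(members[picks])
    return np.concatenate(src_idx), np.concatenate(targets)
```

In an iterative scheme like the one the paper describes, a routine of this kind would be re-run after each round of CNN training, using the updated embeddings to produce fresh targets.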

Empirical Findings

The empirical studies on various datasets, including MNIST variants and CIFAR-10, demonstrate DOS's efficacy under skewed class distributions. In severely imbalanced scenarios, class-wise recall declined more slowly with DOS than with other methods such as triplet re-sampling and cost-sensitive learning. DOS also improved performance on balanced datasets, suggesting that its benefit goes beyond imbalance correction and extends to the overall quality of the learned representations.
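Class-wise recall is the metric that makes these comparisons meaningful, because overall accuracy can stay high while a minority class is mostly misclassified. The snippet below is a small illustrative sketch of how it can be computed; the function name `class_wise_recall` and the toy labels are made up for the example and do not come from the paper.

```python
import numpy as np

def class_wise_recall(y_true, y_pred, num_classes):
    """Return an array whose entry c is the recall of class c.

    Classes with no true samples get NaN so they can be skipped when averaging.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            # Fraction of true class-c samples that were predicted as c.
            recalls[c] = np.mean(y_pred[mask] == c)
    return recalls

# Toy example: class 0 is the majority, class 1 the minority.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1]
recalls = class_wise_recall(y_true, y_pred, num_classes=2)
accuracy = np.mean(np.asarray(y_true) == np.asarray(y_pred))
print(recalls, accuracy)
```

Here the recall is 0.75 for the majority class and 0.5 for the minority class, while overall accuracy is about 0.67, so the per-class view exposes the weaker minority performance that a single accuracy figure would blur.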

Implications and Future Directions

The DOS framework addresses a significant gap in the applicability of traditional over-sampling methods to deep learning architectures. Because it jointly optimizes representation learning and classifier performance, DOS is a promising candidate for a range of complex classification tasks. Future research could extend DOS to other neural network architectures and explore its integration with advanced cost-sensitive learning techniques.

Moreover, adapting DOS to real-world applications, where data imbalance is a commonplace issue, would prove valuable. Understanding the dynamics of DOS in more granular contexts, such as under varied data augmentation strategies and learning rates, could provide deeper insights into its robustness and adaptability.