Emergent Mind

Abstract

Active learning for imbalanced classification tasks is challenging as the minority classes naturally occur rarely. Gathering a large pool of unlabelled data is thus essential to capture minority instances. Standard pool-based active learning is computationally expensive on large pools and often reaches low accuracy by overfitting the initial decision boundary, thus failing to explore the input space and find minority instances. To address these issues we propose AnchorAL. At each iteration, AnchorAL chooses class-specific instances from the labelled set, or anchors, and retrieves the most similar unlabelled instances from the pool. The resulting subpool is then used for active learning. By using a small, fixed-size subpool, AnchorAL allows scaling any active learning strategy to large pools. By dynamically selecting different anchors at each iteration it promotes class balance and prevents overfitting the initial decision boundary, thus promoting the discovery of new clusters of minority instances. Experiments across different classification tasks, active learning strategies, and model architectures show that AnchorAL (i) is faster, often reducing runtime from hours to minutes, (ii) trains more performant models, and (iii) returns more balanced datasets than competing methods.

Figure: Comparison of traditional active learning and AnchorAL in discovering minority data clusters.

Overview

  • AnchorAL introduces a novel active learning strategy that overcomes the challenges of class imbalance and computational inefficiency by creating balanced and manageable subpools for labeling.

  • The method employs a pool filtering mechanism based on the semantic representation capabilities of language models, enabling the efficient discovery of minority instances.

  • Experiments demonstrate that AnchorAL significantly improves computational efficiency and model performance across various text classification tasks.

  • AnchorAL's approach opens new possibilities for applying active learning in real-world scenarios with large, imbalanced datasets, offering a path toward more equitable and effective AI systems.

Scaling Active Learning for Imbalanced Classification: The AnchorAL Approach

Overview

Large pools of unlabelled data pose a significant challenge, especially for imbalanced classification tasks where the minority classes naturally occur rarely. Active Learning (AL) strategies, designed to select informative instances for labeling, tend to be computationally expensive and often ineffective in discovering minority instances due to their iterative nature and dependence on the initial decision boundary. This paper introduces a novel method, AnchorAL, which addresses these challenges by employing a pool filtering mechanism that facilitates the scaling of any AL strategy to large datasets while ensuring class balance and promoting the discovery of new clusters of minority instances.

Active Learning and Class Imbalance

Active Learning in the context of large and imbalanced datasets struggles due to its high computational demands and inability to efficiently explore the input space for minority instances. The computational challenge arises from the need to repeatedly evaluate the model on every unlabelled instance in the pool, which is not practical with current language model sizes. On the other hand, standard AL strategies often fail to explore the input space adequately due to overfitting the initial decision boundary, thereby missing out on minority instances necessary for improving model performance in real-world applications.
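To make the computational bottleneck concrete, here is a minimal sketch of standard pool-based uncertainty sampling. The `predict_proba` interface and margin score are illustrative assumptions, not details from the paper; the point is the full inference pass over the pool at every iteration.

```python
def uncertainty_select(predict_proba, pool, budget):
    """Score EVERY unlabelled instance each iteration -- the O(|pool|)
    inference pass that becomes impractical for large pools and models."""
    scored = []
    for idx, x in enumerate(pool):
        p = predict_proba(x)                 # one forward pass per instance
        margin = max(p) - sorted(p)[-2]      # small margin = high uncertainty
        scored.append((margin, idx))
    scored.sort()                            # most uncertain first
    return [idx for _, idx in scored[:budget]]
```

Because this loop touches the entire pool each round, its cost grows linearly with pool size and with model size, which is exactly what AnchorAL's subpool filtering avoids.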

AnchorAL Methodology

AnchorAL introduces a pragmatic approach by selecting class-specific instances (anchors) from the labelled set and using these to retrieve the most similar unlabelled instances from the pool to form a subpool for active learning. This method hinges on the semantic representation capabilities of language models to measure similarity based on cosine distances between instance representations, thereby dynamically creating smaller, manageable subpools for each iteration. This process not only reduces the computational load by avoiding the need to evaluate the entire pool but also ensures that the subpool remains balanced and diverse, thus addressing the critical class imbalance issue.
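The subpool construction described above can be sketched as follows. This is a simplified illustration under assumed interfaces: embeddings are plain vectors, anchors are taken naively from the front of each class's labelled instances (the paper's actual anchor-selection strategy is more deliberate), and retrieval is a brute-force cosine-similarity scan rather than an approximate nearest-neighbour index.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_subpool(labelled, pool_emb, anchors_per_class=2, neighbours_per_anchor=3):
    """Form a small, class-balanced subpool: for each class-specific anchor,
    retrieve its most similar unlabelled instances and take the union."""
    by_class = defaultdict(list)
    for emb, label in labelled:
        by_class[label].append(emb)
    subpool = set()
    for label, embs in by_class.items():
        # anchor choice here is a placeholder for the paper's strategy
        for anchor in embs[:anchors_per_class]:
            ranked = sorted(pool_emb,
                            key=lambda i: cosine(anchor, pool_emb[i]),
                            reverse=True)
            subpool.update(ranked[:neighbours_per_anchor])
    return subpool
```

Any off-the-shelf AL strategy can then score only this small subpool instead of the full pool, which is what decouples the per-iteration cost from the pool size.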

Experiments and Results

The effectiveness of AnchorAL was demonstrated across multiple text classification tasks, AL strategies, and model architectures. The experiments showed that AnchorAL significantly reduced runtime from hours to minutes, improved model performance, and resulted in more balanced datasets. These results underscore the method's ability to combine computational efficiency with enhanced performance, particularly in addressing the challenges posed by imbalanced class distributions.

Implications and Future Directions

AnchorAL's approach to leveraging semantic representations and focusing on class-specific instances for forming subpools introduces a robust framework for scaling AL strategies to large and imbalanced datasets. This method opens up new possibilities for applying AL in real-world settings where computational resources are limited, and class imbalance is a significant challenge. Further research could explore the optimization of anchor selection strategies and the adaptation of AnchorAL to a broader range of tasks and languages.

Conclusion

AnchorAL represents a significant advance in the application of AL to imbalanced classification tasks. By effectively addressing both the computational and learning challenges inherent in large and imbalanced datasets, AnchorAL facilitates the efficient selection of informative instances for labeling. This approach not only enhances model performance but also contributes to a more balanced representation of classes, ultimately leading to more equitable and effective AI systems.
