STENCIL: Submodular Mutual Information Based Weak Supervision for Cold-Start Active Learning (2402.13468v2)
Abstract: As supervised fine-tuning of pre-trained models within NLP applications grows in popularity, ever larger corpora of annotated data are required, especially as parameter counts in LLMs increase. Active learning, which attempts to mine and annotate the unlabeled instances that improve model performance as quickly as possible, is a common choice for reducing the annotation cost; however, most methods ignore class imbalance and either assume access to an initial pool of annotated data or require multiple rounds of active learning selection before rare classes improve. We present STENCIL, which utilizes a set of text exemplars and the recently proposed submodular mutual information to select a set of weakly labeled rare-class instances that are then strongly labeled by an annotator. We show that, within the class-imbalanced cold-start setting, STENCIL improves overall accuracy by $10\%$-$18\%$ and rare-class F1 score by $17\%$-$40\%$ over common active learning methods on multiple text classification datasets.
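To make the selection step concrete, below is a minimal sketch of greedy submodular mutual information (SMI) maximization, assuming a facility-location-style SMI objective of the form $I_f(A;Q)=\sum_{q\in Q}\max_{a\in A}s_{aq}+\eta\sum_{a\in A}\max_{q\in Q}s_{aq}$ over cosine similarities between unlabeled-instance embeddings and the rare-class exemplar set $Q$. This is one SMI instantiation from the literature, not necessarily the exact variant used in the paper; the function name, the `eta` trade-off parameter, and the embedding choice are illustrative assumptions.

```python
import numpy as np

def select_smi_batch(U, Q, budget, eta=1.0):
    """Greedily maximize a facility-location-style SMI objective
    (an assumption; the paper may use a different SMI variant):
        I(A; Q) = sum_{q in Q} max_{a in A} s(a, q)
                  + eta * sum_{a in A} max_{q in Q} s(a, q)
    U: (n, d) embeddings of unlabeled instances.
    Q: (m, d) embeddings of rare-class text exemplars.
    Returns the indices of the selected batch A, |A| = budget.
    """
    # Cosine similarity between every unlabeled point and every exemplar.
    Un = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-12)
    S = Un @ Qn.T                      # (n, m) similarities s(a, q)

    selected = []
    best_per_q = np.zeros(Q.shape[0])  # current max_{a in A} s(a, q)
    # The second term is modular in A, so each point's contribution is fixed.
    relevance = eta * S.max(axis=1)    # eta * max_{q} s(a, q)

    for _ in range(budget):
        # Marginal gain of candidate a: its improvement to the
        # per-exemplar maxima plus its fixed relevance term.
        gains = np.maximum(S - best_per_q, 0.0).sum(axis=1) + relevance
        gains[selected] = -np.inf      # never re-pick a chosen point
        a = int(np.argmax(gains))
        selected.append(a)
        best_per_q = np.maximum(best_per_q, S[a])
    return selected
```

Per the abstract, a batch selected this way carries the rare-class weak label by construction, since the SMI objective rewards similarity to the rare-class exemplars; the batch is then passed to an annotator for strong labels.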