Not Enough Data? Deep Learning to the Rescue!

Published 8 Nov 2019 in cs.CL and cs.LG | (1911.03118v2)

Abstract: Based on recent advances in natural language modeling and those in text generation capabilities, we propose a novel data augmentation method for text classification tasks. We use a powerful pre-trained neural network model to artificially synthesize new labeled data for supervised learning. We mainly focus on cases with scarce labeled data. Our method, referred to as language-model-based data augmentation (LAMBADA), involves fine-tuning a state-of-the-art language generator to a specific task through an initial training phase on the existing (usually small) labeled data. Using the fine-tuned model and given a class label, new sentences for the class are generated. Our process then filters these new sentences by using a classifier trained on the original data. In a series of experiments, we show that LAMBADA improves classifiers' performance on a variety of datasets. Moreover, LAMBADA significantly improves upon the state-of-the-art techniques for data augmentation, specifically those applicable to text classification tasks with little data.


Authors (8)

Citations (341)


Summary

The paper introduces LAMBADA, a novel method that leverages fine-tuned language models to generate synthetic labeled data.
It demonstrates significant improvements in classification accuracy across benchmarks, even with as few as five samples per class.
Experimental results validate LAMBADA's superiority over traditional augmentation techniques, highlighting its impact in low-resource NLP scenarios.

LLM-Based Data Augmentation for Text Classification

This paper addresses a fundamental challenge in the field of NLP, specifically in text classification tasks: the scarcity of labeled data. The authors propose a novel method called language-model-based data augmentation (LAMBADA), which synthesizes additional labeled data to enhance classifier performance under limited data availability. This method employs a fine-tuned LLM to generate new labeled sentences, offering a sophisticated solution that surpasses existing data augmentation techniques.

Overview of LAMBADA

LAMBADA uses a generative pre-trained model to synthesize new sentences conditioned on class labels. The process begins by fine-tuning a state-of-the-art language generator, such as GPT-2, on the small existing labeled dataset. The fine-tuned generator then produces new sentences for each class, and a classifier trained on the original data filters them. Across several benchmark datasets and classification algorithms, classifiers trained with the augmented data consistently outperform those trained on the original data alone.
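The filtering step can be illustrated with a minimal sketch. It assumes one plausible rule that this summary does not spell out: keep a generated sentence only when the classifier predicts the same label the generator was conditioned on, then retain the most confident sentences per class. The names `filter_synthetic` and `toy_classify` are hypothetical, and the keyword scorer is only a stand-in for a classifier trained on the original data.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, str]     # (label the generator was conditioned on, generated text)
Prediction = Tuple[str, float]  # (predicted label, confidence in [0, 1])


def filter_synthetic(
    candidates: List[Candidate],
    classify: Callable[[str], Prediction],
    per_class: int,
) -> Dict[str, List[str]]:
    """Keep generated sentences whose predicted label matches the conditioning
    label, ranked by classifier confidence, at most `per_class` per label.
    This ranking rule is an assumption made for illustration."""
    scored = defaultdict(list)
    for label, text in candidates:
        predicted, confidence = classify(text)
        if predicted == label:  # drop sentences the classifier disagrees with
            scored[label].append((confidence, text))

    kept: Dict[str, List[str]] = {}
    for label, items in scored.items():
        # Stable sort: ties keep their original generation order.
        items.sort(key=lambda pair: pair[0], reverse=True)
        kept[label] = [text for _, text in items[:per_class]]
    return kept


def toy_classify(text: str) -> Prediction:
    """Hypothetical keyword scorer standing in for a trained classifier."""
    words = text.lower().split()
    flight_hits = sum(w in {"flight", "flights", "fly"} for w in words)
    fare_hits = sum(w in {"fare", "cost", "price"} for w in words)
    total = flight_hits + fare_hits
    if total == 0:
        return ("unknown", 0.0)
    if flight_hits >= fare_hits:
        return ("flight", flight_hits / total)
    return ("fare", fare_hits / total)


if __name__ == "__main__":
    generated = [
        ("flight", "show me flights to boston"),
        ("flight", "what is the price of a flight"),
        ("fare", "how much does it cost"),
        ("fare", "i want to fly to denver"),
        ("flight", "book a flight and fly out monday"),
    ]
    for label, texts in filter_synthetic(generated, toy_classify, per_class=2).items():
        print(label, texts)
```

In a real pipeline the candidates would come from the fine-tuned generator and `classify` would wrap the baseline classifier; the surviving sentences are then added to the original training set.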

Experimental Analysis

The paper's experimental section evaluates LAMBADA across different classifiers, including BERT, SVM, and LSTM, and datasets such as ATIS, TREC, and WVA. The results consistently reflect LAMBADA's superiority in scenarios with limited data, substantially improving classification accuracy compared to the baseline and other data augmentation methods like EDA and CBERT. Specifically, LAMBADA showed notable improvements in datasets with as few as five samples per class. The results from McNemar's test statistically validate the improvements, offering robust evidence for LAMBADA's effectiveness.
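To make the significance check concrete, the following sketch computes an exact two-sided McNemar test for two classifiers evaluated on the same test items. The summary does not say which variant of the test was used, so the exact binomial form here is an assumption, and the function names are hypothetical. The test looks only at discordant items, those that exactly one of the two classifiers labels correctly.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(
    gold: Sequence[str],
    pred_a: Sequence[str],
    pred_b: Sequence[str],
) -> Tuple[int, int]:
    """Count items that only classifier A, or only classifier B, gets right."""
    only_a = sum(a == g and b != g for g, a, b in zip(gold, pred_a, pred_b))
    only_b = sum(a != g and b == g for g, a, b in zip(gold, pred_a, pred_b))
    return only_a, only_b


def mcnemar_exact_p(only_a: int, only_b: int) -> float:
    """Exact two-sided McNemar p-value.

    Under the null hypothesis, each discordant item is equally likely to
    favor either classifier, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant items, so no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Illustrative labels only, not data from the paper.
    gold = ["a", "b", "a", "b"]
    baseline = ["a", "a", "a", "a"]
    augmented = ["a", "b", "b", "b"]
    counts = discordant_counts(gold, baseline, augmented)
    print(counts, mcnemar_exact_p(*counts))
```

A small p-value suggests that the difference in accuracy between the baseline and the augmented classifier is unlikely to be due to chance on that test set.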

Implications for Text Classification

This research presents meaningful implications for NLP and text classification. LAMBADA provides a practical alternative to semi-supervised techniques, especially when unlabeled data is inaccessible or costly, thus enabling more effective deployment of machine learning models in niche domains. Additionally, by enhancing performance in low-resource settings, LAMBADA opens the potential for more robust intent classification systems in conversational AI, improved sentiment analysis, and various other classification tasks.

Future Directions

The authors suggest several avenues for future research, including iterative training processes that further refine the generation and filtering of synthetic data. Exploring zero-shot learning applications and alternative filtering heuristics could also improve the method's utility and broaden its applicability. Given the increasing importance of efficient data usage, further developments from this research could influence future NLP methodologies and applications.

In conclusion, this paper contributes a statistically validated approach to data augmentation for text classification, with particular efficacy in low-data scenarios. By leveraging pre-trained generative models, LAMBADA offers a scalable and effective way to overcome data scarcity, making it a noteworthy direction for improving machine learning models in NLP.
