
Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings (1602.02373v2)

Published 7 Feb 2016 in stat.ML, cs.CL, and cs.LG

Abstract: One-hot CNN (convolutional neural network) has been shown to be effective for text categorization (Johnson & Zhang, 2015). We view it as a special case of a general framework which jointly trains a linear model with a non-linear feature generator consisting of 'text region embedding + pooling'. Under this framework, we explore a more sophisticated region embedding method using Long Short-Term Memory (LSTM). LSTM can embed text regions of variable (and possibly large) sizes, whereas the region size needs to be fixed in a CNN. We seek effective and efficient use of LSTM for this purpose in the supervised and semi-supervised settings. The best results were obtained by combining region embeddings in the form of LSTM and convolution layers trained on unlabeled data. The results indicate that on this task, embeddings of text regions, which can convey complex concepts, are more useful than embeddings of single words in isolation. We report performances exceeding the previous best results on four benchmark datasets.

Authors (2)
  1. Rie Johnson (8 papers)
  2. Tong Zhang (569 papers)
Citations (253)

Summary

  • The paper demonstrates that LSTM-based region embeddings effectively overcome CNN limitations by processing variable-length text segments for superior categorization accuracy.
  • The methodology simplifies supervised learning by eliminating pre-trained word embeddings and directly inputting one-hot vectors, thereby enhancing computational efficiency.
  • Empirical results on benchmarks like IMDB and Elec reveal that integrating LSTM and CNN embeddings establishes new state-of-the-art performance in text classification.

Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings

The paper "Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings" by Rie Johnson and Tong Zhang investigates the application of Long Short-Term Memory (LSTM) networks within a unified framework aimed at enhancing text categorization. This paper introduces a general architecture characterized by 'region embedding + pooling', subsuming methodologies such as one-hot CNNs and extends their capabilities by employing LSTM for text region embeddings.

Technical Overview

Text categorization, the fundamental task of assigning labels to documents, has traditionally relied on linear predictors over bag-of-words or bag-of-n-grams features. More recently, non-linear models that exploit word order have demonstrated superior performance. Within this domain, convolutional neural networks convert short text regions into fixed-size embeddings but are constrained to a predefined region size. By contrast, an LSTM can embed text regions of variable and potentially large sizes, addressing this limitation of CNNs.
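
To make the framework concrete, here is a minimal PyTorch sketch of the one-hot CNN special case of 'region embedding + pooling', in which a convolution over one-hot vectors plays the role of the region embedding. All names and layer sizes are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class RegionEmbeddingModel(nn.Module):
    """Sketch of 'region embedding + pooling': a non-linear feature
    generator embeds each text region, pooling aggregates the region
    vectors, and a linear layer predicts the label."""

    def __init__(self, vocab_size, region_size, embed_dim, num_classes):
        super().__init__()
        # kernel_size = region_size fixes the region width, which is
        # exactly the CNN constraint that the LSTM variant removes.
        self.region_embed = nn.Conv1d(vocab_size, embed_dim,
                                      kernel_size=region_size)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, one_hot_docs):
        # one_hot_docs: (batch, vocab_size, doc_length)
        regions = torch.relu(self.region_embed(one_hot_docs))
        pooled, _ = regions.max(dim=2)  # max-pooling over all regions
        return self.classifier(pooled)
```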

In this paper, the LSTM architecture is assessed in both supervised and semi-supervised scenarios. In the supervised setting, the LSTM is simplified by removing the conventional word-embedding layer and feeding one-hot vectors directly as input, which improves both accuracy and computational efficiency. In the semi-supervised setting, LSTMs are trained on unlabeled data to produce region embeddings that serve as additional input to the categorization model.
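
A hedged sketch of the supervised variant, in the spirit of the paper's one-hot bidirectional LSTM with pooling (o-2LSTM-p), follows; hyperparameters and module names are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class OneHotBiLSTMPooling(nn.Module):
    """One-hot bidirectional LSTM with pooling: one-hot vectors are
    fed directly to the LSTM (no separate word-embedding layer), and
    max-pooling over time yields the document features."""

    def __init__(self, vocab_size, hidden_dim, num_classes):
        super().__init__()
        # input_size = vocab_size means the LSTM input weights act as
        # an embedding trained jointly with the classifier.
        self.lstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, one_hot_docs):
        # one_hot_docs: (batch, doc_length, vocab_size)
        outputs, _ = self.lstm(one_hot_docs)  # (batch, doc_length, 2*hidden_dim)
        pooled, _ = outputs.max(dim=1)        # max-pooling over time steps
        return self.classifier(pooled)
```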

Numerical Results and Comparisons

The empirical results are robust, with evaluations on several benchmark datasets including IMDB, Elec, RCV1, and 20 Newsgroups. The one-hot bidirectional LSTM with pooling, o-2LSTM-p, outperforms both traditional one-hot CNNs and LSTMs with pre-trained word embeddings. Notably, the model achieves new state-of-the-art results on datasets such as IMDB and Elec, indicating that region embeddings learned via LSTMs effectively capture higher-level text semantics beyond individual word embeddings.

Furthermore, the paper shows that combining region embeddings derived from both LSTMs and CNNs yields further performance gains. This indicates that the two methodologies are complementary: the LSTM's ability to embed variable-size regions and the CNN's strength on fixed-size regions can be harnessed together, as sketched below.
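
As an illustration of the combination step, assuming the two sketches above are modified to return their pooled feature vectors rather than logits, the region embeddings from both branches can simply be concatenated before a single linear classifier. This structure is an assumption for illustration, not the paper's exact training setup.

```python
import torch
import torch.nn as nn

class CombinedRegionModel(nn.Module):
    """Illustrative combination of LSTM- and CNN-derived region
    embeddings: each branch maps a document to a pooled feature
    vector (possibly pre-trained on unlabeled data), and the
    concatenated features feed one linear classifier."""

    def __init__(self, lstm_branch, cnn_branch, lstm_dim, cnn_dim, num_classes):
        super().__init__()
        self.lstm_branch = lstm_branch  # doc -> (batch, lstm_dim) features
        self.cnn_branch = cnn_branch    # doc -> (batch, cnn_dim) features
        self.classifier = nn.Linear(lstm_dim + cnn_dim, num_classes)

    def forward(self, seq_input, conv_input):
        features = torch.cat([self.lstm_branch(seq_input),
                              self.cnn_branch(conv_input)], dim=1)
        return self.classifier(features)
```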

Implications and Future Directions

The paper demonstrates the significant potential of LSTMs for text categorization, primarily owing to their capacity to embed and pool information from variable-length text regions. This capacity could extend beyond categorization to more complex text modeling tasks where understanding deeper contextual relationships is essential.

By advocating the direct use of one-hot vectors as LSTM inputs, the paper also challenges the prevailing practice of employing pre-trained word embeddings, suggesting that effective region embeddings can be learned efficiently given sufficient training data. Future research may explore the utility of LSTM-derived region embeddings in other NLP tasks, such as sentiment analysis and document summarization. Additionally, further hybrid models that integrate the strengths of distinct neural architectures could yield even more capable text analysis frameworks.
