Improving Topic Models with Latent Feature Word Representations (1810.06306v1)

Published 15 Oct 2018 in cs.CL, cs.IR, and cs.LG

Abstract: Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.



Summary

The paper introduces latent feature-enhanced LDA and DMM models that improve word-topic mapping with pre-trained word vectors.
The enhanced models improve topic coherence (measured by NPMI) and document clustering (measured by purity and NMI), especially on small or short-text datasets.
Empirical results show superior document classification with improved F1 scores, highlighting the benefit of incorporating external semantic knowledge.

Improving Topic Models with Latent Feature Word Representations

The paper, by Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson, incorporates latent feature word representations into topic modeling. The goal is to improve two probabilistic topic models, Latent Dirichlet Allocation (LDA) and the Dirichlet Multinomial Mixture (DMM) model, by bringing in information from large external corpora through pre-trained word vectors.

Overview of the Proposed Models

The authors propose two new models: latent feature-enhanced LDA (LF-LDA) and latent feature-enhanced DMM (LF-DMM). The core idea is to improve the word-topic mapping by using latent feature vector representations trained on very large corpora, which supply information that a smaller corpus, or one made up of short documents, cannot provide on its own.

The methodology replaces the conventional topic-to-word Dirichlet multinomial component in both LDA and DMM with a mixture of this component and a latent feature vector component. In essence, the enhanced models utilize a pre-trained set of word vectors to better approximate how words in a document relate to underlying topics.
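The mixture described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not give the exact form of the latent feature component, so the sketch assumes it is a softmax over dot products between a topic vector and the pre-trained word vectors, and all names (`mixture_word_distribution`, `phi_t`, `tau_t`, `omega`, `lam`) are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


def mixture_word_distribution(phi_t, tau_t, omega, lam):
    """Word distribution for one topic as a two-component mixture.

    phi_t : (V,) topic-to-word probabilities (Dirichlet multinomial component)
    tau_t : (D,) topic vector living in the word-vector space (assumed form)
    omega : (V, D) pre-trained word vectors, kept fixed
    lam   : mixture weight given to the latent feature component (assumed name)
    """
    # Assumption: the latent feature component scores each word by its
    # dot product with the topic vector, then normalizes with a softmax.
    latent_feature = softmax(omega @ tau_t)
    return lam * latent_feature + (1.0 - lam) * phi_t


# Toy example with random values standing in for real word vectors.
rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
omega = rng.normal(size=(vocab_size, dim))
tau_t = rng.normal(size=dim)
phi_t = rng.dirichlet(np.ones(vocab_size))

p = mixture_word_distribution(phi_t, tau_t, omega, lam=0.6)
print(p, p.sum())
```

With `lam=0` the sketch reduces to the ordinary topic-to-word distribution, and with `lam=1` words are generated from the word-vector component alone, which makes the role of the mixture weight easy to inspect.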

Empirical Evaluation

The paper evaluates the new models on several datasets of varying size and document length: the 20-Newsgroups, TagMyNews news, and Sanders Twitter datasets. The evaluation covers three tasks: topic coherence, document clustering, and document classification. The latent feature models show significant improvements on all three, particularly on datasets with few or short documents.

Topic Coherence: The enhanced models consistently outperform the baseline LDA and DMM models in terms of topic coherence, as measured by the normalized pointwise mutual information (NPMI) metric. The authors attribute this to the pre-trained vectors’ ability to capture word semantics from larger corpora, which assists in generating more thematically coherent topics.
Document Clustering: The latent feature models achieve higher purity and normalized mutual information (NMI) scores than the baseline models, particularly on datasets with few or short documents. This suggests that external knowledge helps the proposed models represent document-topic associations more accurately.
Document Classification: The enhanced models also perform better on document classification, with notable improvements in $F_1$ scores. The improvements are most pronounced on the smaller datasets, where the latent feature vectors supply information the corpus itself lacks.
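Two of the metrics named above, NPMI and purity, are simple enough to sketch directly. The code below is a minimal illustration, not the paper's evaluation pipeline: it assumes document-level co-occurrence counts over a reference corpus given as a list of word sets, averages NPMI over all pairs of a topic's top words, and treats each document as assigned to a single cluster. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_pair(w1, w2, docs):
    """NPMI of two words from document-level co-occurrence in docs (list of sets)."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0.0:
        return -1.0  # the words never co-occur: NPMI takes its minimum value
    if p12 == 1.0:
        return 0.0  # degenerate case, set to 0 here by assumption
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_npmi(top_words, docs):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)


def purity(cluster_ids, gold_labels):
    """Fraction of documents that carry the majority gold label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(gold_labels)


# Toy reference corpus and clustering, for illustration only.
docs = [
    {"game", "team"},
    {"game", "team", "win"},
    {"stock", "market"},
    {"stock", "price"},
]
print(npmi_pair("game", "team", docs))
print(topic_npmi(["game", "team", "stock"], docs))
print(purity([0, 0, 1, 1], ["sport", "sport", "finance", "sport"]))
```

In a topic-model setting, the cluster of a document would typically be the topic with the highest probability for that document, and the top words of a topic would be its highest-probability words; both choices are assumptions of this sketch.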

Technical Insights and Implications

Incorporating latent feature vectors addresses a key limitation of traditional topic models, whose performance can degrade on small or sparsely populated datasets. With external semantic knowledge added, the proposed models yield topics that align more closely with human understanding, which improves both the interpretability and the practical applicability of topic modeling outputs.

The paper’s methodology invites further exploration into how latent feature representations might be optimized (e.g., fine-tuned) for specific applications or datasets. Moreover, the results suggest future research could explore integrating additional sources of external information or adapting the models for online learning scenarios to handle larger, streaming datasets more efficiently.

Conclusion

Nguyen et al. show that latent feature word representations improve topic model performance. The results are most relevant to applications that need robust topic discovery and classification when data is scarce or documents are short. Integrating external knowledge through latent feature vectors is a meaningful step toward more accurate and reliable topic models, and future work could further optimize and extend these techniques for other natural language processing tasks.
