Prompting Large Language Models for Topic Modeling (2312.09693v1)
Abstract: Topic modeling is a widely used technique for revealing the underlying thematic structure of textual data. However, existing models have notable limitations, particularly on short-text datasets where co-occurring words are scarce, and they often focus on token-level semantics while neglecting sentence-level semantics. In this paper, we propose PromptTopic, a novel topic modeling approach that harnesses the advanced language understanding of large language models (LLMs) to address these challenges. It extracts topics at the sentence level from individual documents, then aggregates and condenses these topics into a predefined number, ultimately providing coherent topics for texts of varying lengths. This approach eliminates the need for manual parameter tuning and improves the quality of the extracted topics. We benchmark PromptTopic against state-of-the-art baselines on three diverse datasets, establishing its proficiency in discovering meaningful topics. Furthermore, qualitative analysis showcases PromptTopic's ability to uncover relevant topics across multiple datasets.
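As a rough illustration of the two-stage pipeline the abstract describes (per-document topic extraction followed by aggregation and condensation into a predefined number of topics), the sketch below shows one way it could be wired up. The `call_llm` helper, the prompt wording, and the frequency-based pooling are illustrative assumptions, not the paper's actual prompts or implementation.

```python
from collections import Counter
from typing import Callable, List

def prompt_topic_sketch(
    documents: List[str],
    num_topics: int,
    call_llm: Callable[[str], str],  # hypothetical LLM wrapper: prompt in, completion out
) -> List[str]:
    """Minimal sketch of the pipeline described in the abstract:
    (1) prompt an LLM for a topic per document,
    (2) condense the pooled candidates into a fixed number of final topics."""
    # Stage 1: sentence-level topic extraction from each individual document.
    candidates = []
    for doc in documents:
        topic = call_llm(
            "Summarize the main topic of the following text in a few words:\n" + doc
        ).strip()
        candidates.append(topic)

    # Stage 2: aggregate duplicate candidates, then ask the LLM to merge the
    # remaining pool down to the predefined number of topics.
    pooled = [t for t, _ in Counter(candidates).most_common()]
    merged = call_llm(
        f"Merge the following candidate topics into {num_topics} distinct, "
        "coherent topics, one per line:\n" + "\n".join(pooled)
    )
    return [line.strip() for line in merged.splitlines() if line.strip()][:num_topics]
```

Any concrete use would plug in a real model behind `call_llm` and likely batch or truncate long candidate pools before the condensation prompt; those details are outside the scope of this sketch.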