
Recent Advances in Text Analysis (2401.00775v2)

Published 1 Jan 2024 in stat.AP and cs.IR

Abstract: Text analysis is an interesting research area in data science with various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze MADStat, a dataset on statistical publications that we collected and cleaned. Applying Topic-SCORE and other methods to MADStat leads to interesting findings. For example, 11 representative topics in statistics are identified. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze trends in statistical research. We also propose a new statistical model for ranking the citation impacts of the 11 topics, and we build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of statistical research in 1975-2015 from a text analysis perspective.
