Incremental hierarchical text clustering methods: a review (2312.07769v1)
Abstract: The growth in Internet usage has produced a large volume of continuously available data and created the need to organize that data automatically and efficiently. In this context, text clustering techniques are significant because they aim to organize documents according to their characteristics. More specifically, hierarchical and incremental clustering techniques can organize dynamic data in a hierarchical form, keeping this organization up to date and making it easier to explore. Given the relevance and contemporary nature of the field, this study analyzes various hierarchical and incremental clustering techniques; its main contribution is the organization and comparison of the techniques used by studies published between 2010 and 2018 that addressed text document clustering. We describe the principal concepts related to the challenge and the different characteristics of these published works in order to provide a better understanding of the research in this field.