Hierarchical clustering with dot products recovers hidden tree structure (2305.15022v3)
Abstract: In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
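The merge rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' reference implementation: the function name `dot_product_linkage`, the output format, and the naive search for the best pair are assumptions made here for illustration. At each step, the two clusters with the largest average pairwise dot product are merged, and the similarity of the merged cluster to every other cluster is the size-weighted mean of the two previous averages.

```python
import numpy as np


def dot_product_linkage(X):
    """Agglomerative clustering by maximum average dot product (illustrative).

    Returns a list of merges (id_a, id_b, similarity, new_size). Leaves are
    numbered 0..n-1 and each merge creates cluster n, n+1, ... in order.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Cluster-to-cluster similarities; initially the Gram matrix of the points.
    S = X @ X.T
    np.fill_diagonal(S, -np.inf)  # a cluster is never merged with itself

    ids = list(range(n))   # cluster id held in each row/column of S
    sizes = [1] * n        # number of points in each active cluster
    merges = []
    next_id = n

    while len(ids) > 1:
        # Pick the pair of active clusters with the largest average dot product.
        a, b = np.unravel_index(np.argmax(S), S.shape)
        a, b = int(min(a, b)), int(max(a, b))
        sim = float(S[a, b])

        # Average dot product between the merged cluster and every other
        # cluster: size-weighted mean of the two old rows.
        new_row = (sizes[a] * S[a] + sizes[b] * S[b]) / (sizes[a] + sizes[b])

        merges.append((ids[a], ids[b], sim, sizes[a] + sizes[b]))

        # Store the merged cluster in slot a, then drop slot b.
        S[a, :] = new_row
        S[:, a] = new_row
        S[a, a] = -np.inf
        S = np.delete(np.delete(S, b, axis=0), b, axis=1)

        ids[a] = next_id
        sizes[a] += sizes[b]
        del ids[b]
        del sizes[b]
        next_id += 1

    return merges


if __name__ == "__main__":
    # Two loosely aligned pairs of points (made-up toy data).
    X = np.array([[2.0, 0.0], [1.5, 0.1], [0.0, 1.0], [0.1, 0.9]])
    for id_a, id_b, sim, size in dot_product_linkage(X):
        print(f"merge {id_a} + {id_b} (avg dot product {sim:.3f}, size {size})")
```

Unlike distance-based linkages, the similarity here depends on vector norms as well as angles, so rescaling or centring the data can change the resulting tree. The sketch recomputes the global maximum at each step and is intended only for small inputs.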
Authors:
- Annie Gray
- Alexander Modell
- Patrick Rubin-Delanchy
- Nick Whiteley