
Universal Lower Bounds and Optimal Rates: Achieving Minimax Clustering Error in Sub-Exponential Mixture Models (2402.15432v2)

Published 23 Feb 2024 in math.ST, cs.LG, stat.ML, and stat.TH

Abstract: Clustering is a pivotal challenge in unsupervised machine learning and is often investigated through the lens of mixture models. The optimal error rate for recovering cluster labels in Gaussian and sub-Gaussian mixture models involves ad hoc signal-to-noise ratios. Simple iterative algorithms, such as Lloyd's algorithm, attain this optimal error rate. In this paper, we first establish a universal lower bound for the error rate in clustering any mixture model, expressed through a Chernoff divergence, a more versatile measure of model information than signal-to-noise ratios. We then demonstrate that iterative algorithms attain this lower bound in mixture models with sub-exponential tails, notably emphasizing location-scale mixtures featuring Laplace-distributed errors. Additionally, for datasets better modelled by Poisson or Negative Binomial mixtures, we study mixture models whose distributions belong to an exponential family. In such mixtures, we establish that Bregman hard clustering, a variant of Lloyd's algorithm employing a Bregman divergence, is rate optimal.
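For orientation, the "Chernoff divergence" mentioned in the abstract is, in its standard formulation (the paper may normalize it differently), the Chernoff information between two component densities p and q:

C(p, q) = \sup_{t \in (0,1)} \Big( -\log \int p(x)^{t} \, q(x)^{1-t} \, \mathrm{d}\mu(x) \Big).

Below is a minimal, illustrative sketch of Bregman hard clustering in the Poisson setting described in the abstract: a Lloyd-style alternation that assigns each point to the nearest center under the Bregman divergence generated by phi(t) = t log t - t and then updates centers as cluster means. The function names, initialization, and toy data are assumptions made for illustration, not taken from the paper.

import numpy as np

def poisson_bregman(x, mu, eps=1e-12):
    # Bregman divergence generated by phi(t) = t*log(t) - t:
    #   d(x, mu) = x*log(x/mu) - x + mu, with 0*log(0) treated as 0.
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    ratio = np.maximum(x, eps) / np.maximum(mu, eps)
    return np.where(x > 0, x * np.log(ratio), 0.0) - x + mu

def bregman_hard_clustering(x, k, n_iter=100):
    # Lloyd-style alternation: assign by Bregman divergence, update by cluster means.
    x = np.asarray(x, dtype=float)
    centers = np.quantile(x, np.linspace(0.1, 0.9, k))  # simple deterministic init
    labels = np.zeros(x.shape[0], dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center in divergence.
        divergences = poisson_bregman(x[:, None], centers[None, :])  # shape (n, k)
        labels = divergences.argmin(axis=1)
        # Update step: the divergence-minimizing center of a cluster is its mean.
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy usage: two Poisson clusters with rates 3 and 12.
rng = np.random.default_rng(1)
data = np.concatenate([rng.poisson(3, 200), rng.poisson(12, 200)])
labels, centers = bregman_hard_clustering(data, k=2)
print(np.sort(centers))  # expected to be roughly [3, 12]

The update step is an ordinary average because, for any Bregman divergence, the divergence-minimizing center of a cluster is its arithmetic mean; this is what makes the Lloyd-style alternation carry over unchanged from squared Euclidean distance to exponential-family mixtures.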

