
Finding Outliers in Gaussian Model-Based Clustering (1907.01136v6)

Published 2 Jul 2019 in stat.ME and stat.ML

Abstract: Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the former two often requiring pre-specification of the number of outliers. The fact that sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points according to the subset log-likelihoods, which are deemed outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
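The distributional fact at the core of the abstract is classical: if the mean and covariance are estimated from the same $n$ points of a $p$-dimensional Gaussian sample, then the sample squared Mahalanobis distance, scaled by $n/(n-1)^2$, follows a $\mathrm{Beta}(p/2,\,(n-p-1)/2)$ distribution. The sketch below illustrates (not the OCLUST algorithm itself, just this building block) by simulating Gaussian data and checking the scaled distances against the Beta law; all variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Sample Mahalanobis distances w.r.t. the sample mean and covariance
# (np.cov divides by n-1, matching the usual unbiased estimator).
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)

# Classical result: n * d^2 / (n-1)^2 ~ Beta(p/2, (n-p-1)/2).
u = n * d2 / (n - 1) ** 2

# One-sample KS test against the Beta reference distribution; under
# the correct null its p-value is uniform on [0, 1].
ks = stats.kstest(u, "beta", args=(p / 2, (n - p - 1) / 2))
print(ks.pvalue)
```

A useful sanity check on the scaling: with the unbiased sample covariance, the distances satisfy $\sum_i d_i^2 = (n-1)p$ identically, so the scaled values $u_i$ have mean exactly $p/(n-1)$, the mean of the stated Beta distribution.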
