Finding Outliers in Gaussian Model-Based Clustering (1907.01136v6)
Abstract: Clustering, or unsupervised classification, is a task often plagued by outliers, yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the first two often requiring pre-specification of the number of outliers. The fact that the sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points, which are deemed outliers, according to the subset log-likelihoods, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
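The subset log-likelihoods rest on a classical result: for an i.i.d. sample of size n from a p-variate Gaussian, the squared sample Mahalanobis distance D_i^2 (computed with the sample mean and sample covariance) satisfies, in its usual form, n D_i^2 / (n-1)^2 ~ Beta(p/2, (n-p-1)/2). As a rough illustration of the trimming idea only, and not the paper's derivation or its stopping rule, the R sketch below fits a Gaussian mixture with the mclust package, computes leave-one-out "subset" log-likelihoods, and repeatedly removes the point whose deletion raises the log-likelihood the most. OCLUST stops when the subset log-likelihoods conform to the derived beta-based reference distribution; here a fixed trim count (`n_trim`) stands in for that test, and the function name `trim_most_implausible` is purely illustrative.

```r
## Illustrative sketch only: leave-one-out ("subset") log-likelihood trimming
## for a Gaussian mixture. The stopping rule here is a fixed trim count, not
## the OCLUST test against the beta-based reference distribution.
library(mclust)

trim_most_implausible <- function(X, G, n_trim = 5) {
  X <- as.matrix(X)
  idx <- seq_len(nrow(X))          # original row indices of retained points
  outliers <- integer(0)
  for (k in seq_len(n_trim)) {
    n <- nrow(X)
    ## log-likelihood of the mixture fitted to each leave-one-out subset
    subset_ll <- vapply(seq_len(n), function(i) {
      Mclust(X[-i, , drop = FALSE], G = G, verbose = FALSE)$loglik
    }, numeric(1))
    ## the point whose removal raises the log-likelihood the most is the
    ## least plausible under the current Gaussian mixture
    worst <- which.max(subset_ll)
    outliers <- c(outliers, idx[worst])
    X <- X[-worst, , drop = FALSE]
    idx <- idx[-worst]
  }
  list(outliers = outliers, cleaned = X)
}

## Example: two well-separated bivariate clusters plus one gross outlier
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2),
           c(20, -20))
res <- trim_most_implausible(X, G = 2, n_trim = 1)
res$outliers  # typically flags row 101, the gross outlier
```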