Finding Outliers in Gaussian Model-Based Clustering (1907.01136v6)
Abstract: Clustering, or unsupervised classification, is a task often plagued by outliers, yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the first two often requiring pre-specification of the number of outliers. The fact that the sample squared Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points, which are deemed outliers, according to the subset log-likelihoods, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
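The subset log-likelihoods rest on a classical result: for an i.i.d. sample of size n from a p-variate Gaussian, the squared sample Mahalanobis distance D_i^2 (computed with the sample mean and sample covariance) satisfies, in its usual form, n D_i^2 / (n-1)^2 ~ Beta(p/2, (n-p-1)/2). As a rough illustration of the trimming idea only, and not the paper's derivation or its stopping rule, the R sketch below fits a Gaussian mixture with the mclust package, computes leave-one-out "subset" log-likelihoods, and repeatedly removes the point whose deletion raises the log-likelihood the most. OCLUST stops when the subset log-likelihoods conform to the derived beta-based reference distribution; here a fixed trim count (`n_trim`) stands in for that test, and the function name `trim_most_implausible` is purely illustrative.

```r
## Illustrative sketch only: leave-one-out ("subset") log-likelihood trimming
## for a Gaussian mixture. The stopping rule here is a fixed trim count, not
## the OCLUST test against the beta-based reference distribution.
library(mclust)

trim_most_implausible <- function(X, G, n_trim = 5) {
  X <- as.matrix(X)
  idx <- seq_len(nrow(X))          # original row indices of retained points
  outliers <- integer(0)
  for (k in seq_len(n_trim)) {
    n <- nrow(X)
    ## log-likelihood of the mixture fitted to each leave-one-out subset
    subset_ll <- vapply(seq_len(n), function(i) {
      Mclust(X[-i, , drop = FALSE], G = G, verbose = FALSE)$loglik
    }, numeric(1))
    ## the point whose removal raises the log-likelihood the most is the
    ## least plausible under the current Gaussian mixture
    worst <- which.max(subset_ll)
    outliers <- c(outliers, idx[worst])
    X <- X[-worst, , drop = FALSE]
    idx <- idx[-worst]
  }
  list(outliers = outliers, cleaned = X)
}

## Example: two well-separated bivariate clusters plus one gross outlier
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2),
           c(20, -20))
res <- trim_most_implausible(X, G = 2, n_trim = 1)
res$outliers  # typically flags row 101, the gross outlier
```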