Mixture-Models: a one-stop Python Library for Model-based Clustering using various Mixture Models (2402.10229v1)
Abstract: \texttt{Mixture-Models} is an open-source Python library for fitting Gaussian Mixture Models (GMMs) and their variants, such as parsimonious GMMs, mixtures of factor analyzers, MClust models, and mixtures of Student's t distributions. It streamlines the implementation and analysis of these models using various first- and second-order optimization routines, such as gradient descent and Newton-CG, implemented through automatic differentiation (AD) tools. This makes the models applicable to high-dimensional data; to our knowledge, it is the first Python library to do so. The library also provides user-friendly model-evaluation tools, such as BIC, AIC, and log-likelihood estimation. The source code is licensed under the MIT license and can be accessed at \url{https://github.com/kasakh/Mixture-Models}. The package is highly extensible, allowing users to incorporate new distributions and optimization techniques with ease. We conduct a large-scale simulation comparing the performance of various gradient-based approaches against Expectation-Maximization (EM) over a wide range of settings and identify the best-suited approach for each.
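The gradient-based fitting strategy the abstract describes can be illustrated with a minimal NumPy sketch (this is not the library's actual API): the GMM parameters are reparameterized into unconstrained form (log-scales for standard deviations, softmax logits for mixing weights), so that plain gradient ascent on the log-likelihood is valid, and BIC/AIC are computed from the fitted model. The gradients here are written analytically; the library instead obtains them via AD.

```python
import numpy as np

def gmm_loglik_and_grads(mu, log_s, logits, x):
    """Average log-likelihood of a 1-D K-component GMM and its gradients
    w.r.t. unconstrained parameters (means, log-scales, mixing logits)."""
    sigma = np.exp(log_s)                          # scales stay positive
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                                 # softmax -> valid weights
    z = (x[:, None] - mu) / sigma                  # (n, K) standardized residuals
    log_pdf = -0.5 * z**2 - log_s - 0.5 * np.log(2 * np.pi)
    log_w = np.log(pi) + log_pdf                   # log(pi_k * N_k(x_i))
    m = log_w.max(axis=1, keepdims=True)           # log-sum-exp for stability
    ll = (m[:, 0] + np.log(np.exp(log_w - m).sum(axis=1))).mean()
    r = np.exp(log_w - m)
    r /= r.sum(axis=1, keepdims=True)              # responsibilities
    g_mu = (r * z / sigma).mean(axis=0)
    g_log_s = (r * (z**2 - 1)).mean(axis=0)
    g_logits = (r - pi).mean(axis=0)
    return ll, g_mu, g_log_s, g_logits

# Synthetic data: two well-separated components
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])

mu, log_s, logits = np.array([0.0, 1.0]), np.zeros(2), np.zeros(2)
lr, lls = 0.05, []
for _ in range(1500):                              # plain gradient ascent
    ll, g_mu, g_ls, g_lg = gmm_loglik_and_grads(mu, log_s, logits, x)
    lls.append(ll)
    mu += lr * g_mu
    log_s += lr * g_ls
    logits += lr * g_lg

# Model-selection scores (p = 2 means + 2 scales + 1 free mixing weight)
n, p = x.size, 5
total_ll = n * lls[-1]
bic = -2 * total_ll + p * np.log(n)
aic = -2 * total_ll + 2 * p
```

The reparameterization is the key design point: because all constraints (positive variances, weights on the simplex) are absorbed into the parameterization, any off-the-shelf first- or second-order optimizer can be applied directly, which is what lets AD-based fitting scale to the higher-dimensional variants the library covers.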