Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms (1810.05691v4)

Published 12 Oct 2018 in cs.LG and stat.ML

Abstract: Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean-as used in k-means-is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains such as biology that require the use of Jaccard, Gower, or more complex distances. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm to achieve an O(k)-fold speedup in the second SWAP phase of the algorithm, but will still find the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (at comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. With the substantially faster SWAP, we can now also explore alternative strategies for choosing the initial medoids. We also show how the CLARA and CLARANS algorithms benefit from these modifications. It can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100, we observed a 200-fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets as long as we can afford to compute a distance matrix, and in particular to higher k (at k=2, the new SWAP was only 1.5 times faster, as the speedup is expected to increase with k).

Citations (252)

Summary

  • The paper introduces a restructured PAM swap phase that leverages a caching mechanism to achieve an O(k)-fold acceleration.
  • Experimental results show dramatic speedups—up to 200x at k=100—significantly boosting scalability for large and complex datasets.
  • The methodological improvements extend to CLARA and CLARANS, offering practical benefits in non-Euclidean data analysis for fields like genomics and image processing.

Analyzing Enhancements for k-Medoids Clustering Algorithms: PAM, CLARA, and CLARANS

The paper presents notable enhancements to the seminal Partitioning Around Medoids (PAM) algorithm and its derivatives, CLARA and CLARANS, which are widely used for clustering non-Euclidean data. These algorithms are pivotal in domains where dissimilarity measures such as Jaccard or Gower are adopted, including biological data analysis. The crux of the enhancements lies in substantial reductions in computation time, primarily targeting the PAM algorithm's swap phase, SWAP.
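To ground the discussion, the following minimal Python sketch (not the authors' implementation; the function name and toy data are illustrative) shows how a medoid is selected from a precomputed dissimilarity matrix: it is the cluster member with the smallest total dissimilarity to all other members.

```python
import numpy as np

def medoid(dissim: np.ndarray, members: np.ndarray) -> int:
    """Return the index of the cluster member with the smallest total
    dissimilarity to all other members (PAM's notion of a cluster center)."""
    sub = dissim[np.ix_(members, members)]   # pairwise dissimilarities within the cluster
    return int(members[np.argmin(sub.sum(axis=1))])

# Toy usage with a random symmetric dissimilarity matrix (illustrative only).
rng = np.random.default_rng(0)
A = rng.random((6, 6))
D = (A + A.T) / 2
np.fill_diagonal(D, 0.0)
print(medoid(D, np.arange(6)))
```

Because only the dissimilarity matrix is consulted, the same routine works for Jaccard, Gower, or any other (dis-)similarity, which is precisely why PAM is attractive outside Euclidean settings.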

Core Contributions and Methodological Advancements

The authors propose notable modifications to PAM that restructure the SWAP phase to achieve substantial computational speedups. By introducing a cache for certain computations and optimizing the iteration structure, they obtain an O(k)-fold acceleration of PAM's SWAP phase. Specifically, they store O(k) additional values, which allows redundant computations to be eliminated across swap candidates.
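As a rough illustration of this kind of caching (a simplified sketch assuming a precomputed dissimilarity matrix; it is not the paper's code), the loss change incurred by removing each current medoid can be computed in one pass and then reused for every swap candidate, instead of being recomputed for each (medoid, candidate) pair.

```python
import numpy as np

def removal_loss(dissim: np.ndarray, medoids: np.ndarray) -> np.ndarray:
    """For each current medoid, the increase in total deviation if that medoid
    were removed and its points reassigned to their second-nearest medoid.
    Assumes k >= 2. Illustrative sketch, not the authors' implementation."""
    d_to_med = dissim[:, medoids]              # (n, k) distances to the current medoids
    order = np.argsort(d_to_med, axis=1)
    rows = np.arange(len(dissim))
    nearest = order[:, 0]                      # index (into `medoids`) of each point's nearest medoid
    d_nearest = d_to_med[rows, nearest]
    d_second = d_to_med[rows, order[:, 1]]     # distance to the second-nearest medoid

    loss = np.zeros(len(medoids))
    # Removing medoid i only affects the points currently assigned to it.
    np.add.at(loss, nearest, d_second - d_nearest)
    return loss
```

These k cached values depend only on the current medoids, so they can be shared across all swap candidates within an iteration; this is the kind of redundancy elimination the restructured SWAP phase exploits.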

Experimentally, the research recorded dramatic speedups, up to 200× at k=100, when comparing the new SWAP phase to the original. This enhancement markedly improves PAM's scalability to larger datasets and higher k values, which were challenging in its earlier form due to the computational burden.

In addition to the optimizations in PAM, the paper discusses how the proposed modifications carry over to PAM's derivatives, CLARA and CLARANS. The experiments show that the performance benefits extend to these algorithms as well, and highlight that strategies such as subsampling (CLARA) and randomized search (CLARANS) make them suitable for even larger datasets.
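For intuition on the subsampling idea, here is a hedged CLARA-style sketch: run any k-medoids routine on several random subsamples and keep the medoid set that minimizes total deviation on the full data. The `pam_swap` argument is a hypothetical placeholder for whatever PAM/FastPAM routine is available and is assumed to return positional indices of the chosen medoids within the subsample.

```python
import numpy as np

def total_deviation(dissim: np.ndarray, medoids: np.ndarray) -> float:
    """Sum over all points of the dissimilarity to their nearest medoid."""
    return float(dissim[:, medoids].min(axis=1).sum())

def clara(dissim: np.ndarray, k: int, pam_swap, samples: int = 5,
          sample_size: int = 80, seed: int = 0) -> np.ndarray:
    """CLARA-style wrapper: cluster random subsamples, evaluate on the full data."""
    rng = np.random.default_rng(seed)
    n = len(dissim)
    best_medoids, best_td = None, np.inf
    for _ in range(samples):
        idx = rng.choice(n, size=min(sample_size, n), replace=False)
        local = pam_swap(dissim[np.ix_(idx, idx)], k)   # medoids within the subsample
        medoids = idx[np.asarray(local)]                # map back to full-data indices
        td = total_deviation(dissim, medoids)
        if td < best_td:
            best_medoids, best_td = medoids, td
    return best_medoids
```

Since the expensive step is the k-medoids call on each subsample, any speedup to SWAP is inherited directly by such a wrapper, which is why the paper's improvements carry over to CLARA and, analogously, to CLARANS's randomized search.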

Theoretical and Practical Implications

Theoretically, the advancements signify an essential leap towards better handling of k-medoids problems, especially in domains demanding significant computational efficiency due to voluminous data. The algorithmic improvements ensure that larger datasets and more complex distance matrices can be tackled effectively, thereby expanding the applicability of k-medoids clustering in practical scenarios.

Practically, these improvements make k-medoids a robust alternative to k-means when complex dissimilarity measures are required, and they deliver substantial real-world efficiency gains. Fields such as large-scale genomic data clustering, image processing, and other domains relying on non-Euclidean similarity measures stand to benefit in particular.

Future Directions

The paper leaves open several important avenues for future research. The authors touch upon, but do not fully explore, the parallelization potential for the improved PAM and CLARA algorithms. Future efforts might involve integrating these enhancements into distributed computing frameworks, further bridging the gap between theoretical efficiency and practical scalability on big data.

Furthermore, additional comparisons with alternative clustering approaches, especially within rapidly evolving data environments like online and streaming data, could offer more insights into the alignment of k-medoids improvements with modern data handling needs.

Overall, the modifications and resulting efficiencies highlighted in this work are a valuable contribution to the clustering literature, offering a crucial toolkit for both theoretical advances and practical realizations in clustering methodologies that rely on k-medoids optimization principles.