On the k-Means/Median Cost Function (1704.05232v2)

Published 18 Apr 2017 in cs.DS

Abstract: In this work, we study the $k$-means cost function. Given a dataset $X \subseteq \mathbb{R}^d$ and an integer $k$, the goal of the Euclidean $k$-means problem is to find a set of $k$ centers $C \subseteq \mathbb{R}^d$ such that $\Phi(C, X) \equiv \sum_{x \in X} \min_{c \in C} \|x - c\|^2$ is minimized. Let $\Delta(X,k) \equiv \min_{C \subseteq \mathbb{R}^d, |C| = k} \Phi(C, X)$ denote the cost of the optimal $k$-means solution. For any dataset $X$, $\Delta(X,k)$ is non-increasing in $k$. In this work, we try to understand this behaviour more precisely. For any dataset $X \subseteq \mathbb{R}^d$, integer $k \geq 1$, and a precision parameter $\varepsilon > 0$, let $L(X, k, \varepsilon)$ denote the smallest integer such that $\Delta(X, L(X, k, \varepsilon)) \leq \varepsilon \cdot \Delta(X,k)$. We show upper and lower bounds on this quantity. Our techniques generalize to the metric $k$-median problem in arbitrary metric spaces, and we give bounds in terms of the doubling dimension of the metric. Finally, we observe that for any dataset $X$, we can compute a set $S$ of size $O \left(L(X, k, \varepsilon/c) \right)$ using $D^2$-sampling such that $\Phi(S,X) \leq \varepsilon \cdot \Delta(X,k)$ for some fixed constant $c$. We also discuss some applications of our bounds.
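
To make the abstract's two main objects concrete, here is a minimal Python sketch of the cost $\Phi(C, X)$ and of $D^2$-sampling in its standard form (first center chosen uniformly, each later center chosen with probability proportional to its squared distance to the nearest center picked so far). The function names `sq_dist`, `phi`, and `d2_sample`, the parameter `m`, and the fixed default seed are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the paper's analysis or its choice of sample size.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def phi(centers, points):
    """k-means cost Phi(C, X): each point pays its squared distance
    to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def d2_sample(points, m, rng=None):
    """Pick up to m centers from `points` by D^2-sampling (assumed
    standard form; `m` is a hypothetical parameter name)."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    centers = [rng.choice(points)]
    # dist[i] = squared distance from points[i] to its nearest chosen center
    dist = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < m:
        if sum(dist) == 0:
            break  # every point already coincides with a chosen center
        idx = rng.choices(range(len(points)), weights=dist, k=1)[0]
        centers.append(points[idx])
        # Adding a center can only shrink each point's nearest distance.
        dist = [min(d, sq_dist(x, points[idx])) for d, x in zip(dist, points)]
    return centers
```

Calling `phi(d2_sample(X, m), X)` for increasing `m` gives an empirical feel for how quickly the cost of sampled centers drops. It is only an upper bound on $\Delta(X, m)$, since the sampled centers are restricted to points of $X$ and are not optimized.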

Authors (3)

Anup Bhattacharya (15 papers)
Yoav Freund (31 papers)
Ragesh Jaiswal (27 papers)

Citations (4)

Related Papers

Sets Clustering (2020)
Tight FPT Approximation for Socially Fair Clustering (2021)
Hardness of Approximation of Euclidean $k$-Median (2020)
FPT Approximation for Constrained Metric $k$-Median/Means (2020)
Streaming PTAS for Constrained k-Means (2019)