Emergent Mind

On the k-Means/Median Cost Function

(1704.05232)
Published Apr 18, 2017 in cs.DS

Abstract

In this work, we study the $k$-means cost function. Given a dataset $X \subseteq \mathbb{R}d$ and an integer $k$, the goal of the Euclidean $k$-means problem is to find a set of $k$ centers $C \subseteq \mathbb{R}d$ such that $\Phi(C, X) \equiv \sum{x \in X} \min{c \in C} ||x - c||2$ is minimized. Let $\Delta(X,k) \equiv \min_{C \subseteq \mathbb{R}d} \Phi(C, X)$ denote the cost of the optimal $k$-means solution. For any dataset $X$, $\Delta(X,k)$ decreases as $k$ increases. In this work, we try to understand this behaviour more precisely. For any dataset $X \subseteq \mathbb{R}d$, integer $k \geq 1$, and a precision parameter $\varepsilon > 0$, let $L(X, k, \varepsilon)$ denote the smallest integer such that $\Delta(X, L(X, k, \varepsilon)) \leq \varepsilon \cdot \Delta(X,k)$. We show upper and lower bounds on this quantity. Our techniques generalize for the metric $k$-median problem in arbitrary metric spaces and we give bounds in terms of the doubling dimension of the metric. Finally, we observe that for any dataset $X$, we can compute a set $S$ of size $O \left(L(X, k, \varepsilon/c) \right)$ using $D2$-sampling such that $\Phi(S,X) \leq \varepsilon \cdot \Delta(X,k)$ for some fixed constant $c$. We also discuss some applications of our bounds.

We're not able to analyze this paper right now due to high demand.

Please check back later (sorry!).

Generate a summary of this paper on our Pro plan:

We ran into a problem analyzing this paper.

Newsletter

Get summaries of trending comp sci papers delivered straight to your inbox:

Unsubscribe anytime.