Papers
Topics
Authors
Recent
2000 character limit reached

On the k-Means/Median Cost Function (1704.05232v2)

Published 18 Apr 2017 in cs.DS

Abstract: In this work, we study the $k$-means cost function. Given a dataset $X \subseteq \mathbb{R}d$ and an integer $k$, the goal of the Euclidean $k$-means problem is to find a set of $k$ centers $C \subseteq \mathbb{R}d$ such that $\Phi(C, X) \equiv \sum_{x \in X} \min_{c \in C} ||x - c||2$ is minimized. Let $\Delta(X,k) \equiv \min_{C \subseteq \mathbb{R}d} \Phi(C, X)$ denote the cost of the optimal $k$-means solution. For any dataset $X$, $\Delta(X,k)$ decreases as $k$ increases. In this work, we try to understand this behaviour more precisely. For any dataset $X \subseteq \mathbb{R}d$, integer $k \geq 1$, and a precision parameter $\varepsilon > 0$, let $L(X, k, \varepsilon)$ denote the smallest integer such that $\Delta(X, L(X, k, \varepsilon)) \leq \varepsilon \cdot \Delta(X,k)$. We show upper and lower bounds on this quantity. Our techniques generalize for the metric $k$-median problem in arbitrary metric spaces and we give bounds in terms of the doubling dimension of the metric. Finally, we observe that for any dataset $X$, we can compute a set $S$ of size $O \left(L(X, k, \varepsilon/c) \right)$ using $D2$-sampling such that $\Phi(S,X) \leq \varepsilon \cdot \Delta(X,k)$ for some fixed constant $c$. We also discuss some applications of our bounds.

Citations (4)

Summary

We haven't generated a summary for this paper yet.

Slide Deck Streamline Icon: https://streamlinehq.com

Whiteboard

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.