On the k-Means/Median Cost Function (1704.05232v2)

Published 18 Apr 2017 in cs.DS

Abstract: In this work, we study the $k$-means cost function. Given a dataset $X \subseteq \mathbb{R}^d$ and an integer $k$, the goal of the Euclidean $k$-means problem is to find a set of $k$ centers $C \subseteq \mathbb{R}^d$ such that $\Phi(C, X) \equiv \sum_{x \in X} \min_{c \in C} \|x - c\|^2$ is minimized. Let $\Delta(X,k) \equiv \min_{C \subseteq \mathbb{R}^d} \Phi(C, X)$ denote the cost of the optimal $k$-means solution. For any dataset $X$, $\Delta(X,k)$ decreases as $k$ increases. In this work, we try to understand this behaviour more precisely. For any dataset $X \subseteq \mathbb{R}^d$, integer $k \geq 1$, and a precision parameter $\varepsilon > 0$, let $L(X, k, \varepsilon)$ denote the smallest integer such that $\Delta(X, L(X, k, \varepsilon)) \leq \varepsilon \cdot \Delta(X,k)$. We show upper and lower bounds on this quantity. Our techniques generalize for the metric $k$-median problem in arbitrary metric spaces and we give bounds in terms of the doubling dimension of the metric. Finally, we observe that for any dataset $X$, we can compute a set $S$ of size $O \left(L(X, k, \varepsilon/c) \right)$ using $D^2$-sampling such that $\Phi(S,X) \leq \varepsilon \cdot \Delta(X,k)$ for some fixed constant $c$. We also discuss some applications of our bounds.
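
To make the definitions in the abstract concrete, here is a minimal Python sketch of the cost function $\Phi(C, X)$ and of $D^2$-sampling as it is commonly described (the $k$-means++ seeding rule): the first center is drawn uniformly from $X$, and each later center is drawn with probability proportional to its squared distance to the nearest center chosen so far. This is an illustration only, not the authors' code; the names `sq_dist`, `cost`, and `d2_sample` are hypothetical, and the sketch does not compute $L(X, k, \varepsilon)$ or the constant $c$ from the paper.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance ||x - c||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def cost(centers: Sequence[Point], data: Sequence[Point]) -> float:
    """Phi(C, X): sum over x in X of the squared distance to its nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in data)


def d2_sample(data: Sequence[Point], m: int, rng: random.Random) -> List[Point]:
    """Choose up to m centers from data by D^2-sampling (illustrative sketch)."""
    # First center: uniform over the dataset.
    centers: List[Point] = [rng.choice(data)]
    # d2[i] holds the squared distance from data[i] to its nearest chosen center.
    d2 = [sq_dist(x, centers[0]) for x in data]
    while len(centers) < m:
        if sum(d2) == 0.0:
            break  # every point already coincides with a chosen center
        # Next center: probability proportional to the current squared distance.
        idx = rng.choices(range(len(data)), weights=d2, k=1)[0]
        centers.append(data[idx])
        d2 = [min(old, sq_dist(x, data[idx])) for old, x in zip(d2, data)]
    return centers
```

Because the sampled centers are nested, evaluating `cost(centers[:j], data)` for increasing `j` gives a non-increasing sequence, which mirrors the abstract's remark that the cost decreases as the number of centers grows.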

Authors (3)
  1. Anup Bhattacharya (15 papers)
  2. Yoav Freund (31 papers)
  3. Ragesh Jaiswal (27 papers)
Citations (4)
