Streaming Euclidean $k$-median and $k$-means with $o(\log n)$ Space (2310.02882v1)

Published 4 Oct 2023 in cs.DS

Abstract: We consider the classic Euclidean $k$-median and $k$-means objectives on data streams, where the goal is to provide a $(1+\varepsilon)$-approximation to the optimal $k$-median or $k$-means solution while using as little memory as possible. Over the last 20 years, clustering in data streams has received a tremendous amount of attention and has been the test-bed for a large variety of new techniques, including coresets, the merge-and-reduce framework, bicriteria approximation, sensitivity sampling, and so on. Despite this intense effort to obtain smaller sketches for these problems, all known techniques require storing at least $\Omega(\log(n\Delta))$ words of memory, where $n$ is the size of the input and $\Delta$ is the aspect ratio. A natural question is whether one can beat this logarithmic dependence on $n$ and $\Delta$. In this paper, we break this barrier by first giving an insertion-only streaming algorithm that achieves a $(1+\varepsilon)$-approximation to the more general $(k,z)$-clustering problem, using $\tilde{\mathcal{O}}\left(\frac{dk}{\varepsilon^{2}}\right)\cdot\left(2^{z\log z}\right)\cdot\min\left(\frac{1}{\varepsilon^{z}},k\right)\cdot\text{poly}(\log\log(n\Delta))$ words of memory. Our techniques can also be used to achieve two-pass algorithms for $k$-median and $k$-means clustering on dynamic streams using $\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{2}}\right)\cdot\text{poly}(d,k,\log\log(n\Delta))$ words of memory.
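
For readers unfamiliar with the objective, the following is a minimal sketch of the standard $(k,z)$-clustering cost, in which each point pays its Euclidean distance to the nearest center raised to the power $z$ ($z=1$ gives $k$-median, $z=2$ gives $k$-means). It is an offline illustration only and is not the streaming algorithm of the paper; the function name `clustering_cost` and the sample points are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def clustering_cost(points: Sequence[Point], centers: Sequence[Point], z: float) -> float:
    """Return the (k,z)-clustering cost of `centers` on `points`.

    Each point contributes (Euclidean distance to its nearest center) ** z.
    Illustrative helper only; the name and signature are assumptions.
    """
    if not centers:
        raise ValueError("at least one center is required")
    return sum(min(math.dist(p, c) for c in centers) ** z for p in points)


# Hypothetical toy input: four points in the plane and k = 2 centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 1.0), (10.0, 2.0)]

k_median_cost = clustering_cost(points, centers, z=1)  # distances 1, 1, 2, 2
k_means_cost = clustering_cost(points, centers, z=2)   # squared distances 1, 1, 4, 4
```

A $(1+\varepsilon)$-approximation is then any set of $k$ centers whose cost is at most $(1+\varepsilon)$ times the minimum of this quantity over all choices of $k$ centers.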

