Emergent Mind

Abstract

We consider bottom-k sampling for a set X, picking a sample Sk(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f=|Y|/|X| of any subset Y as |Sk(X) intersect Y|/k. A standard application is the estimation of the Jaccard similarity f=|A intersect B|/|A union B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as Sk(A union B)=Sk(Sk(A) union Sk(B)), and then the similarity is estimated as |Sk(A union B) intersect Sk(A) intersect Sk(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of kxmin-wise where we use k hash independent functions h1,...,h_k, storing the smallest element with each hash function. For kxmin-wise there is an at least constant bias with constant independence, and it is not reduced with larger k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simply union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.

We're not able to analyze this paper right now due to high demand.

Please check back later (sorry!).

Generate a summary of this paper on our Pro plan:

We ran into a problem analyzing this paper.

Newsletter

Get summaries of trending comp sci papers delivered straight to your inbox:

Unsubscribe anytime.