Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 41 tok/s
Gemini 2.5 Pro 46 tok/s Pro
GPT-5 Medium 21 tok/s Pro
GPT-5 High 20 tok/s Pro
GPT-4o 91 tok/s Pro
Kimi K2 178 tok/s Pro
GPT OSS 120B 474 tok/s Pro
Claude Sonnet 4 38 tok/s Pro
2000 character limit reached

C-MinHash: Practically Reducing Two Permutations to Just One (2109.04595v1)

Published 10 Sep 2021 in cs.DS and cs.LG

Abstract: Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data, where $K$ can be (e.g.,) 1024 or even larger, depending on applications. The recent work on C-MinHash (Li and Li, 2021) has shown, with rigorous proofs, that only two permutations are needed. An initial permutation is applied to break whatever structures which might exist in the data, and a second permutation is re-used $K$ times to produce $K$ hashes, via a circulant shifting fashion. (Li and Li, 2021) has proved that, perhaps surprisingly, even though the $K$ hashes are correlated, the estimation variance is strictly smaller than the variance of the traditional MinHash. It has been demonstrated in (Li and Li, 2021) that the initial permutation in C-MinHash is indeed necessary. For the ease of theoretical analysis, they have used two independent permutations. In this paper, we show that one can actually simply use one permutation. That is, one single permutation is used for both the initial pre-processing step to break the structures in the data and the circulant hashing step to generate $K$ hashes. Although the theoretical analysis becomes very complicated, we are able to explicitly write down the expression for the expectation of the estimator. The new estimator is no longer unbiased but the bias is extremely small and has essentially no impact on the estimation accuracy (mean square errors). An extensive set of experiments are provided to verify our claim for using just one permutation.

Citations (1)
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-Up Questions

We haven't generated follow-up questions for this paper yet.

Authors (2)