Fast Similarity Sketching (1704.04370v4)

Published 14 Apr 2017 in cs.DS

Abstract: We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that the Jaccard similarity $J(A,B) = |A\cap B|/|A\cup B|$ between sets $A$ and $B$ is preserved. More precisely, define $X_i = [S(A)[i] = S(B)[i]]$ and $X = \sum_{i\in [t]} X_i$. We want $E[X_i]=J(A,B)$, and we want $X$ to be strongly concentrated around $E[X] = t \cdot J(A,B)$ (i.e. Chernoff-style bounds). This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called $\textit{sketches}$. Strong concentration is critical, for often we want to sketch many sets $B_1,\ldots,B_n$ so that we can later, for a query set $A$, find (one of) the most similar $B_i$. It is then critical that no $B_i$ looks much more similar to $A$ due to errors in the sketch. The seminal $t\times\textit{MinHash}$ algorithm uses $t$ random hash functions $h_1,\ldots, h_t$, and stores $\left ( \min_{a\in A} h_1(a),\ldots, \min_{a\in A} h_t(a) \right )$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. (continued...)
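To make the baseline concrete, here is a minimal Python sketch of the classic $t\times$MinHash scheme and the Jaccard estimator $X/t$ described in the abstract. This is not the paper's faster sketching algorithm, only the $O(t\cdot |A|)$ baseline it improves on; the function names (`minhash_sketch`, `estimate_jaccard`) are illustrative, and seeded calls to Python's built-in `hash` stand in for the $t$ random hash functions $h_1,\ldots,h_t$ (a real implementation would use stronger hashing, e.g. mixed tabulation).

```python
import random

def minhash_sketch(A, hash_functions):
    """Classic t x MinHash: store one minimum per hash function, O(t * |A|) time."""
    return [min(h(a) for a in A) for h in hash_functions]

def estimate_jaccard(sketch_a, sketch_b):
    """X = sum_i [S(A)[i] == S(B)[i]]; X / t is an unbiased estimate of J(A, B)."""
    t = len(sketch_a)
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / t

if __name__ == "__main__":
    t = 256  # sketch length
    rng = random.Random(42)
    # Seeded stand-ins for the t random hash functions h_1, ..., h_t.
    # Python's hash() is consistent within a single run, which is all we need here.
    seeds = [rng.getrandbits(64) for _ in range(t)]
    hash_functions = [lambda x, s=s: hash((s, x)) for s in seeds]

    A = set(range(0, 1000))
    B = set(range(500, 1500))
    true_j = len(A & B) / len(A | B)  # 500 / 1500 = 1/3
    est_j = estimate_jaccard(minhash_sketch(A, hash_functions),
                             minhash_sketch(B, hash_functions))
    print(f"true J(A,B) = {true_j:.3f}  estimated = {est_j:.3f}")
```

With $t = 256$ entries per sketch, the estimate typically lands within a few percentage points of the true Jaccard similarity; the paper's concern is obtaining such sketches, with Chernoff-style concentration, in time faster than $O(t\cdot |A|)$.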

