
Differentially Private One Permutation Hashing and Bin-wise Consistent Weighted Sampling (2306.07674v1)

Published 13 Jun 2023 in stat.ML, cs.CR, cs.DS, and cs.LG

Abstract: Minwise hashing (MinHash) is a standard algorithm widely used in industry for large-scale search and learning applications with the binary (0/1) Jaccard similarity. One common use of MinHash is processing massive n-gram text representations so that practitioners do not have to materialize the original data (which would be prohibitive). Another popular use is building hash tables to enable sub-linear time approximate near neighbor (ANN) search. MinHash has also been used as a tool for building large-scale machine learning systems. The standard implementation of MinHash requires applying $K$ random permutations. In comparison, one permutation hashing (OPH) is an efficient alternative to MinHash that splits the data vectors into $K$ bins and generates hash values within each bin. OPH is substantially more efficient and also more convenient to use. In this paper, we combine differential privacy (DP) with OPH (as well as MinHash) to propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH. A detailed roadmap of the algorithm design is presented along with the privacy analysis. An analytical comparison of the proposed DP-OPH methods with DP minwise hashing (DP-MH) is provided to justify the advantage of DP-OPH. Experiments on similarity search confirm the merits of DP-OPH and guide the choice of the proper variant in different practical scenarios. Our technique is also extended to bin-wise consistent weighted sampling (BCWS) to develop a new DP algorithm called DP-BCWS for non-binary data. Experiments on classification tasks demonstrate that DP-BCWS is able to achieve excellent utility at around $\epsilon = 5\sim 10$, where $\epsilon$ is the standard parameter in the language of $(\epsilon, \delta)$-DP.
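
To make the binning step described in the abstract concrete, the following is a minimal sketch of plain (non-private) OPH in Python. It is an illustration under stated assumptions, not the paper's DP-OPH algorithm: it adds no privacy noise and applies no densification, so empty bins are simply marked with -1. The function names `oph_sketch` and `oph_jaccard_estimate`, the -1 marker, and the requirement that the dimension `D` be divisible by `K` are all choices made for this example. The estimator divides the number of matching bins by the number of bins that are not empty in both sketches.

```python
import numpy as np


def oph_sketch(nonzeros, D, K, perm):
    """Plain one permutation hashing (no DP noise, no densification).

    nonzeros: iterable of indices in [0, D) where the binary vector is 1.
    perm:     a single permutation of range(D), shared by all vectors.
    Returns K values: the smallest permuted offset in each bin, or -1
    (a marker assumed for this example) when the bin is empty.
    """
    assert D % K == 0, "this sketch assumes K divides D"
    bin_size = D // K
    sketch = np.full(K, -1, dtype=np.int64)
    for i in nonzeros:
        # Permuted position -> (bin index, offset inside that bin).
        b, offset = divmod(int(perm[i]), bin_size)
        if sketch[b] == -1 or offset < sketch[b]:
            sketch[b] = offset
    return sketch


def oph_jaccard_estimate(s1, s2):
    """Matching bins divided by bins that are non-empty in at least one sketch."""
    both_empty = (s1 == -1) & (s2 == -1)
    matches = (s1 == s2) & ~both_empty
    n_used = int(np.count_nonzero(~both_empty))
    return float(matches.sum()) / n_used if n_used else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, K = 1024, 64
    perm = rng.permutation(D)
    a = set(range(0, 200))
    b = set(range(100, 300))  # true Jaccard similarity is 100/300
    est = oph_jaccard_estimate(oph_sketch(a, D, K, perm), oph_sketch(b, D, K, perm))
    print(f"estimated Jaccard: {est:.3f}")
```

Only one permutation is drawn, in contrast to the $K$ permutations of standard MinHash. The three DP-OPH variants named in the abstract differ in how they handle the bins this sketch leaves marked as empty.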

Authors (2)
  1. Xiaoyun Li (24 papers)
  2. Ping Li (421 papers)
Citations (4)
