Improved Space-Efficient Approximate Nearest Neighbor Search Using Function Inversion (2407.02468v1)
Abstract: Approximate nearest neighbor search (ANN) data structures have widespread applications in machine learning, computational biology, and text processing. The goal of ANN is to preprocess a set S so that, given a query q, we can find a point y whose distance from q approximates the smallest distance from q to any point in S. For most distance functions, the best-known ANN bounds for high-dimensional point sets are obtained using techniques based on locality-sensitive hashing (LSH). Unfortunately, space efficiency is a major challenge for LSH-based data structures. Classic LSH techniques require a very large amount of space, often polynomial in |S|. A long line of work has developed intricate techniques to reduce this space usage, but these techniques suffer from downsides: they must be hand-tailored to each specific LSH, are often complicated, and their space reduction comes at the cost of significantly increased query times. In this paper we explore a new way to improve the space efficiency of LSH using function inversion techniques, originally developed in (Fiat and Naor 2000). We begin by describing how function inversion can be used to improve LSH data structures. This gives a fairly simple, black-box method to reduce LSH space usage. Then, we give a data structure that leverages function inversion to improve the query time of the best-known near-linear space data structure for approximate nearest neighbor search under Euclidean distance: the ALRW data structure of (Andoni, Laarhoven, Razenshteyn, and Waingarten 2017). ALRW was previously shown to be optimal among "list-of-points" data structures for both Euclidean and Manhattan ANN; thus, in addition to giving improved bounds, our results imply that list-of-points data structures are not optimal for Euclidean or Manhattan ANN.
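To make the space concern concrete, the following is a minimal sketch of a classic multi-table LSH index. It assumes bit sampling for Hamming distance as the hash family, which the abstract does not specify; the class name `BitSamplingLSH` and the parameters `k` and `num_tables` are illustrative only. The sketch shows why space grows with the number of tables (every point is stored once per table). It does not implement the function-inversion technique of the paper.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


class BitSamplingLSH:
    """Illustrative classic LSH index: num_tables tables, each keyed on k sampled bits."""

    def __init__(self, dim, k, num_tables, seed=0):
        rng = random.Random(seed)
        # Each table hashes a point by projecting it onto k random coordinates.
        self.projections = [
            tuple(rng.randrange(dim) for _ in range(k)) for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _key(self, proj, point):
        return tuple(point[i] for i in proj)

    def insert(self, point):
        idx = len(self.points)
        self.points.append(point)
        # Every point is stored once per table: this is the space cost.
        for proj, table in zip(self.projections, self.tables):
            table[self._key(proj, point)].append(idx)

    def stored_entries(self):
        """Total number of (table, point) entries kept by the index."""
        return sum(len(bucket) for t in self.tables for bucket in t.values())

    def query(self, q):
        """Return the closest point among all bucket collisions, or None."""
        candidates = set()
        for proj, table in zip(self.projections, self.tables):
            candidates.update(table.get(self._key(proj, q), ()))
        if not candidates:
            return None
        return min((self.points[i] for i in candidates), key=lambda p: hamming(p, q))
```

With n points and L tables, `stored_entries()` is n·L. This multiplicative blow-up in stored entries is the cost that space-efficient variants aim to reduce.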
- Parameter-free locality sensitive hashing for spherical range reporting. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 239–256. SIAM, 2017.
- Thomas Dybdahl Ahle. Optimal Las Vegas locality sensitive data structures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 938–949. IEEE, 2017.
- On the complexity of inner product similarity join. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 151–164, 2016.
- Probabilistic polynomials and hamming nearest neighbors. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 136–150. IEEE, 2015.
- Alexandr Andoni. Nearest neighbor search: the old, the new, and the impossible. PhD thesis, Massachusetts Institute of Technology, 2009.
- Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 47th Annual Symposium on Foundations of Computer Science (FOCS), pages 459–468. IEEE, 2006.
- Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry, pages 1135–1155. Chapman and Hall/CRC, 2017.
- Practical and optimal LSH for angular distance. Advances in Neural Information Processing Systems, 28, 2015.
- Beyond locality-sensitive hashing. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1018–1028. SIAM, 2014.
- Lower bounds on time-space trade-offs for approximate near neighbors. arXiv preprint arXiv:1605.02701, 2016.
- Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms, pages 47–66. SIAM, 2017.
- Optimal hashing-based time-space trade-offs for approximate near neighbors. CoRR, 2016.
- Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 793–801, 2015.
- A general technique for searching in implicit sets via function inversion. arXiv preprint arXiv:2311.12471, 2023.
- Time and space efficient collinearity indexing. Computational Geometry, 110:101963, 2023.
- Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems, 87:101374, 2020.
- Fair near neighbor search via sampling. ACM SIGMOD Record, 50(1):42–49, 2021.
- Fair near neighbor search: Independent range sampling in high dimensions. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 191–204, 2020.
- Rigorous bounds on cryptanalytic time/memory tradeoffs. In Annual International Cryptology Conference, pages 1–21. Springer, 2006.
- Pavel Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data: Recent advances in clustering, pages 25–71. Springer, 2006.
- Gapped string indexing in subquadratic space and sublinear query time. arXiv preprint arXiv:2211.16860, 2022.
- Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21–29. IEEE, 1997.
- Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, pages 493–507, 1952.
- Tobias Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 31–46. SIAM, 2017.
- Set similarity search beyond minhash. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pages 1094–1107, 2017.
- The function-inversion problem: Barriers and opportunities. In Theory of Cryptography Conference, pages 393–421. Springer, 2019.
- Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1):21–27, 1967.
- Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262, 2004.
- Locality-sensitive hashing of curves. In 33rd International Symposium on Computational Geometry (SoCG 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
- Rigorous time/space trade-offs for inverting functions. SIAM Journal on Computing, 29(3):790–803, 2000.
- Data structures meet cryptography: 3SUM with preprocessing. In Proceedings of the 52nd annual ACM SIGACT symposium on theory of computing, pages 294–307, 2020.
- Revisiting time-space tradeoffs for function inversion. In Annual International Cryptology Conference, pages 453–481. Springer, 2023.
- Locality-sensitive hashing for chi2 distance. IEEE transactions on pattern analysis and machine intelligence, 34(2):402–409, 2011.
- Locality sensitive hashing for scalable structural classification and clustering of web documents. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 359–368, 2013.
- Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8:321–350, 2012.
- Martin Hellman. A cryptanalytic time-memory trade-off. IEEE transactions on Information Theory, 26(4):401–406, 1980.
- Piotr Indyk. High-dimensional computational geometry. PhD thesis, Stanford University, 2001.
- Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998.
- Michael Kapralov. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 329–342, 2015.
- Faster compression methods for a weighted graph using locality sensitive hashing. Information Sciences, 421:237–253, 2017.
- The strong 3SUM-indexing conjecture is false. arXiv preprint arXiv:1907.11206, 2019.
- Thijs Laarhoven. Tradeoffs for nearest neighbors on the sphere. arXiv preprint arXiv:1511.07527, 2015.
- Thijs Laarhoven. Graph-based time-space trade-offs for approximate near neighbors. In 34th International Symposium on Computational Geometry (SoCG 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
- The geometry of graphs and some of its algorithmic applications. Combinatorica, 15:215–245, 1995.
- Locality-sensitive hashing for the edit distance. Bioinformatics, 35(14):i127–i135, 2019.
- Samuel McCauley. Approximate similarity search under edit distance using locality-sensitive hashing. In 24th International Conference on Database Theory, 2021.
- Set similarity search for skewed data. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 63–74, 2018.
- Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, 1969.
- Probability and computing: Randomization and probabilistic techniques in algorithms and data analysis. Cambridge university press, 2017.
- Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1186–1195, 2006.
- Aviad Rubinstein. Hardness of approximate nearest neighbor search. In Proceedings of the 50th annual ACM SIGACT symposium on theory of computing, pages 1260–1268, 2018.
- A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
- Compressing locality sensitive hashing tables. In 2013 Mexican International Conference on Computer Science, pages 41–46. IEEE, 2013.
- Ryan Williams. On the difference between closest, furthest, and orthogonal pairs: Nearly-linear vs barely-subquadratic complexity. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1207–1215. SIAM, 2018.
- Andrew Chi-Chih Yao. Coherent functions and program checkers. In Proceedings of the twenty-second annual ACM symposium on Theory of computing, pages 84–94, 1990.