ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data (2403.04871v1)
Abstract: Applications increasingly leverage mixed-modality data, and must jointly search over vector data, such as embedded images, text and video, as well as structured data, such as attributes and keywords. Proposed methods for this hybrid search setting either suffer from poor performance or support a severely restricted set of search predicates (e.g., only small sets of equality predicates), making them impractical for many applications. To address this, we present ACORN, an approach for performant and predicate-agnostic hybrid search. ACORN builds on Hierarchical Navigable Small Worlds (HNSW), a state-of-the-art graph-based approximate nearest neighbor index, and can be implemented efficiently by extending existing HNSW libraries. ACORN introduces the idea of predicate subgraph traversal to emulate a theoretically ideal, but impractical, hybrid search strategy. ACORN's predicate-agnostic construction algorithm is designed to enable this effective search strategy, while supporting a wide array of predicate sets and query semantics. We systematically evaluate ACORN on both prior benchmark datasets, with simple, low-cardinality predicate sets, and complex multi-modal datasets not supported by prior methods. We show that ACORN achieves state-of-the-art performance on all datasets, outperforming prior methods with 2-1,000x higher throughput at a fixed recall.
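To make the hybrid search setting concrete, the sketch below shows an exact brute-force baseline: keep only the records whose structured attributes pass the predicate, then return the k nearest of those by vector distance. This is not ACORN's algorithm and is not taken from the paper. The function name `hybrid_search_bruteforce`, the record layout, and the toy data are all hypothetical, chosen only to illustrate which results a predicate-agnostic hybrid index aims to approximate.

```python
import heapq
import math
from typing import Any, Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Record = Tuple[Vector, Dict[str, Any]]  # (embedding, structured attributes)


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hybrid_search_bruteforce(
    records: List[Record],
    query: Vector,
    predicate: Callable[[Dict[str, Any]], bool],
    k: int,
) -> List[int]:
    """Return indices of the k nearest records that satisfy `predicate`.

    Hypothetical exact baseline: it scans every record, so it is only
    practical for small datasets or very selective predicates.
    """
    candidates = (
        (l2(vec, query), idx)
        for idx, (vec, attrs) in enumerate(records)
        if predicate(attrs)  # the predicate is an arbitrary Python callable
    )
    return [idx for _, idx in heapq.nsmallest(k, candidates)]


# Toy data (illustrative only): 2-D embeddings with two attributes each.
records = [
    ([0.0, 0.0], {"type": "image", "year": 2021}),
    ([0.1, 0.0], {"type": "text", "year": 2023}),
    ([1.0, 1.0], {"type": "image", "year": 2023}),
    ([0.2, 0.1], {"type": "image", "year": 2023}),
]

result = hybrid_search_bruteforce(
    records,
    query=[0.0, 0.0],
    predicate=lambda a: a["type"] == "image" and a["year"] >= 2022,
    k=2,
)
print(result)  # indices of the passing records, nearest first
```

Because the predicate is an opaque callable, this baseline places no restriction on the predicate set, which is the property the abstract calls predicate-agnostic. Its cost grows linearly with the dataset size, which is why an approximate index is needed at scale.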