ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data (2403.04871v1)

Published 7 Mar 2024 in cs.IR and cs.DB

Abstract: Applications increasingly leverage mixed-modality data, and must jointly search over vector data, such as embedded images, text and video, as well as structured data, such as attributes and keywords. Proposed methods for this hybrid search setting either suffer from poor performance or support a severely restricted set of search predicates (e.g., only small sets of equality predicates), making them impractical for many applications. To address this, we present ACORN, an approach for performant and predicate-agnostic hybrid search. ACORN builds on Hierarchical Navigable Small Worlds (HNSW), a state-of-the-art graph-based approximate nearest neighbor index, and can be implemented efficiently by extending existing HNSW libraries. ACORN introduces the idea of predicate subgraph traversal to emulate a theoretically ideal, but impractical, hybrid search strategy. ACORN's predicate-agnostic construction algorithm is designed to enable this effective search strategy, while supporting a wide array of predicate sets and query semantics. We systematically evaluate ACORN on both prior benchmark datasets, with simple, low-cardinality predicate sets, and complex multi-modal datasets not supported by prior methods. We show that ACORN achieves state-of-the-art performance on all datasets, outperforming prior methods with 2-1,000x higher throughput at a fixed recall.

Summary

  • The paper introduces ACORN, a predicate-agnostic hybrid search method that traverses predicate subgraphs of an HNSW-based index to jointly search vector and structured data.
  • It adapts the HNSW algorithm to achieve 2–10× higher QPS on low-cardinality predicate sets and over 30× on complex, high-cardinality predicate sets.
  • ACORN offers two variants, ACORN-γ for search performance and ACORN-1 for low construction cost, enabling versatile and scalable deployment in real-world applications.

ACORN: Advancing Hybrid Search with Predicate-Agnostic Vector and Structured Data Indexing

Introduction to Hybrid Search Challenges

Hybrid search, which entails querying over both unstructured vector data and structured attributes, is central to numerous modern applications, from e-commerce platforms to scholarly article repositories. Despite its widespread utility, hybrid search presents significant computational challenges. Existing solutions often compromise either on search performance due to inefficient handling of mixed data types or on query expressiveness by restricting the types of searchable predicates. Addressing these limitations, the paper introduces ACORN, an approach designed to efficiently perform hybrid search across vectors and structured data without constraining predicate types.
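
To make the query semantics concrete, the sketch below shows a naive baseline in the spirit of the "theoretically ideal, but impractical" strategy the abstract alludes to: filter the corpus by the structured predicate, then run exact k-nearest-neighbor search over the survivors. This is a minimal illustration; the function and attribute names are assumptions, not taken from the paper or its code.

```python
import numpy as np

def prefilter_knn(vectors, attributes, predicate, query, k=10):
    # Apply the structured predicate first, then exact k-NN over the survivors.
    # Exact and predicate-agnostic, but it scans every passing row, so it does
    # not scale; searching only predicate-passing points is the behavior ACORN
    # aims to approximate cheaply with an index.
    passing = [i for i, attr in enumerate(attributes) if predicate(attr)]
    if not passing:
        return []
    dists = np.linalg.norm(vectors[passing] - query, axis=1)  # L2 distances
    return [passing[i] for i in np.argsort(dists)[:k]]

# Usage: 10,000 random 128-d vectors, each tagged with a category attribute.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 128)).astype(np.float32)
attributes = [{"category": int(c)} for c in rng.integers(0, 50, size=10_000)]
query = rng.normal(size=128).astype(np.float32)
print(prefilter_knn(vectors, attributes, lambda a: a["category"] == 7, query, k=5))
```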

ACORN Overview

ACORN stands for ANN Constraint-Optimized Retrieval Network. It adapts the Hierarchical Navigable Small World (HNSW) indexing algorithm to support hybrid querying effectively. ACORN comes in two variants: ACORN-γ, which prioritizes search performance, and ACORN-1, which minimizes construction overhead. The primary innovation is search over predicate subgraphs, that is, the subgraphs of the index induced by the points satisfying a query's predicate. By constructing the index so that these subgraphs resemble an ideal HNSW index, ACORN closes the performance gap between traditional vector search and hybrid search.
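
The following is a minimal, single-layer sketch of how a predicate subgraph traversal could work, assuming a flat adjacency list `neighbors`, a boolean array `passes` marking which points satisfy the query predicate, and an entry point that passes. It illustrates the idea only; the authors' algorithm operates on the full multi-layer HNSW structure and handles entry-point selection and connectivity more carefully.

```python
import heapq
import numpy as np

def predicate_subgraph_search(neighbors, vectors, passes, query, entry, ef=32, k=10):
    # Greedy beam search that only ever visits nodes satisfying the predicate,
    # i.e. it walks the predicate subgraph induced by `passes`. Connectivity
    # relies on the underlying adjacency lists being denser than plain HNSW
    # (ACORN-γ) or being expanded on the fly at search time (ACORN-1).
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    if not passes[entry]:
        return []  # a real implementation would first find a passing entry point
    visited = {entry}
    frontier = [(dist(entry), entry)]   # min-heap: closest unexplored node first
    results = [(-dist(entry), entry)]   # max-heap via negation: worst kept node on top
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(results) >= ef and d > -results[0][0]:
            break                        # closest candidate is worse than the beam
        for nb in neighbors[node]:
            if nb in visited or not passes[nb]:
                continue                 # skip visited and predicate-failing nodes
            visited.add(nb)
            d_nb = dist(nb)
            heapq.heappush(frontier, (d_nb, nb))
            heapq.heappush(results, (-d_nb, nb))
            if len(results) > ef:
                heapq.heappop(results)   # keep only the ef best results
    return sorted((n for _, n in results), key=dist)[:k]
```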

Performance Benchmarks

In a comprehensive evaluation across several datasets, ACORN demonstrates strong performance:

  • LCPS Benchmarks: On low-cardinality predicate set (LCPS) benchmarks, which prior specialized indices can handle, ACORN-γ achieves 2–10× higher queries per second (QPS) at 0.9 recall than these specialized methods.
  • HCPS Benchmarks: On high-cardinality predicate set (HCPS) benchmarks, which represent more complex real-world scenarios, ACORN-γ outperforms existing baselines by over 30× in QPS at equal recall.
  • Construction Efficiency: ACORN-γ has a higher time-to-index (TTI) than HNSW, but its gains in search performance justify the trade-off. ACORN-1 achieves a TTI on par with or better than existing methods, making it a viable option for resource-constrained settings.

Technical Innovations

ACORN's efficiency rests on two ideas: traversing predicate subgraphs at search time, and constructing denser graphs whose predicate subgraphs remain navigable. A predicate-agnostic pruning strategy during construction, together with a tunable neighbor expansion factor, lets ACORN adapt across diverse datasets and query predicates.
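
As a rough illustration of the neighbor-expansion idea, the helper below filters a node's one-hop neighbors by the query predicate and, when too few survive, hops through the filtered-out neighbors to collect predicate-passing two-hop candidates. The `min_degree` threshold and the exact expansion rule are assumptions made for illustration, not the paper's construction or pruning algorithm.

```python
def expanded_neighbors(neighbors, passes, node, min_degree=8):
    # Filter the one-hop list by the query predicate; if the filtered
    # neighborhood is too sparse, hop through the filtered-out nodes and
    # collect their predicate-passing neighbors as well (two-hop expansion).
    one_hop = [nb for nb in neighbors[node] if passes[nb]]
    if len(one_hop) >= min_degree:
        return one_hop
    two_hop = {
        nb2
        for nb in neighbors[node]
        for nb2 in neighbors[nb]
        if passes[nb2] and nb2 != node
    }
    # Preserve one-hop ordering, then append any new two-hop candidates.
    return one_hop + [nb for nb in sorted(two_hop) if nb not in one_hop]
```

A traversal like the one sketched earlier would call such a helper in place of reading the raw adjacency list, which is how sparse, selective predicates can be served without rebuilding the index per predicate.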

Theoretical and Practical Implications

ACORN's design philosophy underscores a critical insight: hybrid search need not be confined by the limitations of existing data structures, nor should it compromise on query expressiveness. Practically, ACORN opens new avenues for building more robust, efficient, and versatile search functionality in applications that handle complex, mixed-modality data.

Future Directions

ACORN's strong performance across diverse datasets and query types suggests significant potential for future work. Immediate next steps include exploring ACORN's adaptability to other graph-based indices and further optimizing its construction for even larger datasets. Additionally, integrating ACORN into distributed search systems could further extend its utility and impact.

Conclusion

ACORN represents a significant step toward efficient and expressive hybrid search. Its approach to indexing and searching mixed-modality data not only sets a new performance benchmark but also broadens the range of query functionality available to modern applications.
