
The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds (2405.17813v1)

Published 28 May 2024 in cs.IR

Abstract: Vector search systems, pivotal in AI applications, often rely on the Hierarchical Navigable Small Worlds (HNSW) algorithm. However, the behaviour of HNSW under real-world scenarios using vectors generated with deep learning models remains under-explored. Existing Approximate Nearest Neighbours (ANN) benchmarks and research typically have an over-reliance on simplistic datasets like MNIST or SIFT1M and fail to reflect the complexity of current use-cases. Our investigation focuses on HNSW's efficacy across a spectrum of datasets, including synthetic vectors tailored to mimic specific intrinsic dimensionalities, widely-used retrieval benchmarks with popular embedding models, and proprietary e-commerce image data with CLIP models. We survey the most popular HNSW vector databases and collate their default parameters to provide a realistic fixed parameterisation for the duration of the paper. We discover that the recall of approximate HNSW search, in comparison to exact K Nearest Neighbours (KNN) search, is linked to the vector space's intrinsic dimensionality and significantly influenced by the data insertion sequence. Our methodology highlights how insertion order, informed by measurable properties such as the pointwise Local Intrinsic Dimensionality (LID) or known categories, can shift recall by up to 12 percentage points. We also observe that running popular benchmark datasets with HNSW instead of KNN can shift rankings by up to three positions for some models. This work underscores the need for more nuanced benchmarks and design considerations in developing robust vector search systems using approximate vector search algorithms. This study presents a number of scenarios with varying real-world applicability which aim to improve understanding of, and support the future development of, ANN algorithms and embedding models.


Summary

  • The paper demonstrates that higher intrinsic dimensionality significantly degrades HNSW recall, with recall dropping by roughly 50% as the data approaches full rank.
  • It reveals that ordering data by descending Local Intrinsic Dimensionality can boost recall by up to 12.8 percentage points compared to ascending order.
  • The study shows that evaluating retrieval benchmarks with approximate HNSW search instead of exact KNN can shift model rankings by up to three positions, urging a re-evaluation of current ANN benchmarks.

Analyzing the Performance of HNSW Vector Search Systems in Real-World Scenarios

This paper undertakes a comprehensive study of the Hierarchical Navigable Small Worlds (HNSW) algorithm's efficacy across a range of datasets, particularly focusing on vectors produced by contemporary deep learning models. The research addresses a significant gap in the existing literature, where most Approximate Nearest Neighbors (ANN) benchmarks rely heavily on simplistic datasets like MNIST or SIFT1M, which fail to capture the complexity inherent in modern AI applications.

The investigation systematically evaluates how factors such as the intrinsic dimensionality of the vector space and the sequence of data insertion affect HNSW's recall. The evaluation spans synthetic datasets, popular retrieval benchmarks with diverse embedding models, and proprietary e-commerce image data embedded with CLIP models. The methodologies and findings shed light on critical aspects of HNSW's performance and urge a reconsideration of current benchmarking practices for ANN algorithms.
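To make this kind of measurement concrete, the sketch below generates synthetic vectors confined to a lower-dimensional subspace (mapped into the ambient space via a random orthonormal basis), builds both an exact index and an HNSW index with FAISS, and reports recall@k, i.e. the fraction of the exact top-k neighbours that the approximate search also returns. This is a minimal illustration, not the paper's exact setup; the dimensions and HNSW parameters (M, efConstruction, efSearch) are assumed, illustrative values.

```python
# Minimal sketch (illustrative parameters): recall@k of FAISS HNSW vs exact KNN
# on synthetic data whose intrinsic dimensionality is controlled explicitly.
import numpy as np
import faiss

ambient_dim, intrinsic_dim = 128, 16          # assumed, illustrative values
n_base, n_query, k = 50_000, 1_000, 10

rng = np.random.default_rng(0)
# Random orthonormal basis embedding the intrinsic subspace into the ambient space.
basis, _ = np.linalg.qr(rng.standard_normal((ambient_dim, intrinsic_dim)))
base = (rng.standard_normal((n_base, intrinsic_dim)) @ basis.T).astype("float32")
queries = (rng.standard_normal((n_query, intrinsic_dim)) @ basis.T).astype("float32")

# Exact KNN ground truth.
flat = faiss.IndexFlatL2(ambient_dim)
flat.add(base)
_, exact_ids = flat.search(queries, k)

# Approximate HNSW search with typical default-style parameters.
hnsw = faiss.IndexHNSWFlat(ambient_dim, 16)   # M = 16
hnsw.hnsw.efConstruction = 200
hnsw.hnsw.efSearch = 50
hnsw.add(base)
_, approx_ids = hnsw.search(queries, k)

# recall@k: overlap between approximate and exact top-k result sets.
recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(approx_ids, exact_ids)])
print(f"recall@{k} = {recall:.3f}")
```

Sweeping intrinsic_dim towards ambient_dim while holding the HNSW parameters fixed reproduces the style of experiment the paper reports: the same index configuration yields progressively lower recall as the data approaches full rank.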

Key Findings

  1. Impact of Intrinsic Dimensionality on Recall
    • The paper reveals that the recall of approximate HNSW search, in comparison to exact KNN search, is intricately linked to the intrinsic dimensionality of the vector space. The researchers generated synthetic data with varying intrinsic dimensionalities using orthonormal basis vectors and evaluated the recall of HNSW implementations (HNSWLib and FAISS). Their findings indicate a significant degradation in recall as the intrinsic dimensionality increases.
    • Figures presented in the paper show a drop in recall of approximately 50% as the data approaches full rank, exhibiting a clear dependency on the intrinsic dimensionality.
  2. Influence of Data Insertion Sequence
    • The sequence in which data is inserted into the HNSW index significantly affects recall. This is demonstrated by experiments where data ordered by descending Local Intrinsic Dimensionality (LID) achieves a higher recall compared to ascending LID or random order.
    • The average recall for descending LID order was found to be up to 12.8 percentage points higher than for ascending LID order, indicating a potential avenue for optimizing HNSW graph construction (a sketch of pointwise LID estimation and LID-ordered insertion follows this list).
  3. Impact on Retrieval Benchmarks and Model Rankings
    • On standard retrieval benchmark datasets, embedding-model rankings change when evaluation uses approximate HNSW retrieval rather than exact KNN. This suggests that evaluations done with exact KNN may not be fully representative of those performed using approximate nearest neighbors.
    • This divergence in rankings is quantified with shifts of up to three positions on the leaderboard, emphasizing the need for benchmarks that reflect the peculiarities of approximate retrieval systems.
  4. Real-World Dataset Evaluation
    • Evaluations using real-world e-commerce datasets showed substantial variations in recall depending on the order of data insertion and the chosen model. For example, on a fashion dataset, recall differences of up to 7.7 percentage points were observed when varying the insertion sequence, even across different CLIP architectures.
    • The inclusion of practical datasets underscores the applicability of their findings beyond controlled experimental settings, suggesting that the insights gained can translate to real-world improvements.
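To make the ordering experiment concrete, the sketch below estimates a pointwise LID for every vector with the maximum-likelihood (Levina-Bickel) estimator, LID(x) = [ (1/(k-1)) Σ_{j<k} ln(r_k / r_j) ]⁻¹ over the distances r_1 ≤ … ≤ r_k to the k nearest neighbours, and then inserts the vectors into an hnswlib index in descending-LID order. It is a minimal illustration under assumed parameters (k = 20, M = 16, ef_construction = 200), not the paper's exact pipeline, and it uses a brute-force neighbour computation that only suits small collections.

```python
# Minimal sketch (assumed parameters, not the paper's pipeline): estimate pointwise
# LID and insert vectors into an HNSW index in descending-LID order.
import numpy as np
import hnswlib

def lid_mle(data: np.ndarray, k: int = 20) -> np.ndarray:
    """Levina-Bickel maximum-likelihood LID estimate per point (brute-force k-NN)."""
    sq = (data ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * data @ data.T        # squared distances
    np.fill_diagonal(d2, np.inf)                                # exclude self-distance
    knn = np.sqrt(np.maximum(np.sort(d2, axis=1)[:, :k], 0.0))  # r_1 <= ... <= r_k
    # LID(x) = ( (1/(k-1)) * sum_{j<k} ln(r_k / r_j) )^(-1)
    return 1.0 / np.log(knn[:, -1:] / knn[:, :-1]).mean(axis=1)

rng = np.random.default_rng(0)
data = rng.standard_normal((5_000, 64)).astype("float32")       # stand-in vectors

order = np.argsort(-lid_mle(data))            # descending-LID insertion order

index = hnswlib.Index(space="l2", dim=data.shape[1])
index.init_index(max_elements=len(data), M=16, ef_construction=200)
index.add_items(data[order], order)           # keep original ids for comparability
index.set_ef(50)

labels, distances = index.knn_query(data[:10], k=10)
```

Repeating the build with `np.argsort(lid_mle(data))` (ascending) or with a random permutation, and comparing recall against an exact baseline as in the earlier sketch, mirrors the ordering comparison the paper describes.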

Implications and Future Directions

The implications of this research are manifold. Practically, it suggests that the construction of HNSW indices could be optimized by considering the intrinsic properties of the data, such as intrinsic dimensionality and local neighborhood structures. Theoretically, it calls for a re-evaluation of current benchmarks and encourages the development of more nuanced evaluation methodologies that better reflect the complexities of practical applications.

The insight that model selection for HNSW-based retrieval systems requires more than strong performance on exact KNN benchmarks warrants significant attention. It suggests that models need to be evaluated in the context of their intended use-case environments, particularly when deployed in approximate retrieval systems.

Conclusion

This paper provides a robust analysis of the HNSW algorithm's performance across various datasets and scenarios, highlighting critical factors that influence recall. It advocates for refined benchmarking practices and offers actionable insights into optimizing approximate nearest neighbor search systems. Future research should expand on these findings by exploring similar properties in other approximate retrieval algorithms, aiming to enhance their robustness and performance in real-world applications.