Dimensionality-Reduction Techniques for Approximate Nearest Neighbor Search: A Survey and Evaluation (2403.13491v2)
Abstract: Approximate Nearest Neighbor Search (ANNS) on high-dimensional vectors has become a fundamental component of many machine learning tasks. With the rapid development of deep learning models and the growing use of LLMs, vector dimensionality keeps increasing to accommodate richer semantic representations. This poses a major challenge for ANNS solutions, since the cost of distance calculation grows linearly with the dimensionality of the vectors. To overcome this challenge, dimensionality-reduction techniques can be used to accelerate distance calculation during search. In this paper, we investigate six dimensionality-reduction techniques that have the potential to improve ANNS solutions, including classical algorithms such as PCA and vector quantization as well as deep-learning-based approaches. We further describe two frameworks for applying these techniques in the ANNS workflow, and we theoretically analyze their time and space costs as well as the pruning-ratio threshold at which they become beneficial. The surveyed techniques are evaluated on six public datasets. Our analysis of the results reveals the characteristics of the different families of techniques and provides insights into promising directions for future research.
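To make the idea of pruning with a cheaper, reduced-dimension distance concrete, here is a minimal Python sketch. It is not one of the six techniques or the two frameworks from the paper. It assumes the vectors have already been rotated by an orthonormal transform (as PCA would do) so that the leading coordinates carry most of the variance, and it simply truncates to the first `reduced_dims` coordinates. The function names, the toy data generator, and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(a, b, dims):
    """Squared Euclidean distance over the first `dims` coordinates."""
    return sum((a[i] - b[i]) ** 2 for i in range(dims))


def nn_search_with_pruning(data, query, reduced_dims):
    """Linear-scan nearest neighbor with a reduced-dimension filter.

    Returns (best_index, best_squared_distance, pruned_count).
    """
    full_dims = len(query)
    best_idx, best_d = -1, float("inf")
    pruned = 0
    for idx, vec in enumerate(data):
        # Cheap step: distance on the leading coordinates only. It can
        # never exceed the full distance, so it is a safe lower bound.
        approx = sq_dist(vec, query, reduced_dims)
        if approx >= best_d:
            pruned += 1  # cannot beat the current best; skip the rest
            continue
        # Expensive step: add the contribution of the remaining coordinates.
        full = approx + sum(
            (vec[i] - query[i]) ** 2 for i in range(reduced_dims, full_dims)
        )
        if full < best_d:
            best_idx, best_d = idx, full
    return best_idx, best_d, pruned


def make_toy_data(n, dims, seed=0):
    """Toy vectors whose variance decays with the coordinate index,
    mimicking data that has already been PCA-rotated (an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / (i + 1)) for i in range(dims)]
            for _ in range(n)]


data = make_toy_data(n=500, dims=32)
query = make_toy_data(n=1, dims=32, seed=1)[0]
idx, dist, pruned = nn_search_with_pruning(data, query, reduced_dims=8)
print(f"nearest={idx} sq_dist={dist:.4f} pruned={pruned}/{len(data)}")
```

The fraction `pruned / len(data)` plays the role of a pruning ratio in this toy setting: the reduced-dimension check costs extra work for every candidate that is not pruned, so the filter only pays off when enough candidates are rejected early.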