Billion-scale similarity search with GPUs

Published 28 Feb 2017 in cs.CV, cs.DB, cs.DS, and cs.IR | (1702.08734v1)

Abstract: Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy. We propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art. We apply it in different similarity search scenarios, by proposing optimized design for brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation enables the construction of a high accuracy k-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.



Citations (3,348)


Summary

The paper introduces a novel GPU-based k-selection algorithm that attains up to 55% of peak performance, boosting similarity search efficiency.
It proposes a near-optimal layout for exact and approximate k-NN searches by fusing distance computation with selection to reduce data passes.
Extensive experiments on datasets like Sift1B and DEEP1B show its scalability and speed, delivering up to 8.5x faster query times than previous methods.


The paper "Billion-scale similarity search with GPUs" by Jeff Johnson, Matthijs Douze, and Hervé Jégou from Facebook AI Research presents GPU-optimized algorithms for large-scale similarity search. It targets high-dimensional data such as image and video features, and improves on prior GPU approaches that expose too little parallelism or make poor use of the memory hierarchy.

Core Contributions

The paper's key contributions are:

GPU $k$-Selection Algorithm: A novel GPU-based $k$-selection algorithm operates at up to 55% of theoretical peak performance. This algorithm is flexible and can be fused with other kernels.
Near-Optimal Algorithmic Layout: The paper proposes a near-optimal layout for exact and approximate $k$-nearest neighbor (k-NN) search on GPUs.
Comprehensive Performance Evaluation: Extensive experiments demonstrate the algorithm’s superior performance compared to previous state-of-the-art methods on both mid-size and large-scale datasets.

Overview of Techniques

The paper starts by setting out the context and notation for similarity search, noting that the curse of dimensionality makes efficient algorithms both difficult to design and necessary. It reviews brute-force, approximate, and compressed-domain search. Among these, product quantization (PQ) is noted for its effectiveness in vector compression and retrieval, and the authors rely on it extensively.

GPU $k$-Selection Algorithm

Traditionally, heaps have been used for $k$-selection on CPUs, but they do not translate well to GPUs because heap updates are inherently serial. To address this, the authors present WarpSelect, an algorithm that keeps its intermediate state entirely in GPU registers, using techniques such as odd-size merging and sorting networks and in-register sorting. This lets the algorithm finish in a single pass over the data, avoiding the performance pitfalls of multi-pass or partitioning methods.
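For reference, the CPU-style baseline mentioned above can be sketched as follows. This is a minimal illustration of heap-based $k$-selection, not WarpSelect itself; the function name `k_select_heap` is hypothetical. It makes the serial dependency visible: every element is compared against the current worst kept value, and each heap update depends on the previous one.

```python
import heapq


def k_select_heap(values, k):
    """Return the k smallest values and their indices in a single pass.

    A bounded max-heap of size k is kept; entries are stored negated so that
    Python's min-heap exposes the current worst (largest) kept value at heap[0].
    """
    heap = []
    for i, v in enumerate(values):
        item = (-float(v), -i)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # v is smaller than the worst kept value: replace it.
            heapq.heapreplace(heap, item)
    best = sorted((-nv, -ni) for nv, ni in heap)
    return [v for v, _ in best], [i for _, i in best]


if __name__ == "__main__":
    vals, idx = k_select_heap([5.0, 1.0, 4.0, 2.0, 3.0], k=3)
    print(vals, idx)
```

Each loop iteration reads and possibly rewrites the same heap, which is the serial behaviour that a register-resident, network-based design is meant to avoid.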

Exact and Approximate $k$-NN Search

For exact k-NN search, the algorithm uses optimized GEMM routines from the cuBLAS library to compute partial distance matrices. Importantly, the $k$-selection step is fused with distance computation, which minimizes the number of passes over the data and so reduces memory traffic.
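The role of GEMM in exact search follows from the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y>, where the inner-product term for all query/database pairs is a matrix multiplication. The NumPy sketch below is a CPU illustration under that assumption, not the authors' CUDA implementation; the function name `exact_knn_tiled` and the `tile` parameter are hypothetical. It processes the database in tiles and merges each partial distance matrix into a running top-k, so the full distance matrix is never stored.

```python
import numpy as np


def exact_knn_tiled(queries, database, k, tile=256):
    """Exact k-NN under squared L2 distance, scanning the database in tiles."""
    nq = queries.shape[0]
    q_norms = (queries ** 2).sum(axis=1)[:, None]            # shape (nq, 1)
    best_d = np.full((nq, k), np.inf)
    best_i = np.full((nq, k), -1, dtype=np.int64)
    for start in range(0, database.shape[0], tile):
        block = database[start:start + tile]
        b_norms = (block ** 2).sum(axis=1)[None, :]          # shape (1, tile)
        # Partial distance matrix; the matrix product is the GEMM step.
        d = q_norms + b_norms - 2.0 * (queries @ block.T)
        ids = np.broadcast_to(np.arange(start, start + block.shape[0]), d.shape)
        # Merge this tile's candidates with the running top-k.
        cand_d = np.concatenate([best_d, d], axis=1)
        cand_i = np.concatenate([best_i, ids], axis=1)
        order = np.argsort(cand_d, axis=1, kind="stable")[:, :k]
        best_d = np.take_along_axis(cand_d, order, axis=1)
        best_i = np.take_along_axis(cand_i, order, axis=1)
    return best_d, best_i
```

Here the merge is a plain sort per tile for clarity; on a GPU the corresponding step is the $k$-selection that the paper fuses with the distance computation.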

For approximate k-NN search, the paper details a GPU implementation of the IVFADC index, which is based on product quantization. The approach precomputes lookup tables to speed up distance calculations and uses GPU shared memory to handle large-scale data efficiently.
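The lookup-table idea behind asymmetric distance computation (ADC) can be sketched in NumPy as follows. This is a simplified illustration, not the paper's implementation: it omits the inverted file (coarse quantizer) and residual encoding that IVFADC adds, and it assumes the codebooks are already given rather than trained with k-means. The names `pq_encode`, `adc_distances`, `m`, `ksub`, and `dsub` are hypothetical.

```python
import numpy as np


def pq_encode(x, codebooks):
    """Encode vectors x of shape (n, d) with codebooks of shape (m, ksub, dsub).

    Each vector is split into m sub-vectors of dsub dimensions (d = m * dsub),
    and each sub-vector is replaced by the index of its nearest centroid.
    Assumes ksub <= 256 so that a code fits in one byte.
    """
    m, ksub, dsub = codebooks.shape
    sub = x.reshape(x.shape[0], m, dsub)
    d = ((sub[:, :, None, :] - codebooks[None, :, :, :]) ** 2).sum(axis=3)
    return d.argmin(axis=2).astype(np.uint8)                 # shape (n, m)


def adc_distances(query, codes, codebooks):
    """Approximate squared L2 distances from one query to all encoded vectors."""
    m, ksub, dsub = codebooks.shape
    q_sub = query.reshape(m, dsub)
    # table[j, c]: squared distance from query sub-vector j to centroid c.
    table = ((q_sub[:, None, :] - codebooks) ** 2).sum(axis=2)   # shape (m, ksub)
    # Each encoded vector costs m table lookups and additions.
    return table[np.arange(m)[None, :], codes].sum(axis=1)
```

The table is built once per query, after which the distance to every compressed vector is a sum of `m` lookups. The result equals the exact squared distance between the query and the vector reconstructed from its codes.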

Experimental Results

The authors present detailed performance evaluations on multiple datasets, including Sift1M, Sift1B, and DEEP1B. Significant improvements are reported:

On the Sift1B dataset, the proposed method achieves a recall at 10 (R@10) of 0.376 in 17.7 microseconds per query vector, which is 8.5 times faster than the previous GPU-based state of the art.
For exact k-means clustering on the MNIST8m dataset, their implementation is more than twice as fast as BIDMach, a leading GPU-based k-means library.
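The R@10 figure above can be read with the help of a small sketch. It assumes the convention common in the product-quantization literature, where recall at R is the fraction of queries whose true nearest neighbor appears among the first R returned results; the function name `recall_at_r` and the toy arrays are hypothetical.

```python
import numpy as np


def recall_at_r(result_ids, true_nn, r):
    """Fraction of queries whose true nearest neighbor is in the first r results.

    result_ids: (nq, R_max) array of returned ids, best first.
    true_nn:    (nq,) array with the id of each query's exact nearest neighbor.
    """
    result_ids = np.asarray(result_ids)
    true_nn = np.asarray(true_nn)
    hits = (result_ids[:, :r] == true_nn[:, None]).any(axis=1)
    return float(hits.mean())


if __name__ == "__main__":
    results = [[3, 1, 2], [0, 5, 4], [7, 8, 9]]
    truth = [1, 4, 6]
    for r in (1, 2, 3):
        print(r, recall_at_r(results, truth, r))
```

Under this reading, an R@10 of 0.376 means that for about 38% of queries the exact nearest neighbor is among the ten results returned.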

The paper also explores the construction of k-NN graphs for extremely large datasets. For instance, a high-accuracy k-NN graph on 95 million images from the Yfcc100M dataset is constructed in 35 minutes. Similarly, constructing a k-NN graph connecting the 1 billion vectors of the DEEP1B dataset takes less than 12 hours on 4 Maxwell Titan X GPUs, showing that the methods scale.
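A k-NN graph links each vector to its k nearest neighbors in the same collection. The sketch below shows what such a graph contains on a toy scale. It is an exact brute-force version in NumPy, which is only feasible for small inputs and is not the method used for the billion-scale results; the function name `knn_graph` is hypothetical.

```python
import numpy as np


def knn_graph(x, k):
    """Return an (n, k) array whose row i lists the k nearest neighbors of x[i].

    Uses squared L2 distance and excludes each point from its own neighbor list.
    """
    sq = (x ** 2).sum(axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return np.argsort(d, axis=1, kind="stable")[:, :k]


if __name__ == "__main__":
    points = np.array([[0.0], [1.0], [3.0], [7.0]])
    print(knn_graph(points, k=2))
```

The all-pairs distance matrix here grows quadratically with the number of vectors, which is why graph construction at the scale reported in the paper depends on a fast search index rather than this direct approach.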

Practical and Theoretical Implications

The proposed algorithms have significant implications for both practical applications and theoretical research in the field of similarity search and database systems:

Practical Implications: The methods provide practical solutions for large-scale applications requiring high-throughput similarity searches, such as image retrieval, video search, and machine learning model training.
Theoretical Implications: The research contributes to the understanding of efficient algorithm design for massively parallel architectures such as GPUs, particularly in the context of high-dimensional data.

Future Directions

Potential future developments could include:

Further Algorithmic Optimization: Exploring additional optimizations and adaptations for newer GPU architectures to push closer to theoretical performance limits.
Distributed Similarity Search: Extending the current approach to multi-node GPU clusters to handle even larger datasets and further improve search times.
Broader Applications: Adapting the algorithms for other types of complex data beyond images and videos, like genomic data or large-scale text embeddings.

In conclusion, this paper provides a detailed and highly performant approach to similarity search on GPUs, setting a new benchmark for future research in this domain. The open-sourced implementation further facilitates reproducibility and comparison, enabling broader adoption and continued innovation in billion-scale similarity search applications.
