Fast Exact Search in Hamming Space with Multi-Index Hashing (1307.2982v3)

Published 11 Jul 2013 in cs.CV, cs.AI, cs.DS, and cs.IR

Abstract: There is growing interest in representing image data and feature descriptors using compact binary codes for fast near neighbor search. Although binary codes are motivated by their use as direct indices (addresses) into a hash table, codes longer than 32 bits are not being used as such, as it was thought to be ineffective. We introduce a rigorous way to build multiple hash tables on binary code substrings that enables exact k-nearest neighbor search in Hamming space. The approach is storage efficient and straightforward to implement. Theoretical analysis shows that the algorithm exhibits sub-linear run-time behavior for uniformly distributed codes. Empirical results show dramatic speedups over a linear scan baseline for datasets of up to one billion codes of 64, 128, or 256 bits.

Summary

The paper introduces multi-index hashing that partitions binary codes to enable sublinear exact kNN search, drastically reducing candidate examinations in Hamming space.
Empirical results show dramatic speedups, with retrieval times as low as 50ms for one billion 64-bit codes, outperforming traditional linear scan methods.
The method scales across various code lengths and offers promising applications in image retrieval and other large-scale, high-dimensional data scenarios.

An Expert Overview of "Fast Exact Search in Hamming Space with Multi-Index Hashing"

The paper "Fast Exact Search in Hamming Space with Multi-Index Hashing" by Mohammad Norouzi, Ali Punjani, and David J. Fleet addresses exact k-nearest neighbor (kNN) search over binary codes in Hamming space. Binary codes are attractive because a code can serve directly as a hash table address, yet codes longer than about 32 bits are not used this way in practice: the number of buckets within a given Hamming radius of the query grows so quickly with code length and radius that probing all of them becomes slower than a linear scan.
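To make the bucket growth concrete, the short sketch below counts how many hash-table addresses lie within a given Hamming radius of a query. The function name `hamming_ball_size` and the chosen radius of 7 are illustrative assumptions, not values taken from the paper.

```python
from math import comb

def hamming_ball_size(bits: int, radius: int) -> int:
    """Count the distinct `bits`-bit codes within Hamming distance `radius` of a query."""
    return sum(comb(bits, d) for d in range(radius + 1))

# Buckets to probe when the full 64-bit code is used as a single table address
# (radius 7 is an arbitrary illustrative choice).
full = hamming_ball_size(64, 7)

# Buckets to probe with 4 disjoint 16-bit substrings, each searched to
# radius floor(7 / 4) = 1, one table per substring.
split = 4 * hamming_ball_size(16, 1)

print(f"full-code probes: {full:,}")
print(f"substring probes: {split:,}")
```

Under these assumed settings the full-code search has to probe hundreds of millions of buckets, while the substring search probes a few dozen, at the cost of checking the retrieved candidates against the full code.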

Core Contributions

Multi-Index Hashing: The paper partitions each binary code into m disjoint substrings and builds a separate hash table on each substring. The key observation is a pigeonhole argument: if two codes differ in at most r bits, then at least one pair of corresponding substrings differs in at most ⌊r/m⌋ bits. Searching each substring table within this much smaller radius yields a candidate set that is guaranteed to contain every true r-neighbor, and the candidates are then verified against the full codes. This avoids enumerating the very large Hamming balls that a single table on the full code would require.
Sublinear Time Search: The theoretical analysis shows that, for uniformly distributed codes, query time grows sublinearly with the size of the database, in contrast to a linear scan whose cost grows in proportion to the number of codes.
Empirical Validation: Experiments support the theoretical claims, showing dramatic speedups over linear scan on datasets of up to one billion binary codes of 64, 128, and 256 bits. The gains are largest for 64-bit codes.
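The substring-search idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: codes are plain Python integers, the code length is assumed to be divisible by m, and all function names (`substrings`, `build_index`, `neighbors_within`, `range_search`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def substrings(code, bits, m):
    """Split a `bits`-bit integer code into m disjoint, equal-width substrings."""
    width = bits // m  # assumes bits is divisible by m
    mask = (1 << width) - 1
    return [(code >> (i * width)) & mask for i in range(m)]

def build_index(codes, bits, m):
    """Build one hash table per substring position: substring value -> code ids."""
    tables = [defaultdict(list) for _ in range(m)]
    for idx, code in enumerate(codes):
        for table, sub in zip(tables, substrings(code, bits, m)):
            table[sub].append(idx)
    return tables

def neighbors_within(sub, width, radius):
    """Yield every width-bit value within Hamming distance `radius` of `sub`."""
    for d in range(radius + 1):
        for positions in combinations(range(width), d):
            flipped = sub
            for p in positions:
                flipped ^= 1 << p
            yield flipped

def range_search(query, codes, tables, bits, m, r):
    """Return ids of all codes within Hamming distance r of the query (exact)."""
    width = bits // m
    candidates = set()
    for table, sub in zip(tables, substrings(query, bits, m)):
        # Pigeonhole: a true r-neighbor matches some substring within floor(r / m).
        for probe in neighbors_within(sub, width, r // m):
            candidates.update(table.get(probe, ()))
    # The candidate set is a superset; verify each candidate on the full code.
    return sorted(i for i in candidates if bin(codes[i] ^ query).count("1") <= r)

# Small usage example with made-up 32-bit codes and m = 4 substrings.
codes = [0x00000000, 0x00000003, 0xFFFF0000, 0x80000001]
tables = build_index(codes, bits=32, m=4)
print(range_search(0x00000001, codes, tables, bits=32, m=4, r=1))
```

One way to turn this range search into a kNN search is to start at r = 0 and increase r until at least k neighbors have been verified. The sketch re-probes from scratch at every radius and so omits the bookkeeping a practical implementation would need.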

Strong Numerical Results and Claims

Efficiency with Large Datasets: The paper reports speedups of several hundred times over linear scan when retrieving the 1000 nearest neighbors from a database of one billion 64-bit codes, with query times as low as 50 milliseconds.
Effectiveness Across Code Lengths: For databases of one billion 128-bit and 256-bit codes, substantial speedups are also observed, although the gain decreases as the code length increases.
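For reference, the linear scan baseline that these speedups are measured against can be sketched as follows. This is an assumed, unoptimized Python version with a hypothetical function name; a tuned baseline would typically use hardware popcount instructions over packed arrays.

```python
import heapq

def linear_scan_knn(query, codes, k):
    """Return the k closest codes as (Hamming distance, index) pairs, nearest first."""
    # XOR marks the differing bits; counting the set bits gives the Hamming distance.
    distances = ((bin(code ^ query).count("1"), i) for i, code in enumerate(codes))
    return heapq.nsmallest(k, distances)

# Usage with made-up 4-bit codes.
codes = [0b0000, 0b0001, 0b0111, 0b1111]
print(linear_scan_knn(0b0000, codes, k=2))
```

Every query touches every code, so the cost grows in proportion to the database size. That linear growth is what multi-index hashing is designed to avoid.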

Implications and Future Directions

Scalability and Applicability: Because the method scales to billion-item datasets, it is a candidate for real-world applications such as image retrieval, where fast lookup of high-dimensional descriptors is critical.
Algorithmic Modifications: The primary focus is exact search, but the algorithm could be adapted for approximate retrieval. Accepting approximate results could yield further speedups while keeping retrieval quality acceptable in practice.
Advanced Bit Assignment: The paper touches only briefly on how bits are assigned to substrings. Better assignment strategies could make individual substrings more discriminative, which would shrink candidate sets and improve efficiency.

In conclusion, the paper makes exact kNN search practical for very large databases of binary codes. By replacing one large-radius search over full codes with several small-radius searches over substrings, multi-index hashing retrieves neighbors quickly without sacrificing exactness, and it provides a foundation for further optimization of search in high-dimensional discrete spaces.