Self-Taught Hashing for Fast Similarity Search (1004.5370v1)

Published 29 Apr 2010 in cs.IR

Abstract: The ability of fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Although some recently proposed techniques are able to generate high-quality codes for documents known in advance, obtaining the codes for previously unseen documents remains to be a very challenging problem. In this paper, we emphasise this issue and propose a novel Self-Taught Hashing (STH) approach to semantic hashing: we first find the optimal $l$-bit binary codes for all documents in the given corpus via unsupervised learning, and then train $l$ classifiers via supervised learning to predict the $l$-bit code for any query document unseen before. Our experiments on three real-world text datasets show that the proposed approach using binarised Laplacian Eigenmap (LapEig) and linear Support Vector Machine (SVM) outperforms state-of-the-art techniques significantly.

Citations (385)

Summary

  • The paper introduces a two-stage learning process: unsupervised Laplacian Eigenmap learning first generates binary codes for the corpus, and linear SVMs are then trained to predict the codes of unseen documents.
  • It demonstrates superior performance over methods like Spectral Hashing through extensive experiments on datasets such as Reuters21578 and 20Newsgroups.
  • The method significantly improves precision and recall in similarity searches, paving the way for scalable applications in text categorization and multimedia retrieval.

Overview of "Self-Taught Hashing for Fast Similarity Search"

The paper introduces Self-Taught Hashing (STH), a novel approach to enable efficient similarity search through semantic hashing. The authors tackle the challenge of generating compact binary codes that preserve the semantic similarity of documents, particularly addressing the difficulty of coding previously unseen documents. The proposed method involves a two-stage learning process: the first stage employs unsupervised learning to derive optimal binary codes for a document corpus, and the second stage uses supervised learning to predict these codes for new documents.
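
For reference, the unsupervised stage solves the standard Laplacian Eigenmap problem, sketched here in conventional notation consistent with the paper's description: $W$ is the k-NN similarity matrix, $D$ its diagonal degree matrix, $L = D - W$ the graph Laplacian, and $Y \in \mathbb{R}^{n \times l}$ the embedding of the $n$ documents:

$$\min_{Y} \; \frac{1}{2} \sum_{i,j} W_{ij} \, \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2 = \operatorname{tr}\!\left(Y^{\top} L Y\right) \quad \text{s.t.} \quad Y^{\top} D Y = I, \;\; Y^{\top} D \mathbf{1} = \mathbf{0}.$$

Thresholding each of the $l$ embedding dimensions (for instance at its median, which yields balanced bits) then produces the binary codes.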

Key Contributions

  1. Unsupervised Learning for Initial Code Generation: The authors employ a binarized version of Laplacian Eigenmap (LapEig) to generate initial binary codes for a corpus through unsupervised learning. By capturing local similarity structure with a k-nearest-neighbor (k-NN) graph, the approach produces codes that are both similarity-preserving and entropy-maximizing, two properties crucial for effective hashing.
  2. Supervised Learning for Code Prediction: To handle previously unseen query documents, the paper introduces a supervised stage using linear Support Vector Machines (SVMs). Each bit of the binary code is predicted by a separate classifier trained on the pseudo-labels produced in the first stage, which makes out-of-sample extension efficient (both stages are sketched in code after this list).
  3. Performance and Evaluation: STH was evaluated extensively on three standard text datasets: Reuters21578, 20Newsgroups, and TDT2. The experiments demonstrate that STH consistently outperforms existing methods such as binarized-LSI, Laplacian Co-Hashing (LCH), and Spectral Hashing (SpH) in both accuracy and efficiency, affirming the practicality of STH in real-world scenarios where fast and reliable document retrieval is vital.
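
A minimal sketch of the two-stage pipeline is given below. `TfidfVectorizer`, `kneighbors_graph`, `eigsh`, and `LinearSVC` are scikit-learn/SciPy stand-ins, and the parameter choices (tf-idf features, k = 25 neighbors, 16-bit codes) are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np
from scipy.sparse import csgraph
from scipy.sparse.linalg import eigsh
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import kneighbors_graph
from sklearn.svm import LinearSVC

def train_sth(docs, n_bits=16, k=25):
    """Stage 1: binarized LapEig codes; Stage 2: one linear SVM per bit."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)                 # document features

    # Stage 1: symmetric k-NN graph over the corpus, then the smallest
    # non-trivial eigenvectors of its normalized graph Laplacian.
    W = kneighbors_graph(X, n_neighbors=k, metric="cosine", mode="connectivity")
    W = W.maximum(W.T)                                 # symmetrize the graph
    L = csgraph.laplacian(W, normed=True)
    vals, vecs = eigsh(L, k=n_bits + 1, which="SM")
    Y = vecs[:, np.argsort(vals)][:, 1:]               # drop the trivial eigenvector
    # Thresholding each dimension at its median gives balanced
    # (entropy-maximizing) bits.
    codes = (Y > np.median(Y, axis=0)).astype(int)

    # Stage 2: treat each bit as a pseudo-label and train a classifier for it.
    classifiers = [LinearSVC().fit(X, codes[:, b]) for b in range(n_bits)]
    return vectorizer, classifiers, codes

def hash_query(query, vectorizer, classifiers):
    """Predict the binary code of a previously unseen document."""
    x = vectorizer.transform([query])
    return np.array([clf.predict(x)[0] for clf in classifiers])
```

Because each bit is a linear classifier, hashing an unseen query costs only $l$ inner products, which is what makes the out-of-sample extension fast.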

Numerical Results

The experiments, using code lengths up to 64 bits and varying the Hamming ball radius, show that STH substantially improves precision and recall over state-of-the-art techniques. For instance, with 16-bit codes at a Hamming ball radius of 1, the improvement was statistically significant (P-value < 0.01). These results indicate the robustness of STH in indexing and retrieving semantically similar documents under practical constraints.
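
To make the retrieval setting concrete, here is a minimal sketch of hash-table lookup within a Hamming ball, the search whose radius the evaluation varies. Packing codes into integers and the helper names below are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict
from itertools import combinations

def build_index(codes):
    """Map each integer code to the list of document ids sharing it."""
    index = defaultdict(list)
    for doc_id, code in enumerate(codes):
        index[code].append(doc_id)
    return index

def hamming_ball(code, n_bits, radius):
    """Yield every integer code within the given Hamming distance of `code`."""
    yield code
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b                      # flip bit b
            yield flipped

def search(index, query_code, n_bits=16, radius=1):
    """Collect all documents whose codes lie inside the Hamming ball."""
    hits = []
    for code in hamming_ball(query_code, n_bits, radius):
        hits.extend(index.get(code, ()))
    return hits
```

The number of probes grows as $\sum_{i=0}^{r} \binom{l}{i}$, which is why small radii such as the radius-1 setting above are used in practice.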

Implications and Future Research

The implications of STH are two-fold. Practically, it provides a scalable and rapid method for similarity search in large datasets, benefiting applications like text categorization and multimedia retrieval. Theoretically, it opens avenues for integrating manifold learning with classification techniques to solve dual-stage learning problems more efficiently.

Future developments could include exploring alternative unsupervised techniques for the first stage and further enhancing supervised learning methods to accommodate even more complex and diverse data distributions. Moreover, integrating STH with distributed computing paradigms may enhance its scalability further, paving the way for its application in big data contexts.

In conclusion, the Self-Taught Hashing methodology presents a significant step forward in the generation and utilization of binary codes for similarity search, showcasing notable improvements in both effectiveness and computational efficiency. As research progresses, it holds promise for accommodating a broader range of Information Retrieval (IR) and data mining tasks.