- The paper introduces a novel triplet-based loss inspired by Lowe's matching criterion (the first-to-second nearest neighbor ratio test) to improve local descriptor learning.
- It employs efficient batch-based hard-negative mining within the L2Net architecture for robust performance in diverse computer vision tasks.
- Empirical evaluations show HardNet matching SIFT's 128-dimensional output while outperforming prior descriptors in patch verification, wide baseline stereo, and image retrieval.
Local Descriptor Learning with HardNet
The paper introduces a novel loss function for metric learning that enhances local descriptor learning in computer vision tasks, inspired by Lowe's matching criterion for SIFT. This approach is versatile, showing effectiveness across both shallow and deep convolutional network architectures. The method is applied to the L2Net architecture, resulting in the HardNet descriptor, which matches the SIFT descriptor's dimensionality (128) and achieves state-of-the-art performance in multiple domains including patch verification, wide baseline stereo, and instance retrieval.
Local feature matching remains a cornerstone of computer vision tasks such as image retrieval, panorama stitching, and 3D reconstruction. Despite advances in end-to-end learning, classical descriptors like SIFT prevail due to their robustness and ease of integration. The paper seeks to close the gap between hand-crafted and learned descriptors, since existing learned methods often underperform in practical matching pipelines. Previous works like TFeat and L2Net laid the groundwork with siamese architectures, yet neither fully exploited hard negatives during training.
Methodology
The authors propose a triplet-based loss that builds Lowe's criterion directly into the learning objective. For each matching pair in a batch, HardNet finds the closest non-matching descriptor (the hardest in-batch negative) and maximizes the margin between the matching distance and that hardest-negative distance. This contrasts with previous hard-negative mining strategies, which often select noisy or suboptimal triplets.
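The hardest-in-batch idea can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the default margin, and the unit-norm assumption on descriptors are illustrative choices.

```python
import numpy as np

def hardnet_loss(anchors, positives, margin=1.0):
    """Hardest-in-batch triplet margin loss (sketch of the paper's idea).

    anchors, positives: (N, D) arrays of L2-normalized descriptors,
    where row i of each array is a matching patch pair.
    """
    # Pairwise distances d(a_i, p_j) from one matrix product:
    # for unit vectors, ||a - p||^2 = 2 - 2 * (a . p).
    dist = np.sqrt(np.clip(2.0 - 2.0 * (anchors @ positives.T), 0.0, None))

    pos = np.diag(dist)  # matching-pair distances
    # Mask the diagonal so the hardest negative excludes the true match.
    masked = dist + np.eye(len(dist)) * 1e6
    hardest_neg = np.minimum(masked.min(axis=1),   # closest wrong p_j to a_i
                             masked.min(axis=0))   # closest wrong a_j to p_i
    # Hinge on the margin between matching and hardest-negative distances.
    return np.maximum(0.0, margin + pos - hardest_neg).mean()
```

Note that only one matrix product is needed per batch: every descriptor in the batch serves as a negative candidate for every other pair, so no separate negative-mining pass over the dataset is required.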
The architecture of HardNet follows the L2Net design, with zero-padded convolutional layers, batch normalization, ReLU activations, and dropout regularization. A key efficiency gain is that the full pairwise distance matrix for a batch is obtained from a single forward pass, which maps well onto GPU hardware and reduces memory usage and computation time compared to conventional triplet learning.
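The single-pass distance matrix rests on a standard algebraic identity. A minimal NumPy sketch is below; the function name is an illustrative choice, and a real training loop would apply the same trick to GPU tensors.

```python
import numpy as np

def pairwise_l2(A, P):
    """All N x N descriptor distances from a single matrix product.

    Uses ||a - p||^2 = ||a||^2 + ||p||^2 - 2 * (a . p), so a batch of N
    pairs needs one (N, D) x (D, N) product instead of N^2 subtractions.
    """
    sq_a = (A * A).sum(axis=1, keepdims=True)    # (N, 1) squared norms
    sq_p = (P * P).sum(axis=1, keepdims=True).T  # (1, N) squared norms
    d2 = sq_a + sq_p - 2.0 * (A @ P.T)
    return np.sqrt(np.clip(d2, 0.0, None))       # clip guards round-off
```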
Empirical Evaluation
The authors conducted rigorous testing on multiple datasets to validate HardNet's capabilities:
- Patch Descriptor Benchmarking: On the HPatches benchmark, HardNet outperformed existing methods such as L2Net and TFeat in the matching and retrieval tasks, and proved robust to both geometric and illumination changes.
- Wide Baseline Stereo: On the W1BS dataset, HardNet was effective in handling different modalities and viewpoint variations, showcasing high generalization capabilities.
- Image Retrieval: Evaluated on Oxford5k and Paris6k datasets in a BoW setup, HardNet consistently outperformed previous methods, particularly when enhancing retrieval with spatial verification and query expansion.
Implications and Future Work
The introduction of HardNet represents a significant step forward in descriptor learning, bridging the gap between hand-crafted reliability and the adaptability of learned models. Its efficient computation and superior performance across varied benchmarks suggest potential for broader adoption in practical vision systems.
Future work could explore the extension of this learning strategy to other modalities and applications where local descriptors play a critical role. Additionally, further investigation into scaling the method for larger and more diverse training datasets could yield even more robust performance outcomes.
In conclusion, the proposed method offers a compact, high-performing descriptor and establishes batch-based hard-negative mining as an effective strategy for local descriptor learning, prompting a reevaluation of metric learning practices in computer vision.