- The paper introduces a unified CNN framework that jointly detects and describes local image features, improving robustness to strong appearance variations.
- It extracts dense descriptors first and delays keypoint detection, so detections rely on higher-level image structures that match more reliably.
- Experiments on benchmarks such as the Aachen Day-Night localization dataset show state-of-the-art performance, including in challenging, weakly textured scenes.
Overview of D2-Net: A Trainable CNN for Joint Description and Detection of Local Features
The paper addresses the challenge of finding reliable pixel-level correspondences under difficult imaging conditions. The authors introduce D2-Net, which uses a single Convolutional Neural Network (CNN) to perform feature detection and description simultaneously. This contrasts with traditional pipelines that separate the two tasks, running an early detection stage before describing patches around the detected keypoints.
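To make the describe-and-detect idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: a single backbone forward pass produces a dense feature map whose per-location channel vectors serve as descriptors, and detection scores are derived from the same map as a soft local max combined with a ratio to the channel-wise maximum, following the paper's formulation. The VGG16 backbone truncated at conv4_3 follows the paper, but the exact torchvision layer index is an assumption.

```python
import torch
import torch.nn.functional as F
import torchvision

# VGG16 truncated after relu4_3 (indices 0-22 of torchvision's feature stack).
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:23].eval()

@torch.no_grad()
def describe_and_detect(image: torch.Tensor):
    """image: (1, 3, H, W) ImageNet-normalized RGB.
    Returns dense descriptors (1, C, h, w) and a detection-score map (1, h, w)."""
    fmap = backbone(image)                   # one pass yields the shared feature map
    desc = F.normalize(fmap, dim=1)          # per-location L2-normalized descriptors

    # Soft local max over each 3x3 spatial neighborhood (alpha in the paper);
    # scaling by the global max keeps exp() in a safe range (a numerical trick only).
    exp = torch.exp(fmap / fmap.max())
    alpha = exp / (F.avg_pool2d(exp, 3, stride=1, padding=1) * 9.0)

    # Ratio to the channel-wise maximum at each location (beta in the paper).
    beta = fmap / (fmap.max(dim=1, keepdim=True).values + 1e-8)

    # Per-location score: best channel, normalized over the whole image.
    gamma = (alpha * beta).max(dim=1).values             # (1, h, w)
    score = gamma / gamma.sum(dim=(1, 2), keepdim=True)
    return desc, score
```

Keypoints would then be obtained by selecting local maxima of the score map and reading off the descriptors at those same locations, so detection and description share one computation.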
Core Contributions
- Unified Architecture: D2-Net combines detection and description in a single network. By postponing keypoint detection until after dense descriptors are computed, the detector operates on higher-level, more stable image structures, which improves robustness to significant appearance changes such as day-to-night transitions.
- Dense Correspondence Extraction: The network extracts dense descriptors across the entire image and derives keypoints from the same feature maps, exploiting their rich, multi-layer information. This sidesteps a weakness of traditional detectors, whose low-level image cues tend to be unstable under challenging conditions.
- Training with Large-Scale Data: The model is trained on pixel correspondences sourced from large-scale Structure-from-Motion (SfM) reconstructions, eliminating the need for manual annotation; a hedged sketch of the training objective follows this list. The ability to leverage readily available large datasets is a practical advantage.
- State-of-the-Art Performance: The method outperforms prior work on the Aachen Day-Night localization benchmark and is competitive on other benchmarks, confirming its effectiveness in real-world tasks such as image matching and 3D reconstruction.
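As referenced in the training bullet above, here is a hedged sketch of the joint training objective, assuming descriptors and detection scores have already been sampled at N SfM-verified correspondences between two images. It follows the paper's idea of weighting a triplet margin term by the product of the two images' detection scores; the hard-negative mining here is simplified to the other sampled correspondences, whereas the paper mines negatives over the full feature maps outside a safety radius around each match.

```python
import torch

def d2_loss(desc1, desc2, score1, score2, margin=1.0):
    """desc1, desc2: (N, C) L2-normalized descriptors at N correspondences.
    score1, score2: (N,) soft detection scores at the same locations."""
    # Distance between each matching descriptor pair (the positives).
    pos = (desc1 - desc2).norm(dim=1)                     # (N,)

    # Simplified hard negatives: for each correspondence, the closest
    # non-matching descriptor among the other sampled correspondences.
    dists = torch.cdist(desc1, desc2)                     # (N, N)
    dists = dists + 1e5 * torch.eye(len(desc1), device=desc1.device)
    neg = torch.minimum(dists.min(dim=1).values, dists.min(dim=0).values)

    # Triplet margin term, using squared distances as in the paper.
    m = torch.clamp(margin + pos.pow(2) - neg.pow(2), min=0.0)

    # Weight each term by the product of detection scores, so the network
    # learns to detect exactly where descriptors are distinctive.
    w = score1 * score2
    return (w * m).sum() / w.sum()
```

Minimizing this loss lowers descriptor distances for true matches while raising detection scores where descriptors are distinctive, which is how detection and description end up being trained jointly from SfM correspondences alone.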
Numerical Results and Claims
D2-Net excels under extreme appearance variations, outperforming existing methods in both day-night image matching and visual localization. It also remains effective in weakly textured environments, setting a new reference for sparse feature detection and description under such conditions.
Implications and Future Directions
Integrating detection and description within a single CNN is a significant step forward and suggests a shift in how local features are extracted. Beyond visual localization and 3D reconstruction, the approach could benefit related domains such as robotic vision and augmented reality.
Future research could explore improving keypoint localization accuracy without sacrificing the robustness gained from high-level feature aggregation. Reducing the computational and memory cost of dense descriptor extraction would also widen D2-Net's applicability to larger-scale settings.
In conclusion, the paper contributes a methodologically sound and practically effective feature extraction framework that could influence future developments in visual recognition and image analysis. The promising results under challenging conditions highlight the potential for further innovations in robust visual feature engineering.