Pixel-Perfect Structure-from-Motion with Featuremetric Refinement

Published 18 Aug 2021 in cs.CV | (2108.08291v1)

Abstract: Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction. The classical image matching paradigm detects keypoints per-image once and for all, which can yield poorly-localized features and propagate large errors to the final geometry. In this paper, we refine two key steps of structure-from-motion by a direct alignment of low-level image information from multiple views: we first adjust the initial keypoint locations prior to any geometric estimation, and subsequently refine points and camera poses as a post-processing. This refinement is robust to large detection noise and appearance changes, as it optimizes a featuremetric error based on dense features predicted by a neural network. This significantly improves the accuracy of camera poses and scene geometry for a wide range of keypoint detectors, challenging viewing conditions, and off-the-shelf deep features. Our system easily scales to large image collections, enabling pixel-perfect crowd-sourced localization at scale. Our code is publicly available at https://github.com/cvg/pixel-perfect-sfm as an add-on to the popular SfM software COLMAP.




Summary

The paper presents featuremetric refinement that adjusts keypoints and camera poses to significantly improve the accuracy of sparse 3D reconstructions.
It uses dense features predicted by a neural network to align image information across multiple views, reducing the geometric errors that poorly localized keypoints cause in traditional pipelines.
Experimental results demonstrate marked improvements in localization precision and triangulation, especially in scenarios with sparse observations.

Overview

The paper presents an approach for improving the accuracy of sparse 3D reconstructions in Structure-from-Motion (SfM) through featuremetric refinement. The authors refine keypoints, 3D points, and camera poses by minimizing a featuremetric error computed on dense features predicted by a neural network. This improves the precision of camera poses and scene geometry across a range of keypoint detectors and challenging viewing conditions.

Technical Summary

Traditional SfM methods detect keypoints in each image once and then match them across views. This can yield poorly localized features whose errors propagate to the final geometry. The approach described here adjusts keypoint locations before any geometric estimation, then refines 3D points and camera poses in a post-processing stage, in both cases by optimizing a featuremetric error.

The method uses dense features extracted by pre-trained convolutional neural networks to align image information across multiple viewpoints. Unlike the purely geometric optimization of traditional SfM, the featuremetric optimization exploits local image detail and is robust to appearance changes.
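To make the idea of a featuremetric error concrete, here is a minimal sketch in Python with NumPy. It assumes a dense feature map stored as an (H, W, D) array and compares the features sampled at two subpixel locations. The names `sample_features` and `featuremetric_residual` are hypothetical and not taken from the paper's code; bilinear interpolation is one plausible sampling choice, not necessarily the one the authors use.

```python
import numpy as np

def sample_features(fmap, xy):
    """Bilinearly interpolate a dense feature map of shape (H, W, D) at subpixel (x, y)."""
    h, w, _ = fmap.shape
    # Clamp the query to the valid image area.
    x = float(np.clip(xy[0], 0.0, w - 1.0))
    y = float(np.clip(xy[1], 0.0, h - 1.0))
    # Top-left corner of the interpolation cell (kept inside the map).
    x0 = min(int(np.floor(x)), w - 2)
    y0 = min(int(np.floor(y)), h - 2)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * fmap[y0, x0] + ax * fmap[y0, x0 + 1]
    bottom = (1 - ax) * fmap[y0 + 1, x0] + ax * fmap[y0 + 1, x0 + 1]
    return (1 - ay) * top + ay * bottom

def featuremetric_residual(fmap_a, xy_a, fmap_b, xy_b):
    """Difference between the features observed at two locations (hypothetical helper)."""
    return sample_features(fmap_a, xy_a) - sample_features(fmap_b, xy_b)
```

A geometric reprojection error only measures pixel distances; a residual of this kind instead measures how well the image content agrees at the two locations, which is what lets the optimization move a keypoint toward a better-supported position.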

Key aspects include:

Featuremetric Keypoint Adjustment (KA): Before any geometric estimation, the method corrects 2D keypoint locations by directly aligning dense features across matched views rather than relying on local geometric constraints.
Featuremetric Bundle Adjustment (BA): After SfM, 3D points and camera poses are further refined with a featuremetric cost, which gains accuracy from the rich local information contained in dense features.
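The two steps above can be sketched on a toy problem. This is a simplified illustration under stated assumptions, not the authors' implementation: it refines one keypoint against a single fixed reference descriptor with damped Gauss-Newton and finite-difference Jacobians, whereas the paper describes alignment across multiple views. The helper names (`bilinear`, `refine_keypoint`, `ba_residual`), the pinhole camera model, and all parameter values are assumptions made for the example.

```python
import numpy as np

def bilinear(fmap, p):
    """Bilinearly interpolate an (H, W, D) feature map at subpixel p = (x, y)."""
    h, w, _ = fmap.shape
    x = float(np.clip(p[0], 0.0, w - 1.0))
    y = float(np.clip(p[1], 0.0, h - 1.0))
    x0 = min(int(x), w - 2)
    y0 = min(int(y), h - 2)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * fmap[y0, x0] + ax * fmap[y0, x0 + 1]
    bot = (1 - ax) * fmap[y0 + 1, x0] + ax * fmap[y0 + 1, x0 + 1]
    return (1 - ay) * top + ay * bot

def refine_keypoint(fmap, p_init, f_ref, iters=20, eps=0.5, damping=1e-6):
    """KA-style step (simplified): move a 2D keypoint so its features match f_ref."""
    p = np.asarray(p_init, dtype=float).copy()
    dx, dy = np.array([eps, 0.0]), np.array([0.0, eps])
    for _ in range(iters):
        r = bilinear(fmap, p) - f_ref
        # Jacobian of the features w.r.t. (x, y), by central differences: shape (D, 2).
        J = np.stack([
            (bilinear(fmap, p + dx) - bilinear(fmap, p - dx)) / (2 * eps),
            (bilinear(fmap, p + dy) - bilinear(fmap, p - dy)) / (2 * eps),
        ], axis=1)
        # Damped Gauss-Newton update.
        p = p + np.linalg.solve(J.T @ J + damping * np.eye(2), -J.T @ r)
    return p

def ba_residual(fmap, K, R, t, X, f_ref):
    """BA-style residual (simplified): features at the projection of 3D point X minus f_ref."""
    proj = K @ (R @ X + t)          # assumed pinhole projection
    uv = proj[:2] / proj[2]
    return bilinear(fmap, uv) - f_ref
```

In a full bundle adjustment, residuals like `ba_residual` would be stacked over all observations and minimized jointly over the 3D points and camera poses with a nonlinear least-squares solver; the sketch only shows how a single residual is formed.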

Experimental Results

Experiments show improved accuracy and completeness of 3D reconstructions and camera poses when the proposed refinements are applied across multiple configurations, including hand-crafted and learned features such as SIFT, SuperPoint, D2-Net, and R2D2. The results show substantial improvements in camera localization precision and triangulation accuracy, particularly with sparse observations or significant appearance change.

The approach was notably effective in scenarios with lower observation numbers, where traditional geometric methods struggle to maintain accuracy.
Compared to previous refinement methods such as Patch Flow, the approach showed marked improvements, especially under strict error thresholds (e.g., 1 cm).
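Threshold-based results of this kind are typically reported as the fraction of errors that fall below each threshold. The following is a minimal sketch of that computation; the function name `accuracy_at_thresholds`, the default thresholds, and the sample errors are illustrative assumptions, not the paper's evaluation protocol or data.

```python
import numpy as np

def accuracy_at_thresholds(errors_m, thresholds_m=(0.01, 0.02, 0.05)):
    """Fraction of errors (in meters) at or below each threshold (in meters)."""
    errors = np.asarray(errors_m, dtype=float)
    return {t: float(np.mean(errors <= t)) for t in thresholds_m}

# Made-up errors in meters, for illustration only.
example = accuracy_at_thresholds([0.005, 0.015, 0.03, 0.2])
```

Stricter thresholds such as 1 cm are more sensitive to small localization errors, which is why gains from subpixel refinement tend to show up most clearly there.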

Implications and Future Directions

The research is relevant to augmented reality, robotics, and other applications of computer vision. By improving the precision of SfM reconstructions and visual localization, the method could benefit applications that depend on accurate spatial understanding.

Future developments could explore optimization of dense feature extraction for better computational performance, scalability, and handling larger-scale scene reconstruction. Developing tailored CNN models that further specialize in capturing context-relevant dense features efficiently would be a logical next step, potentially allowing adaptation to real-time or resource-constrained environments.

Releasing the code as an extension to COLMAP and other localization tools gives the community a practical means of scalable, precise localization in varied and challenging scenarios. It also provides a foundation for advancing benchmarks in accurate 3D mapping and pose estimation.
