RelPose++: Recovering 6D Poses from Sparse-view Observations (2305.04926v2)

Published 8 May 2023 in cs.CV

Abstract: We address the task of estimating 6D camera poses from sparse-view image sets (2-8 images). This task is a vital pre-processing stage for nearly all contemporary (neural) reconstruction algorithms but remains challenging given sparse views, especially for objects with visual symmetries and texture-less surfaces. We build on the recent RelPose framework which learns a network that infers distributions over relative rotations over image pairs. We extend this approach in two key ways; first, we use attentional transformer layers to process multiple images jointly, since additional views of an object may resolve ambiguous symmetries in any given image pair (such as the handle of a mug that becomes visible in a third view). Second, we augment this network to also report camera translations by defining an appropriate coordinate system that decouples the ambiguity in rotation estimation from translation prediction. Our final system results in large improvements in 6D pose prediction over prior art on both seen and unseen object categories and also enables pose estimation and 3D reconstruction for in-the-wild objects.

References (65)
  1. Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations. In CVPR, 2021.
  2. NeMo: Neural Mesh Models of Contrastive Features for Robust 3D Pose Estimation. In ICLR, 2021.
  3. SURF: Speeded Up Robust Features. In ECCV, 2006.
  4. Extreme Rotation Estimation using Dense Correlation Volumes. In CVPR, 2021.
  5. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM. T-RO, 2021.
  6. ShapeNet: An Information-Rich 3D Model Repository. arXiv preprint arXiv:1512.03012, 2015.
  7. Wide-Baseline Relative Camera Pose Estimation with Directional Learning. In CVPR, 2021.
  8. Universal Correspondence Network. NeurIPS, 2016.
  9. MonoSLAM: Real-time Single Camera SLAM. TPAMI, 2007.
  10. SuperPoint: Self-supervised Interest Point Detection and Description. In CVPR-W, 2018.
  11. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 1981.
  12. Deep Orientation Uncertainty Learning Based on a Bingham Loss. In ICLR, 2019.
  13. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
  14. Rotation Averaging. IJCV, 2013.
  15. Deep Residual Learning for Image Recognition. In CVPR, 2016.
  16. Few-View Object Reconstruction with Unknown Categories and Camera Poses. arXiv preprint arXiv:2212.04492, 2022.
  17. End-to-end Recovery of Human Shape and Pose. In CVPR, 2018.
  18. Learning 3D Human Dynamics from Video. In CVPR, 2019.
  19. VIBE: Video Inference for Human Body Pose and Shape Estimation. In CVPR, 2020.
  20. BARF: Bundle-Adjusting Neural Radiance Fields. In ICCV, 2021.
  21. SIFT Flow: Dense Correspondence Across Scenes and Its Applications. TPAMI, 2010.
  22. H Christopher Longuet-Higgins. A Computer Algorithm for Reconstructing a Scene from Two Projections. Nature, 1981.
  23. David G Lowe. Distinctive Image Features from Scale-invariant Keypoints. IJCV, 2004.
  24. An Iterative Image Registration Technique with an Application to Stereo Vision. In IJCAI, 1981.
  25. MediaPipe: A Framework for Building Perception Pipelines. arXiv:1906.08172, 2019.
  26. Virtual Correspondence: Humans as a Cue for Extreme-View Geometry. In CVPR, 2022.
  27. Explaining the Ambiguity of Object Detection and 6D Pose from Visual Data. In ICCV, 2019.
  28. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. TOG, 2017.
  29. Relative Camera Pose Estimation Using Convolutional Neural Networks. In ACIVS, 2017.
  30. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.
  31. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. T-RO, 2017.
  32. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. T-RO, 2015.
  33. Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold. In ICML, 2021.
  34. PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6 DoF Tracking. In 3DV, 2022.
  35. David Nistér. An Efficient Solution to the Five-point Relative Pose Problem. TPAMI, 2004.
  36. Learning 3D Object Categories by Looking Around Them. In ICCV, 2017.
  37. Learning Orientation Distributions for Object Pose Estimation. In IROS, 2020.
  38. ZePHyR: Zero-shot Pose Hypothesis Scoring. In ICRA, 2021.
  39. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. In ICCV, 2021.
  40. The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs. In 3DV, 2022.
  41. From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In CVPR, 2019.
  42. SuperGlue: Learning Feature Matching with Graph Neural Networks. In CVPR, 2020.
  43. Structure-from-Motion Revisited. In CVPR, 2016.
  44. Pixelwise View Selection for Unstructured Multi-View Stereo. In ECCV, 2016.
  45. SparsePose: Sparse-View Camera Pose Regression and Refinement. In CVPR, 2023.
  46. A Benchmark for the Evaluation of RGB-D SLAM Systems. In IROS, 2012.
  47. Canonical Capsules: Self-supervised Capsules in Canonical Pose. In NeurIPS, 2021.
  48. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. NeurIPS, 2021.
  49. Bundle Adjustment—A Modern Synthesis. In International workshop on vision algorithms, 1999.
  50. Shinji Umeyama. Least-squares Estimation of Transformation Parameters Between Two Point Patterns. TPAMI, 1991.
  51. MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision. In CVPR, 2022.
  52. Attention is All You Need. NeurIPS, 2017.
  53. PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment. In ICCV, 2023.
  54. DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In ICRA, 2017.
  55. SegICP: Integrated Deep Semantic Segmentation and Pose Estimation. In IROS, 2017.
  56. Pose from Shape: Deep Pose Estimation for Arbitrary 3D Objects. In BMVC, 2019.
  57. PoseContrast: Class-Agnostic Object Viewpoint Estimation in the Wild with Pose-Aware Contrastive Learning. In 3DV, 2021.
  58. D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry. In CVPR, 2020.
  59. pixelNeRF: Neural Radiance Fields from One or Few Images. In CVPR, 2021.
  60. Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild. In ECCV, 2020.
  61. NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild. In NeurIPS, 2021.
  62. RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild. In ECCV, 2022.
  63. Richard Zhang. Making Convolutional Networks Shift-Invariant Again. In ICML, 2019.
  64. Stereo Magnification: Learning View Synthesis using Multiplane Images. SIGGRAPH, 2018.
  65. SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction. In CVPR, 2023.

Summary

  • The paper introduces a transformer-based multi-view framework that accurately estimates 6D camera poses from as few as 2 images.
  • The method decouples rotation and translation tasks using a novel world coordinate system, achieving a 10% improvement in rotation accuracy and enhanced translation predictions.
  • Evaluations demonstrate significant performance gains over methods like COLMAP, enabling high-fidelity sparse-view 3D reconstructions in real-world scenarios.

RelPose++: Sparse-View 6D Pose Estimation

Introduction

In this exploration, we analyze the capabilities of RelPose++, a robust framework for recovering 6D camera poses from sparse sets of 2 to 8 images. RelPose++ builds on the recent RelPose framework, addressing its limitations by using transformer layers to incorporate multi-view cues and by extending the network to also predict camera translations. The paper demonstrates significant improvements in pose accuracy, benefiting downstream applications such as 3D reconstruction.

Methodology

Multi-View Rotation and Translation

RelPose++ extends the RelPose framework by introducing a transformer-based module that processes multiple images simultaneously. This multi-view integration allows the system to resolve rotational ambiguities inherent in image pairs, notably improving estimation accuracy for objects with symmetric features, such as a mug whose handle is obscured in some views (Figure 1).

Figure 1: Overview of RelPose++. We present RelPose++, a method for sparse-view camera pose estimation. RelPose++ starts by extracting global image features using a ResNet-50.
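To make the joint processing concrete, here is a minimal PyTorch sketch of the idea described above (not the authors' implementation; the module structure, feature dimension, and layer counts are assumptions): per-image global features from a ResNet-50 are projected to tokens and passed through a shared transformer encoder, so that each view's representation can attend to every other view.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiViewEncoder(nn.Module):
    """Minimal sketch of joint multi-view feature processing (illustrative only)."""
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Drop the classification head; keep global-average-pooled features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.proj = nn.Linear(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, images):                                    # images: (B, N, 3, H, W)
        B, N = images.shape[:2]
        feats = self.backbone(images.flatten(0, 1)).flatten(1)    # (B*N, 2048)
        tokens = self.proj(feats).view(B, N, -1)                  # (B, N, d_model)
        # Self-attention lets additional views resolve symmetries that are
        # ambiguous in any single image pair (e.g., a mug's hidden handle).
        return self.transformer(tokens)                           # (B, N, d_model)

# Example: fuse features from five images of the same object.
# fused = MultiViewEncoder()(torch.randn(1, 5, 3, 224, 224))
```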

The framework scores relative rotations with an energy-based model similar to RelPose, then recovers a consistent set of global rotations by chaining the most confident pairwise estimates along a maximum spanning tree and refining them with coordinate ascent, as sketched below.
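The global-rotation assembly can be sketched as follows, under stated assumptions: `score(i, j, R)` is a hypothetical stand-in for the learned energy of a candidate relative rotation between views i and j, and `candidates` is a set of sampled rotation hypotheses. The sketch initializes global rotations along a maximum spanning tree of pairwise confidences, then performs a few rounds of coordinate ascent.

```python
import numpy as np
import networkx as nx
from scipy.spatial.transform import Rotation

def assemble_global_rotations(n_views, score, candidates, n_rounds=3):
    """Sketch of MST initialization + coordinate ascent (not the authors' code).

    score(i, j, R) -> float : assumed energy of relative rotation R between views i, j
    candidates              : list of 3x3 rotation hypotheses to search over
    """
    # Best pairwise relative rotation and its score for every image pair.
    best_R, G = {}, nx.Graph()
    for i in range(n_views):
        for j in range(i + 1, n_views):
            scores = [score(i, j, R) for R in candidates]
            k = int(np.argmax(scores))
            best_R[(i, j)] = candidates[k]
            G.add_edge(i, j, weight=scores[k])

    # Chain the most confident pairwise estimates along a maximum spanning tree.
    R_global = {0: np.eye(3)}
    for i, j in nx.bfs_edges(nx.maximum_spanning_tree(G), 0):
        R_ij = best_R[(i, j)] if i < j else best_R[(j, i)].T
        R_global[j] = R_ij @ R_global[i]             # convention: R_j = R_ij @ R_i

    # Coordinate ascent: re-pick each camera's rotation to maximize pairwise scores.
    for _ in range(n_rounds):
        for j in range(1, n_views):
            def total(R_j):
                return sum(score(i, j, R_j @ R_global[i].T)
                           for i in range(n_views) if i != j)
            R_global[j] = max(candidates, key=total)
    return R_global

# Example candidate set: candidates = [Rotation.random().as_matrix() for _ in range(500)]
```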

Translation Prediction

A central extension in RelPose++ is its capacity to predict camera translations. It defines a world coordinate system centered at the (approximate) intersection of the cameras' optical axes, a choice that decouples translation prediction from rotation estimation (Figure 2).

Figure 2: Coordinate Systems for Estimating Camera Translation. This helps decouple the task of predicting camera translations from rotations.

This approach circumvents the limitations of using the first camera as the frame origin, thereby stabilizing predictions even in cases of symmetric ambiguities.
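The look-at center can be computed in closed form as the least-squares intersection of the cameras' optical axes. The numpy sketch below is one plausible construction (an illustration, not necessarily the paper's exact formulation): each camera contributes a ray from its center along its viewing direction, and the origin is the point minimizing the summed squared distance to all rays.

```python
import numpy as np

def optical_axis_intersection(centers, directions):
    """Least-squares 'intersection' of camera optical axes (minimal sketch).

    centers    : (N, 3) camera centers in world coordinates
    directions : (N, 3) viewing directions (optical axes)
    Returns the 3D point minimizing summed squared distance to every axis.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)   # projector onto the plane orthogonal to d
        A += P
        b += P @ c
    return np.linalg.solve(A, b)

# Centering the world frame at this point (roughly the object center) means an
# ambiguous rotation for one camera no longer corrupts its predicted translation.
```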

Evaluation

Quantitative Results

RelPose++ demonstrates significant improvements in 6D pose prediction over alternatives such as COLMAP and PoseDiffusion, particularly in scenarios with object symmetry or little texture (Table 1). It consistently outperforms these baselines on both rotation and translation metrics, across seen and unseen object categories.

  • Rotation Accuracy: The method improves accuracy at a 15° error threshold by roughly 10% over prior art, including on unseen object categories.
  • Translation Accuracy: Using the look-at-centered coordinate system, RelPose++ predicts camera translations that are measurably more accurate; predicted and ground-truth camera centers are compared after applying an optimal similarity transform (see the sketch after this list).
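The following numpy sketch shows how such metrics are commonly computed (thresholds and alignment details here are assumptions rather than the paper's exact evaluation code): relative-rotation error is the geodesic angle between predicted and ground-truth relative rotations, reported as accuracy within 15°, and predicted camera centers are compared to ground truth after a best-fit similarity (Umeyama) alignment.

```python
import numpy as np

def rotation_accuracy(R_pred, R_gt, thresh_deg=15.0):
    """Fraction of image pairs whose relative-rotation error is below thresh_deg."""
    errs = []
    n = len(R_pred)
    for i in range(n):
        for j in range(i + 1, n):
            rel_pred = R_pred[j] @ R_pred[i].T
            rel_gt = R_gt[j] @ R_gt[i].T
            cos = (np.trace(rel_pred @ rel_gt.T) - 1.0) / 2.0
            errs.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return float(np.mean(np.array(errs) <= thresh_deg))

def umeyama_alignment(src, dst):
    """Best-fit similarity (s, R, t) with dst_i ≈ s * R @ src_i + t (Umeyama, 1991)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)                 # 3x3 cross-covariance
    U, d, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                               # handle reflections
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(d) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

# Translation error: align predicted camera centers to ground truth, then compare.
# s, R, t = umeyama_alignment(pred_centers, gt_centers)
# err = np.linalg.norm((s * (R @ pred_centers.T).T + t) - gt_centers, axis=1)
```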

Qualitative Analysis

Qualitative results indicate that RelPose++ generalizes to real-world, in-the-wild captures, such as self-captured images (Figure 3). Its ability to initialize high-fidelity sparse-view 3D reconstructions further underscores its utility in practical applications (Figure 4).

Figure 3: Recovered Camera Poses from In-the-Wild Images.


Figure 4: Sparse-view 3D Reconstruction using NeRS.

Discussion

RelPose++ provides a robust mechanism for recovering sparse-view camera poses, showing strong generalization and accuracy. While the method currently targets offline processing, further refinements could open the door to real-time use. Integrating it into existing 3D reconstruction pipelines could drive improvements in applications that demand precise spatial awareness.

Conclusion

The advancements in RelPose++ mark a substantial step toward general sparse-view pose estimation, introducing a method that effectively separates rotation and translation estimation for improved accuracy. Future avenues include deploying these strategies in dynamic environments and fusing them with real-time systems, with potential impact across robotics, AR/VR, and beyond.
