6D Object Pose Estimation from Approximate 3D Models for Orbital Robotics
(2303.13241v4)
Published 23 Mar 2023 in cs.CV and cs.RO
Abstract: We present a novel technique to estimate the 6D pose of objects from single images where the 3D geometry of the object is only given approximately and not as a precise 3D model. To achieve this, we employ a dense 2D-to-3D correspondence predictor that regresses 3D model coordinates for every pixel. In addition to the 3D coordinates, our model also estimates the pixel-wise coordinate error to discard correspondences that are likely wrong. This allows us to generate multiple 6D pose hypotheses of the object, which we then refine iteratively using a highly efficient region-based approach. We also introduce a novel pixel-wise posterior formulation by which we can estimate the probability for each hypothesis and select the most likely one. As we show in experiments, our approach is capable of dealing with extreme visual conditions including overexposure, high contrast, or low signal-to-noise ratio. This makes it a powerful technique for the particularly challenging task of estimating the pose of tumbling satellites for in-orbit robotic applications. Our method achieves state-of-the-art performance on the SPEED+ dataset and has won the SPEC2021 post-mortem competition.
The paper introduces EagerNet, a deep learning framework that predicts dense 2D-3D correspondences and per-pixel error estimates for robust 6D pose estimation.
It employs an asymmetric encoder-decoder architecture with probabilistic region-based refinement to counteract inaccuracies in approximate 3D models.
The method bridges the sim2real gap in orbital robotics, achieving high accuracy in on-orbit tasks like satellite servicing and debris removal.
This paper (6D Object Pose Estimation from Approximate 3D Models for Orbital Robotics, 2023) presents EagerNet, a novel approach for 6D object pose estimation from single images, specifically designed for challenging orbital robotics scenarios where only approximate 3D models of the target object are available. The core idea is to use a deep learning model to predict dense 2D-to-3D correspondences, along with a pixel-wise estimate of the error in these predictions, enabling robustness to both difficult visual conditions and inaccuracies in the 3D model used for training.
The proposed framework operates in several steps:
Object Detection: A 2D object detector first identifies the object in the input image and provides a Region of Interest (RoI).
Feature Prediction (EagerNet): The RoI is fed into EagerNet, an asymmetric encoder-decoder Convolutional Neural Network (CNN). EagerNet predicts four key pixel-wise maps (a structural sketch follows this list):
Normalized 3D model coordinates ($\widehat{\bm{I}}_{\tilde{q}}$): Regresses the 3D coordinates on the object surface for each pixel.
Coordinate prediction error ($\widehat{\bm{I}}_e$): Estimates the expected L1 error of the predicted 3D coordinates at each pixel. This is a crucial innovation for handling model inaccuracies and harsh visual effects.
Foreground confidence ($\widehat{\bm{I}}_o$): Predicts whether a pixel belongs to the object mask.
Surface region labels ($\widehat{\bm{I}}_r$): Segments the object into distinct surface regions to help handle symmetries.
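A structural sketch of these outputs (assuming PyTorch and the torchvision ConvNeXt-Base encoder; the decoder layout and `n_regions` are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torchvision

class EagerNetHeads(nn.Module):
    """Illustrative asymmetric encoder-decoder with the four pixel-wise outputs."""
    def __init__(self, n_regions: int = 8):
        super().__init__()
        # Encoder: ConvNeXt-Base feature extractor (ImageNet pre-trained).
        self.encoder = torchvision.models.convnext_base(weights="DEFAULT").features
        # Lightweight decoder: upsample encoder features (stride 32) back to input resolution.
        self.decoder = nn.Sequential(
            nn.Conv2d(1024, 256, 3, padding=1), nn.GELU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 64, 3, padding=1), nn.GELU(),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
        )
        # Four heads: 3D coords (3), coord error (1), foreground (1), regions (n_regions).
        self.head = nn.Conv2d(64, 3 + 1 + 1 + n_regions, 1)

    def forward(self, x):
        out = self.head(self.decoder(self.encoder(x)))
        coords  = torch.sigmoid(out[:, 0:3])  # normalized model coordinates
        err     = torch.relu(out[:, 3:4])     # expected L1 coordinate error
        fg      = torch.sigmoid(out[:, 4:5])  # foreground confidence
        regions = out[:, 5:]                  # surface-region logits
        return coords, err, fg, regions
```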
Multi-Hypothesis Generation: Using the predicted 2D-3D correspondences ($\widehat{\bm{I}}_{\tilde{q}}$) and the estimated error map ($\widehat{\bm{I}}_e$), multiple pose hypotheses are generated. This is achieved by selecting correspondences below different error thresholds ($\epsilon$) and solving for the 6D pose using the Perspective-n-Point (PnP) algorithm, as sketched below. Using different thresholds allows the system to adapt to varying levels of prediction confidence.
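A minimal sketch of this step with OpenCV (the threshold values and EPnP/RANSAC settings are illustrative choices, not the paper's exact configuration):

```python
import cv2
import numpy as np

def generate_hypotheses(coords, err, fg, K, origin, extent,
                        thresholds=(0.01, 0.02, 0.05)):
    """Solve PnP on correspondences whose predicted error is below each threshold.

    coords: (H, W, 3) predicted normalized model coordinates in [0, 1]
    err:    (H, W)    predicted per-pixel L1 coordinate error
    fg:     (H, W)    foreground confidence
    K:      (3, 3)    camera intrinsics
    origin, extent:   bounding-box minimum and extents used for normalization
    """
    ys, xs = np.mgrid[0:coords.shape[0], 0:coords.shape[1]]
    hypotheses = []
    for eps in thresholds:
        sel = (fg > 0.5) & (err < eps)
        if sel.sum() < 6:  # not enough correspondences for a stable PnP solution
            continue
        pts_2d = np.stack([xs[sel], ys[sel]], axis=-1).astype(np.float64)
        pts_3d = (coords[sel] * extent + origin).astype(np.float64)  # de-normalize
        ok, rvec, tvec, _ = cv2.solvePnPRansac(
            pts_3d, pts_2d, K, None, flags=cv2.SOLVEPNP_EPNP,
            reprojectionError=2.0, iterationsCount=100)
        if ok:
            hypotheses.append((cv2.Rodrigues(rvec)[0], tvec))
    return hypotheses
```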
Learned Region-based Pose Refinement: Each generated pose hypothesis is refined using an iterative process. The paper adapts a probabilistic region-based refinement method by integrating learned features, specifically the foreground confidence map ($\widehat{\bm{I}}_o$), alongside traditional color-based region statistics. This hybrid approach allows the refinement to work effectively even in space conditions where color alone may not be discriminative. The refinement optimizes the pose to best explain the observed segmentation (learned confidence + color histograms).
Hypothesis Selection: The probabilistic formulation of the refinement allows calculating a confidence score for each refined pose hypothesis. The hypothesis with the highest probability is selected as the final 6D pose estimate.
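A schematic of this scoring step (a simplified pixel-wise posterior mixing the learned foreground confidence with a color-based foreground probability; the equal mixing weights and the `render_silhouette` helper are assumptions):

```python
import numpy as np

def hypothesis_score(silhouette, fg_conf, p_color_fg):
    """Average pixel-wise posterior that a rendered silhouette explains the image.

    silhouette: (H, W) bool, object mask rendered under the pose hypothesis
                (a hypothetical render_silhouette(pose, model, K) would supply it)
    fg_conf:    (H, W) learned foreground confidence from EagerNet
    p_color_fg: (H, W) foreground probability from color histograms
    """
    # Mix learned and color-based cues (equal weights are an assumption).
    p_fg = 0.5 * fg_conf + 0.5 * p_color_fg
    # Pixels inside the silhouette should look like foreground, outside like background.
    posterior = np.where(silhouette, p_fg, 1.0 - p_fg)
    return float(posterior.mean())

# Selection: keep the refined hypothesis with the highest average posterior, e.g.
#   best = max(refined, key=lambda h: hypothesis_score(h.sil, fg_conf, p_color))
```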
Implementation Details and Considerations:
Training Data: The method relies on synthetic training data. A key practical aspect is generating the required annotations (normalized 3D coordinates, mask, region labels) using an approximate 3D model of the target object, as done for the SPEED+ dataset experiments. This demonstrates the method's ability to work without a perfectly accurate model.
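For instance, the dense normalized-coordinate targets can be produced by rendering box-normalized vertex positions as per-vertex "colors"; a minimal sketch of the normalization step (the renderer itself is not shown):

```python
import numpy as np

def normalize_vertices(vertices):
    """Map mesh vertices into [0, 1]^3 via the model's axis-aligned bounding box.

    Rendering these values as per-vertex colors yields the dense
    normalized-coordinate target images for training.
    """
    vmin, vmax = vertices.min(axis=0), vertices.max(axis=0)
    return (vertices - vmin) / (vmax - vmin)
```

The inverse of this mapping (via the box origin and extents) is what the PnP sketch above uses to recover metric model coordinates.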
Network Architecture: An asymmetric encoder-decoder structure is used, similar to state-of-the-art pose estimation networks. ConvNeXt-Base (Liu et al., 2022) pre-trained on ImageNet serves as the encoder, with a decoder specifically designed for pixel-wise predictions.
Loss Function: The total loss is a weighted sum of L1 loss for normalized coordinates, Binary Cross Entropy (BCE) for the object mask, bounded Mean Squared Error (MSE) for the error prediction, and Cross Entropy (CE) for surface regions.
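A sketch of such a composite loss in PyTorch (the weights and the error bound `e_max` are illustrative, not the paper's values):

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, w_xyz=1.0, w_mask=1.0, w_err=0.1, w_reg=1.0, e_max=0.5):
    coords, err, fg, regions = pred           # network outputs (cf. heads sketch above)
    gt_coords, gt_mask, gt_regions = target   # dense annotations rendered from the model
    m = gt_mask.float().unsqueeze(1)          # (B,1,H,W): restrict dense losses to object pixels

    # L1 on normalized model coordinates, inside the object mask only.
    l_xyz = (F.l1_loss(coords, gt_coords, reduction="none") * m).mean()

    # The error head regresses the actual (bounded) per-pixel L1 coordinate error.
    gt_err = (coords - gt_coords).abs().mean(dim=1, keepdim=True).detach()
    l_err = (F.mse_loss(err, gt_err.clamp(max=e_max), reduction="none") * m).mean()

    # BCE for the foreground mask, CE for surface-region labels (masked).
    l_mask = F.binary_cross_entropy(fg, m)
    l_reg = (F.cross_entropy(regions, gt_regions, reduction="none") * gt_mask.float()).mean()

    return w_xyz * l_xyz + w_err * l_err + w_mask * l_mask + w_reg * l_reg
```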
Sim2Real Gap: Bridging the gap between synthetic training data and real-world test data (especially challenging orbital images) is addressed through extensive data augmentation. The authors employ standard augmentations and introduce space-specific ones, such as synthetic specular reflections, which are shown to be highly effective.
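A minimal sketch of such an augmentation (this blurred elliptical highlight is an illustrative approximation of specular glare, not the paper's exact implementation):

```python
import cv2
import numpy as np

def add_specular_blob(img, rng=np.random):
    """Overlay a bright, blurred elliptical highlight on an HxWx3 uint8 image
    to mimic specular glare from reflective spacecraft surfaces."""
    h, w = img.shape[:2]
    overlay = np.zeros((h, w), np.float32)
    center = (int(rng.randint(0, w)), int(rng.randint(0, h)))
    axes = (int(rng.randint(w // 20, w // 4)), int(rng.randint(h // 20, h // 4)))
    cv2.ellipse(overlay, center, axes, rng.uniform(0, 180), 0, 360, 1.0, -1)
    overlay = cv2.GaussianBlur(overlay, (0, 0), sigmaX=rng.uniform(5, 25))
    out = img.astype(np.float32) + rng.uniform(0.5, 1.5) * 255.0 * overlay[..., None]
    return np.clip(out, 0, 255).astype(np.uint8)
```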
Multi-Hypothesis Thresholds: Rather than relying on a single static threshold, selecting the error threshold adaptively or generating multiple hypotheses across a range of thresholds is crucial for robustness, especially when combined with refinement and probabilistic selection.
Refinement Integration: Integrating the learned confidence $\widehat{\bm{I}}_o$ into the pixel-wise posterior of the probabilistic region-based refinement is a practical way to leverage the network's learned understanding of object segmentation for robust pose optimization.
Test-Time Augmentations/Ensembling: Rotating the input image by multiples of 90 degrees and averaging predictions or combining refined poses from different rotations is shown to significantly improve performance on the SPEED+ dataset, compensating for potential biases in synthetic training data orientation.
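A sketch of this scheme (the `predict_pose` wrapper is hypothetical, and the sign convention of the in-plane pose correction is an assumption that depends on the camera model):

```python
import numpy as np

def rot_z(angle):
    """Rotation matrix about the camera's optical (z) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose_with_rotation_tta(image, predict_pose):
    """Combine pose estimates over 90-degree input rotations.

    predict_pose(image) -> (R, t) is a hypothetical wrapper around the full
    EagerNet + PnP + refinement pipeline.
    """
    candidates = []
    for k in range(4):
        R, t = predict_pose(np.rot90(image, k))
        R_fix = rot_z(-k * np.pi / 2)  # undo the in-plane rotation (sign is an assumption)
        candidates.append((R_fix @ R, R_fix @ t))
    # Pick the candidate closest to all others (a chordal medoid over rotations).
    dists = [sum(np.linalg.norm(Ra - Rb) for Rb, _ in candidates)
             for Ra, _ in candidates]
    return candidates[int(np.argmin(dists))]
```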
Computational Requirements: Running multiple hypotheses and refinement steps increases computational cost compared to a single-shot approach. This is a trade-off for improved accuracy and robustness. Deployment on resource-constrained orbital platforms would require careful optimization or potentially pruning the number of hypotheses/refinement steps.
Object Detection Quality: The pipeline is sequential, starting with 2D detection. The quality of the initial bounding box directly impacts the cropped input to EagerNet and thus the final pose accuracy. Iterative bounding box refinement based on network outputs is proposed to mitigate this.
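A sketch of one such iteration, tightening the box around the predicted foreground mask (padding and threshold values are illustrative):

```python
import numpy as np

def refine_roi(fg_conf, roi, pad=0.1, thresh=0.5):
    """Tighten the RoI around the predicted foreground mask.

    fg_conf: (H, W) foreground confidence predicted inside the current crop
    roi:     (x0, y0, x1, y1) current RoI in full-image coordinates
    """
    x0, y0, x1, y1 = roi
    ys, xs = np.nonzero(fg_conf > thresh)
    if len(xs) == 0:
        return roi  # keep the old box if nothing is confidently foreground
    # Map crop coordinates back to full-image coordinates.
    sx, sy = (x1 - x0) / fg_conf.shape[1], (y1 - y0) / fg_conf.shape[0]
    bx0, bx1 = x0 + xs.min() * sx, x0 + (xs.max() + 1) * sx
    by0, by1 = y0 + ys.min() * sy, y0 + (ys.max() + 1) * sy
    # Pad the tight box so the object stays fully inside the next crop.
    px, py = pad * (bx1 - bx0), pad * (by1 - by0)
    return (bx0 - px, by0 - py, bx1 + px, by1 + py)
```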
Model Quality: The TUD-L experiments (Hodan et al., 2018) explicitly demonstrate that the error-aware prediction and adaptive thresholding make the method more resilient to inaccuracies in the 3D model used for training than approaches that do not explicitly model prediction errors.
Applications:
The primary application demonstrated is 6D pose estimation for uncooperative satellites, vital for on-orbit servicing, debris removal, and assembly. The robustness to harsh lighting, reflections, low signal-to-noise ratio, and approximate 3D models makes it particularly suitable for the space environment. The method's performance on the SPEED+ dataset, winning the post-mortem SPEC2021 competition, validates its effectiveness in this domain.
Practical Takeaways:
Explicitly modeling and predicting per-pixel correspondence errors is a valuable technique when dealing with imperfect 3D models or challenging image conditions.
Integrating learned features (like segmentation confidence) into classical pose refinement methods can significantly boost performance in difficult visual environments.
Extensive, domain-aware data augmentation is critical for bridging the Sim2Real gap in orbital robotics datasets.
Generating and refining multiple pose hypotheses, guided by prediction uncertainty, is a robust strategy when a single prediction might be unreliable.
Simple test-time techniques like input image rotation and ensembling can yield significant performance gains.
In summary, EagerNet provides a robust, deep learning-based solution for a critical task in orbital robotics, demonstrating state-of-the-art performance by effectively handling challenges posed by approximate 3D models and severe visual conditions through error-aware prediction, multi-hypothesis testing, and learned-feature-enhanced refinement.